Keying Module





Olivia Blum
Census Planning & Evaluation Division

Tel. 972-2-655 3303
FAX: 972-2-655 3531
e-mail: blum@cbs.gov.il




With the completion of the scanning stage, the system contains the images of the
questionnaires that have been scanned and the ASCII values of the fields that have
been optically identified. In theory, if the optical reader identified characters as
well as the human eye does, the keying stage could have been waived. But
the identification is incomplete for several reasons:
- Identification depends on the quality of the handwriting on the questionnaire
  and on the quality of the scanning.
- The writing on the questionnaire is free-form rather than on a pre-defined
  network of dots, so a character cannot be defined by a single-valued
  function.
- Alphabetical fields in Hebrew are not identified because of the high
  cost/benefit ratio.

In retrospect, the rate of erroneous identification by the optical reader used by the
data entry system was about 30%, so the keying stage, used as a support for the
optical identification, was not optional at all but a necessity.

The guiding principles of the keying stage (see also paper 3.1)
1. The value of a character is determined only when two identifying sources
agree by assigning it the same value.
2. The identifying sources are: OCR, an external file (the National Population
Register), and two rounds of keying, which are gradually integrated into the
system.
3. A keying item includes characters and fields from the questionnaires of the
entire Enumeration Area (EA).
4. The keying is performed at three levels: verification keying, correction
keying and full-field keying.
5. Verification keying and full-field keying are performed on homogeneous items:
a verification item is composed of one hundred characters (10 x 10) to which
the same value was assigned, and a full-field keying item is composed of
fields of the same variable, which come from different questionnaires.
6. Correction keying is of characters coming from different fields and
questionnaires, and is performed in triplets.
7. The level of keying depends on the level of reliability of the OCR
recognition and on the identified value being within the legitimate range.
8. Unsuccessful capture of a character in one level of keying transfers the
character (or the entire field) to a more intensive level of keying (from
verification to correction, and from correction keying to full-field keying).
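
Principles 1 and 2 can be illustrated with a short sketch. It is only an illustration under assumptions: the function name `resolve_character`, the source labels, and the handling of two competing pairs of sources are hypothetical and are not taken from the census system.

```python
from collections import Counter
from typing import Dict, Optional


def resolve_character(proposals: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the agreed value of one character, or None if undetermined.

    `proposals` maps a source name (for example "ocr", "register",
    "keying_1", "keying_2") to the value that source assigned, or to None
    when the source has not supplied a value yet.
    """
    counts = Counter(v for v in proposals.values() if v is not None)
    agreed = [value for value, n in counts.items() if n >= 2]
    # Exactly one value backed by at least two sources is accepted.
    # No agreement, or two competing pairs (an assumption of this sketch),
    # leaves the character undetermined.
    return agreed[0] if len(agreed) == 1 else None
```

A character for which `resolve_character` returns `None` would be the one passed on to the next, more intensive level of keying described in principle 8.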

The National Population Register file, which contains most of the variables from the
short form, serves as an external supporting file for the ODE system. Introducing it
immediately after the scanning stage enables verification of the field
values that have been assigned by the OCR and automatic determination of a field
value, in a batch procedure, without any human involvement. This saves keying of
about 60% (!) of the characters listed on the census questionnaires.
But since not all records are automatically linked with the identification values
suggested by the OCR, and since not all the variables on the census questionnaires
are found in the National Population Register, a supplementary action is required.
This action is keying from an image, performed in two rounds, as presented in the
table below:




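The Keying Module

+------------------+--------------------------+--------------------------------+-----------------------------------+
| Step             | I.D. #                   | Census variables with PR value | Census variables with no PR value |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Record Linkage 1 | - Search in the PR       | Compare values (OCR-PR)        |                                   |
|                  | - Record linkage         | and select value               |                                   |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| 1st Keying Round | - Unlocated ID# and not  | - Fields of unlinked records   | All fields (except for            |
|                  |   SuperSure              |   that do not have SuperSure   | alphabetical ones that are        |
|                  | - Record not linked and  |   status                       | not coded automatically)          |
|                  |   ID# not SuperSure      | - OCR ≠ PR value (in linked    |                                   |
|                  |                          |   records)                     |                                   |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Record Linkage 2 | - Smart search in the PR |                                |                                   |
|                  | - Record linkage         |                                |                                   |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Compare Values   | Insert PR ID# into the   | Compare values (OCR, PR,       | Compare values (OCR,              |
|                  | census record            | 1st keying) and select value   | 1st keying) and select value      |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Edit Checks      |                          | Checks within fields           | Checks within fields only         |
|                  |                          |                                | if OCR ≠ 1st keying               |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| 2nd Keying Round | - Unlocated ID#          | - Values out of legitimate     | - Values out of legitimate        |
|                  | - Top ID ≠ bottom ID#    |   range                        |   range                           |
|                  |   (these are census IDs) | - Suggested value has no       | - OCR ≠ 1st keying and no         |
|                  |                          |   verification and no          |   SuperSure status                |
|                  |                          |   SuperSure status             |                                   |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Record Linkage 3 | - Smart search           |                                |                                   |
|                  | - Record linkage         |                                |                                   |
+------------------+--------------------------+--------------------------------+-----------------------------------+
| Select Value     | - Insert PR ID# into the | Select verified value (OCR,    | Select a verified value (OCR,     |
|                  |   census record          | PR, 1st keying, 2nd keying)    | 1st keying, 2nd keying)           |
+------------------+--------------------------+--------------------------------+-----------------------------------+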

Linking individual census records with external file records


The National Population Register is integrated within the ODE as an external support
throughout the entire data capture process. During the keying stage, this saves
human intervention in entering field values and enables definition of the individual
census records; during the editing stage, supplementary actions are carried out to
link the records to the Register.
Israel's situation is special, in that a conventional census is conducted, but various
census-related processes are aided by administrative records that can also be used
as a partial or full alternative to the census. Therefore, as in cases where the
entire census is conducted from administrative records, both the ability to link the
records and the way the linkage is done are very important.
Each individual in the census and in the Register has the same, single-value
identifying variable: the identification number. But since linkage according to
the identification number alone is not sufficiently reliable, the linkage is divided
into two parts:

1. location of the identification number in the National Population Register file;
2. linking records using rigid criteria.

The identification number is a nine-digit number; in most cases the first digit
is a zero, and in all cases the last digit is a control digit. The search for the
identification number uses these characteristics: the more reliable the captured
value becomes as the field is handled, the more the number is manipulated during
the search.
In total, three attempts are automatically made to link records to the Register.
- The first attempt is made immediately after receiving suggestions for
  identification from the optical reader. Since the rate of error in
  identification is not low, the search for the identification number is simple:
  the system looks for a full match of nine digits, and when fewer than nine
  digits are written, it supplements the number by adding leading zeroes. It
  should be noted that in 70% of the individual records, the identification
  number is preprinted on an adhesive label. Almost 100% of these records
  are located in the Register at the first attempt.
- A second attempt is made after the first keying round. Since there was
  human involvement, entry of the number is perceived as more reliable and
  therefore, the system checks whether the last digit is appropriate as the
  control digit. If it is not, the system calculates it (for numbers with fewer
  than nine digits), adds it as the last digit and supplements the identification
  number up to nine digits by adding leading zeroes.
- A third attempt is made following the second round of keying. Now, the
  number that was entered is considered to be reliable and therefore,
  manipulation is more complex. Since many people in Israel became
  familiar with the control digit in their identification numbers only many
  years after receiving them, there is a tendency to err with the control digit
  more than with any of the other digits. Therefore, when attempting to find the
  identification number for the third time, the system drops the last digit and
  recalculates a control digit.
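
The three search attempts can be sketched as follows. The paper does not state how the control digit is calculated; the sketch assumes the Luhn-style check that is used for Israeli identity numbers (weights 1, 2, 1, 2, ... from the left over the first eight digits). All function names are hypothetical, and the input is assumed to be a string of one to nine digits.

```python
def control_digit(body: str) -> int:
    """Control digit for the first eight digits (assumed Luhn-style rule)."""
    body = body.zfill(8)
    total = 0
    for position, char in enumerate(body):
        product = int(char) * (1 if position % 2 == 0 else 2)
        # A two-digit product contributes the sum of its digits.
        total += product if product < 10 else product - 9
    return (10 - total % 10) % 10


def has_valid_control_digit(number: str) -> bool:
    padded = number.zfill(9)
    return control_digit(padded[:8]) == int(padded[8])


def candidate_first_attempt(raw: str) -> str:
    # After OCR: exact search, only padding with leading zeroes.
    return raw.zfill(9)


def candidate_second_attempt(raw: str) -> str:
    # After the first keying round: if the last digit is not a valid control
    # digit and fewer than nine digits were written, treat the whole entry
    # as the body, append a calculated control digit, then pad.
    if has_valid_control_digit(raw) or len(raw) >= 9:
        return raw.zfill(9)
    return (raw + str(control_digit(raw))).zfill(9)


def candidate_third_attempt(raw: str) -> str:
    # After the second keying round: drop the last digit and recalculate it.
    body = raw[:-1]
    return (body + str(control_digit(body))).zfill(9)
```

For example, under these assumptions an entry of `12345678` is searched as `012345678` at the first attempt and as `123456782` at the second.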

Once the identification number has been found, linkage with the Population Register
is performed on the basis of criteria which remain unchanged throughout the three
attempts. Each criterion is a profile composed of four variables that are found on
the short form and in the Register. Each criterion contains the identification
number plus another three variables from among the following:
- full date of birth (year, month and day)
- partial date of birth (year and month, or year and day)
- year of immigration to Israel
- country of birth other than Israel
- family status that is divorced or widowed
The variables which are included in the census record and are not used for
linkage purposes are those dealing with relation to the reference person and
parents' country of birth.
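
One possible reading of these criteria is sketched below. It is an interpretation rather than the system's actual rule: it assumes that full and partial date of birth count as a single supporting item, that missing values never match, and that a link needs the identification number plus any three supporting items. The `Person` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    id_number: str
    birth_year: Optional[int] = None
    birth_month: Optional[int] = None
    birth_day: Optional[int] = None
    immigration_year: Optional[int] = None
    birth_country: Optional[str] = None
    family_status: Optional[str] = None


def _same(a, b) -> bool:
    # A missing value never supports a link.
    return a is not None and a == b


def supporting_matches(census: Person, register: Person) -> List[str]:
    found = []
    year = _same(census.birth_year, register.birth_year)
    month = _same(census.birth_month, register.birth_month)
    day = _same(census.birth_day, register.birth_day)
    if year and month and day:
        found.append("full date of birth")
    elif year and (month or day):
        found.append("partial date of birth")
    if _same(census.immigration_year, register.immigration_year):
        found.append("year of immigration")
    if census.birth_country != "Israel" and _same(census.birth_country, register.birth_country):
        found.append("country of birth other than Israel")
    if census.family_status in ("divorced", "widowed") and _same(
        census.family_status, register.family_status
    ):
        found.append("family status")
    return found


def is_linked(census: Person, register: Person, required: int = 3) -> bool:
    same_id = census.id_number == register.id_number
    return same_id and len(supporting_matches(census, register)) >= required
```

Under this reading, a record with only a year of birth, a birth country of Israel and a status of single yields no supporting items, so it cannot be linked automatically even when the identification number is found.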
At the end of the keying stage, about 80% of the individual census records were
automatically linked to the National Population Register. The remaining 20% are
composed of records that were identified, but whose characteristics did not enable
automatic linkage (for example, a single person born in Israel who recorded only
the year of birth), records that are not listed in the Register (tourists and
foreigners who have been in Israel for over a year), records which have no
identification number (dwellings of those who refused to participate and closed
dwellings), and records where the identification number was garbled. During the
editing stage, where queries to the National Population Register are interactive
and can include names, about another 15% are linked.

The first keying round

Fields and characters are sent for a first round of keying only if the value
assigned to them by the OCR is not supported by the external file.
The first keying round is conducted at three levels of keying, according to the
identification status assigned by the optical reader:
- Super-sure identification status does not refer fields for keying. This
  status is assigned to fields and characters whose values have been
  verified by an additional variable in the questionnaire or an additional
  identification source.
- Sure identification status refers individual characters for verification
  keying in homogeneous "carpets", according to the character (one hundred
  characters whose assigned identification is 0, then those whose assigned
  identification is 1, and so on).
- Doubtful identification status refers characters from different fields for
  corrective keying in strings of three characters.
- Fail identification status refers fields for full-field keying in homogeneous
  screens, according to the variable (keying item of year of birth, keying item
  of day of the month, etc.). In addition, full-field keying is also reserved
  for all alphabetic fields which undergo automatic coding (addresses, country
  of birth, relation to the reference person in the household).
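
The routing described above can be sketched as a small batching routine. This is a simplified illustration: the `Char` record, the status labels and the function names are assumptions, and the real system builds its items per Enumeration Area, which the sketch leaves out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Char(NamedTuple):
    char_id: str
    field_id: str
    variable: str   # e.g. "year_of_birth"
    value: str      # value suggested by the OCR
    status: str     # "super_sure", "sure", "doubtful" or "fail"


def chunk(items: List, size: int) -> List[List]:
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_keying_items(chars: Iterable[Char]) -> Dict[str, object]:
    sure_by_value = defaultdict(list)
    doubtful = []
    failed_fields = defaultdict(set)
    for c in chars:
        if c.status == "sure":
            sure_by_value[c.value].append(c)
        elif c.status == "doubtful":
            doubtful.append(c)
        elif c.status == "fail":
            failed_fields[c.variable].add(c.field_id)
        # "super_sure" characters are not sent for keying at all.

    # Verification "carpets": up to 100 characters sharing one assigned value.
    carpets = []
    for value in sorted(sure_by_value):
        carpets.extend(chunk(sure_by_value[value], 100))

    return {
        "verification": carpets,
        # Correction keying: strings of three characters from mixed fields.
        "correction": chunk(doubtful, 3),
        # Full-field keying: whole fields, grouped by variable.
        "full_field": {v: sorted(ids) for v, ids in failed_fields.items()},
    }
```

Grouping by assigned value is what makes a verification carpet homogeneous: the operator only has to spot the characters that do not look like the expected value.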

Second keying round

Referral to the second round of keying depends on disagreement between two of
the three previous identification sources (OCR, Register, first keying round). This
occurrence is relatively rare (less than 2%), and characterizes fields from
questionnaires on which the writing is particularly faint or which were not
scanned sensitively enough. A second round of keying is also reserved for fields
which, based on an examination of their values, fall outside the legitimate range (for
example, year of birth 1790, etc.).

The second round of keying does not consider the identification level of the OCR,
and it is performed at two levels of keying: correction and full-field keying. For
this round, too, the basic unit is the character, so that all comparative tests between
identification sources are at the character level. This characteristic contributes to
the reduction in the rate of keying, since only those characters that are not agreed
upon are sent for keying, rather than whole fields.
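
A minimal sketch of such a character-level comparison follows. It assumes that the sources hold the same field as strings aligned by character position; the function name `characters_to_key` is hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def characters_to_key(sources: Sequence[Optional[str]]) -> List[int]:
    """Return the positions in a field on which no two sources agree.

    `sources` holds the field value from each identification source
    (for example OCR, Register, first keying round), or None when a
    source has no value for this field.
    """
    available = [s for s in sources if s is not None]
    length = max((len(s) for s in available), default=0)
    positions = []
    for i in range(length):
        counts = Counter(s[i] for s in available if i < len(s))
        if not counts or max(counts.values()) < 2:
            positions.append(i)
    return positions
```

With an OCR value of `1965` and a first-keying value of `1985` and no Register value, only the third character (position 2) would be sent for keying; the other three characters are already agreed upon.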

At the end of the keying process, there are still keying tasks which need to be
completed during the editing stage, in spite of the controls embedded in the
procedure and despite the corrections: fields that could not be positively identified
within partial images, and characters and fields which, although they have been
assigned 3-4 identification values, still do not have two sources that have assigned
the same value. However, the rate of such cases is minimal, and they can be resolved
within the system through an additional round of keying, from the image of
the full questionnaire page or at least from a portion of the entire question (not just
the box that was filled in).
The keying stage is a good example of simultaneous processes which are not
homogeneous. The central component which aids this is the Register file:
- The Register supports the OCR values, without considering the identification
  level (including doubtful identification);
- The Register is a partial alternative to keying;
- Record linkage with the Register enables performance of one of the main
  editing tasks: defining the individual record;
- Inserting the identification number from the Register into the census record
  facilitates future linkage of records to any administrative file which bears
  the same single-value identification number.

The end of the keying stage is only half way to the declared objective of the data
capture process: getting a structured raw file. Creating a structured file is
completed in the editing stage.



Copyright © 1997-1999 The State of Israel. All rights reserved.