Logical Structure and Guiding Principles




Olivia Blum
Census Planning & Evaluation Division

Tel. 972-2-655 3303
FAX: 972-2- 655 3531
e-mail: blum@cbs.gov.il


Planning for a system to capture data from census questionnaires begins with
defining objectives and means.

The two main objectives of the process of data capture from census
questionnaires are:
obtaining a census file that represents the respondents' answers, as
accurately as possible;
maintaining the ability to identify physical units (questionnaire, enumeration
area) and analytical units (individual, household), in the computerized file.

The secondary objectives derive from the need to create an accurate census file
by means of a process that is not only economically efficient but also rational.
The central consideration guiding these objectives is optimizing the division
of activity (correcting mistakes in values found in questionnaires, filling
in missing values, and coding) between the data capture stage and data processing
in the central computer.
The tasks to carry out at the data capture stage are:
editing of data within records (micro editing), only if the tools available to
the editor at the data capture stage are significantly better than those
available to the editor at the processing stage.
automatic coding of variables for which categories are well-defined.
computer-assisted complementary coding.
embedding processes of corroboration and control into the data capture
process in order to verify accurate data capture.

The new technology available at the starting point of planning for the data
capture process for the 1995 Israeli Census of Population and Housing was the
Windows environment and improved Optical Character Recognition (OCR)
technology.
The Windows technology enabled simultaneous viewing of data from a number of
sources, external and internal to the system, and of different types, ASCII files
and files of scanned images. Windows technology, along with powerful computers,
enabled easier and more convenient access to external files such as the
National Population Register (NPR), and to data tables such as the various coding
dictionaries and process-information tables.
Improvements in the ability of OCR to identify free-form handwriting pointed to this
technology as a useful method for the data capture process.

Objectives and means, negotiated under resource constraints, were
effectively brought together in the optical data capture system described in this
paper.

Current vs. Previous Data Capture Process


The conventional process of entering census data involves manual preparation of
questionnaires in pre-defined units, correcting values, coding the texts on the
questionnaire pages, and keying them into the computer. These tasks are
essentially no different with the optical system, but the technological improvements
enable a number of essential changes in the process:
switching tasks from manual performance to computerized performance;
transferring computerized tasks from the central computer to the data
capture process, in those cases where such switching is functional, both
for the data capture process and for the preparation of the final census
file;
transferring manual tasks from the data capture process to the central
computer, shifting from micro-editing to macro-editing;
changing the sequence of activities so that keying precedes editing and
coding and the paper questionnaires are discarded earlier in the
process;
modification of actions (adaptation to improved technological capability).


Changes in the essence and sequence of the steps affect the quality and type of both
census files (first and final). The first file is the product of the data capture process,
and in the improved technological environment becomes a raw data file per se.
It reflects the respondents' answers even if they are illogical (date of first
marriage preceding date of birth) or not unique (both Romania and Hungary marked
as country of birth).
This raw file aims to maintain the structural units as they were originally defined
(Enumeration Area - EA), as they were received from the field (individual and
household records), as they were originally planned, if they were confirmed in the
field (residential building identified by address), and those defined as physical units
for data capture (separate questionnaire sheets and all pages of a questionnaire).
Having a real raw data file, to which we can return and from which we can
reconstruct editing actions (correction / completion / imputation), enables us to
create a high quality census file (end-file).

Aiming for a raw data file that represents the respondents' answers benefits
the data capture process in several respects:
lessening of the subjective component in data editing, and thereby
preventing over-editing by defining guiding principles for comprehensive,
simultaneous treatment of problems to be carried out automatically in the
central computer;
isolation of the data capture process from the argument regarding the
nature of the file that is given to the users. This is not a value judgment
about whether raw data or only an edited data file should be supplied to
sophisticated users; rather, it is about creating a file which will
enable, for the first time, selection of one of the options;
increase in the speed at which editing tasks are performed;
decrease in the staff and management personnel resources required to
carry out the process;
facilitating assimilation of control processes:
work control - supervising the quality of the interim products, via control
procedures structured into the data capture process; and
staff control - supervising the quality of the workers, via an automatic
generation of continuous statistical information.


Preventing over-editing means adopting an alternative definition of the term "data
editing" as it is carried out throughout the data capture process. Errors are not
corrected, even when they appear trivial; all of the respondents' answers, with all
the logical and factual errors they contain, are captured as they are. Editing tasks
serve mainly to verify precise data capture. This alternative notion of editing
within the data capture process shifts traditional editing tasks to the central
computer, to be carried out in macro operations, and therefore the ability to
recreate the raw file and the interim files is improved significantly. Sweeping and
uniform actions can be more readily canceled and recreated than micro-editing
actions, because in micro-editing, in spite of the general guidelines, there is also
individual, subjective judgment involved in handling each and every record.
The ability to recreate files implies an ability to create several files that have
undergone different editing processes. This attribute enables the evaluation of
editing processes using comparative methods, and designated editing of the raw file
according to various needs, both internal and external. It contributes not only to the
current census file, but also to the decision-making process regarding
the manner in which large data files should be edited in the future.

Preservation of structural units during the process is expressed through
operations designed to verify:
exhaustiveness - the integrity of each unit (questionnaire, EA) by including
all the relevant components;
exclusiveness - inclusion of components relevant to one unit only, without
any non-relevant components;
uniqueness - verification of non-duplication.
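
The three checks can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each scanned questionnaire is a (questionnaire id, EA id) pair and the expected questionnaire ids of an EA are known in advance; the function name and data layout are illustrative and are not taken from the system itself.

```python
from collections import Counter


def check_structural_unit(ea_id, expected_ids, scanned):
    """Check one EA for exhaustiveness, exclusiveness and uniqueness.

    expected_ids: questionnaire ids expected for this EA (assumed known).
    scanned:      list of (questionnaire_id, ea_id) pairs found in the batch.
    """
    scanned_ids = [qid for qid, _ in scanned]
    counts = Counter(scanned_ids)
    return {
        # exhaustiveness: every expected component is present
        "missing": sorted(set(expected_ids) - set(scanned_ids)),
        # exclusiveness: no component that belongs to another unit
        "foreign": sorted({qid for qid, ea in scanned if ea != ea_id}),
        # uniqueness: no component appears more than once
        "duplicated": sorted(qid for qid, n in counts.items() if n > 1),
    }


report = check_structural_unit(
    ea_id="EA-17",
    expected_ids=["Q1", "Q2", "Q3"],
    scanned=[("Q1", "EA-17"), ("Q2", "EA-17"), ("Q2", "EA-17"), ("Q9", "EA-18")],
)
print(report)
```

An empty list under all three keys would mean that the unit passed the checks and needs no human handling.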

These requirements mean performance, in as automatic a manner as possible, of
editing activities which define the process units and the census analysis units.
Automation of the definition of the structural units is based on prior planning of
all the variables on the questionnaire which will enable automatic definitions, and on
linking records with auxiliary files which serve as an external backup source.
The need for human involvement arises when there are enumerator errors or
incorrect printing on questionnaires, and when there is an identification failure by
the optical reader. Most problems of identification of structural units are
system-related problems, in actions that substitute for the manual handling of the
questionnaires.

Maximum automation of tasks in the data capture process was also implemented in
work and staff control. Automated work control means verifying that the result of
any action carried out during the data capture process is corroborated by at least
two sources, and that it does not logically contradict other results (variable values
or structural units). Automated staff control means continuous production of
statistical reports which enable both managers and employees to see that the
problems they were working on have been solved and that new problems have not
arisen as a result of their handling. These statistical reports are based on
attaching identified work packets, such as an enumeration area, to an identified person
who has a specified function in the data capture process.

The Logical Structure


The outcome of the objectives and principles outlined above is an optical data
capture system with a modular structure, but whose stages are not homogeneous
task-wise. This is a system whose main process includes operations which are similar
in essence, but which are carried out in a different order than before, while the
sub-processes involve an internal and external support system, both within and
between the stages.

There are three process units in this system:
"Enumeration Area" (EA) is the working unit of the production line. All questionnaires
of the EA are scanned and transferred from one stage to the next together. A file
is sent to the central computer and to the archive in EA units.
"Item" is the working unit of a module that has a human-machine interface. A keying item
comprises characters from the whole EA, whereas an editing item or a coding item
comprises problems detected in one household.
"Editing problem" is the working unit within an editing item. If a household contains
logically contradictory values in many fields, several editing problems are detected.
However, since all the problems are in the same household record, only one editing item
is created.
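
The relationship between editing problems and editing items amounts to a grouping step, which may be sketched as follows. The record layout (household id, field name, description) and the class names are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EditingProblem:
    household_id: str
    field_name: str
    description: str


@dataclass
class EditingItem:
    household_id: str
    problems: list = field(default_factory=list)


def build_editing_items(problems):
    """Group editing problems so that each household yields one editing item."""
    by_household = defaultdict(list)
    for problem in problems:
        by_household[problem.household_id].append(problem)
    return [EditingItem(hh, ps) for hh, ps in by_household.items()]


problems = [
    EditingProblem("H1", "first_marriage_year", "precedes birth year"),
    EditingProblem("H1", "country_of_birth", "two countries marked"),
    EditingProblem("H2", "age", "out of range"),
]
items = build_editing_items(problems)
for item in items:
    print(item.household_id, len(item.problems))
```

Here household H1 has two editing problems but produces a single editing item, as described above.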

The sequence of steps in the current data capture process is: scanning
questionnaires and identifying the values written in their fields, smart keying from
images, micro-editing and coding, and preparing (and sending) files to the central
computer and to an optical archive.
In the conventional path, shredding the paper questionnaires was possible only at the
end of the process, after the final census file was produced. The shift to an
optical reading system and the creation of a retrievable optical archive enable us to
advance this operation to the beginning of the data capture process, immediately
after scanning. Having questionnaire images instead of paper questionnaires makes
the retrieval environment more user-friendly and shrinks the storage space needed
for the questionnaires (from a huge warehouse to about 80 CD-ROMs).
Keying becomes smart keying, selective and from images, rather than full keying
from the paper questionnaires. Keying precedes editing and coding, meaning that we
first make sure that the data capture (optical recognition and keying) is accurate
and only then start editing.
Many micro-editing tasks are postponed to the macro-editing stage, while record
linkage with external files (the NPR in this case), which was previously done only in
the central computer in the preparation of the final census file, is integrated into
the data capture process.
A different sequence of steps, as well as a redefinition of their essence, brings
about a different type of data capture file.

Sub-processes of internal support are embedded in the process and are
expressed in the mutual dependency of adjacent stages. In every single stage tasks
of other stages are performed. For example:
The editing task of defining an individual record is already included in the
keying stage, while corrections and finalization of the keying stage are
performed during editing.
Regular editing tasks (within a household) are supported by senior editing,
both voluntarily, by actively sending editing items to a senior editor, and
involuntarily, through the automatic creation of senior editing items in cases
of failed handling or when the editor suggests that duplicate or erased
records be voided.
During the coding stage a supporting sub-process is the referral of
unsolved problems to subject matter experts.

The external support system relies on external files (NPR and coding tables), and
administrative forms and files (Enumeration Area leading form and the organizational
file of all Enumeration Areas in the census). The data capture process refers to
this support system from its very beginning and throughout its course.

Guiding Principles in Each Data Capture Stage


Scanning Module

The main tasks performed during this stage are:
1. completeness and exclusiveness checks, that is,
verification of scanning of both sides of each page and scanning of the
expected number of questionnaires (according to the EA leading form), and
ensuring that each area is scanned only once (supervision via the
organizational file).
2. optical mark recognition (OMR) and optical character recognition (OCR),
which assign a value to each field that has been defined as a target for
optical identification. Each value has a status which characterizes the level
of reliability of the optical recognition. These statuses (Super Sure, Sure,
Doubtful and Fail) dictate the treatment of the character or the field
during the follow-up stage of the data capture process.
3. efficient use of computer resources:
saving the image of each form (questionnaire) only once, which enables the
subtraction of the fixed layout of the questionnaires as soon as they are
scanned,
compressing the information received in the optical recognition stage, before
inserting it into the database.

Keying Module

The main characteristics of this stage are:
The guiding principle for determining a value of a field as a value which
has been correctly entered, is that two identification sources indicate the
same value.
The sources of identification are internal (optical reader, 1st keying
round and 2nd keying round) and external (National Population Register).
The sequence for entering the identification sources is: OCR, NPR, 1st
keying, NPR, 2nd keying, NPR.
The progress of the data capture process during the keying stage is
expressed through gradual addition of sources of identification and the
search for preliminary equivalence between two out of all the sources
that were involved in the process up to that point.
Since the first two sources of identification are the OCR and the
National Population Register, in successful cases, where the linkage of
census records to the NPR succeeded, all the values in the
fields common to both the census and the Register are compared, and all
cases of equivalence terminate the treatment of the relevant
fields, regardless of the level of reliability of the optical identification.
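
The two-source rule can be sketched as a function that accepts a value as soon as any two of the sources added so far agree. This is a simplified illustration: the function name and argument names are hypothetical, and the repeated NPR comparison after each keying round is modeled simply by keeping the NPR value among the sources.

```python
def resolve_field(ocr, npr=None, key1=None, key2=None):
    """Return (value, agreeing_sources) once two sources agree, else (None, ()).

    Sources are considered in the order OCR, NPR, 1st keying, 2nd keying.
    A source that is None has not been supplied (for example, no NPR
    linkage, or a keying round that has not taken place yet).
    """
    seen = []  # (source_name, value) pairs added so far
    for name, value in (("ocr", ocr), ("npr", npr), ("key1", key1), ("key2", key2)):
        if value is None:
            continue
        for prev_name, prev_value in seen:
            if prev_value == value:
                return value, (prev_name, name)
        seen.append((name, value))
    return None, ()


# OCR and NPR agree: treatment of the field ends without any keying.
print(resolve_field("1962", npr="1962"))
# OCR misread the value; the first keying round agrees with the NPR.
print(resolve_field("1967", npr="1962", key1="1962"))
# No linkage and one keying round that disagrees with the OCR: still open.
print(resolve_field("1967", key1="1962"))
```

In this sketch the reliability status of the OCR value plays no part once a second source agrees, which mirrors the statement that equivalence ends treatment regardless of the level of reliability.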

The importance of this part of the process lies in the tremendous saving
of human involvement in the keying stage. All the linkage and comparison
actions are performed automatically; also, fields which were identified with
low levels of reliability but which were supported by the NPR are not
returned for additional handling.
Linkage to the Population Register during the keying stage, in addition to
its being a factor which enables external support for the data capture,
serves to "support" or define the existence of an individual unit; a
record which is linked to the Register is a record for an existing
person, together with all his census characteristics. This is actually an
editing task whose performance has been moved up since it fulfills the
objectives of both stages, keying and editing.
The keying itself is not homogeneous. There are three types (levels) of
keying: verification keying in "carpets", correction keying in strings of
three characters, and full-field keying.
The level of keying is determined by the level of reliability of OCR
recognition, by the value captured (whether it falls within a legitimate
range or not), and by the corroborating auxiliary information.
Since the keying items have no direct connection to any physical unit
smaller than an EA, and since several keying operators are working
simultaneously on the same area, the keying control processes are
structured at the work station of each keying operator, rather than in the
server.
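
The choice of keying level depends on three inputs: the OCR reliability status, whether the value is within range, and whether auxiliary information corroborates it. The exact decision rules are not given here, so the mapping below is purely hypothetical; only the four status names and the three keying levels come from the text.

```python
from enum import Enum


class Status(Enum):
    """OCR reliability statuses named in the scanning module."""
    SUPER_SURE = 3
    SURE = 2
    DOUBTFUL = 1
    FAIL = 0


def keying_level(status, in_range, corroborated):
    """Pick a keying level for one field (illustrative thresholds only)."""
    if corroborated:
        return None              # assumed: a second source agrees, no keying
    if status is Status.FAIL or not in_range:
        return "full_field"      # assumed: unreadable or illegitimate value
    if status is Status.DOUBTFUL:
        return "correction"      # assumed: strings of three characters
    return "verification"        # assumed: verification in "carpets"


print(keying_level(Status.DOUBTFUL, in_range=True, corroborated=False))
print(keying_level(Status.SURE, in_range=False, corroborated=False))
```

Whatever the real thresholds were, the design point is the same: the more reliable and better corroborated a value is, the less keying effort it receives.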

Editing and Coding Module

There are automatic preparatory operations for editing, carried out before, during
and after the keying stage:
Automatic coding of Country of Birth and Relation to Reference Person;
Detecting written response in open categories ("other");
Automatic definition of structural units;
Verification that values are within range;
Edit checks between fields (consistency checks).

A failure in a preparatory task creates an editing problem (the smallest process
unit in the ODE system). An editing item includes all editing problems found in one
household.
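
Two of the preparatory operations, range verification and consistency checks, can be sketched as functions that flag problems without changing any value. The field names and range limits are assumptions for illustration; the consistency rule reuses the example of a first marriage that precedes birth.

```python
def range_problems(record, ranges):
    """Return names of fields whose value is present but outside its range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if record.get(name) is not None and not (low <= record[name] <= high)
    ]


def consistency_problems(record):
    """Return descriptions of contradictions between fields of one record."""
    problems = []
    birth = record.get("birth_year")
    marriage = record.get("first_marriage_year")
    if birth is not None and marriage is not None and marriage < birth:
        problems.append("first marriage precedes birth")
    return problems


# Assumed limits, for illustration only.
RANGES = {"birth_year": (1880, 1995), "first_marriage_year": (1890, 1995)}

record = {"birth_year": 1960, "first_marriage_year": 1950}
print(range_problems(record, RANGES))
print(consistency_problems(record))
```

Both values in the example are within range, yet the record still fails the consistency check; the captured values are left as written and only an editing problem is raised.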

The editing and coding stage has a number of unique characteristics:
1. The objective of receiving a structured raw data file dictates four main
editing tasks:
definition of structural units;
modification and completion of the keying stage in those cases where a
final value has not been defined, and in cases where the keying operators
were unable to identify what was written;
confirmation or modification of captured values in fields which were found
to have logical contradictions, while keeping to the values written on the
questionnaire;
coding open categories (where the answer given was "other").
2. Editing tasks are differentially assigned to editors and senior editors:
Editors get editing items that include problems within a household, while
senior editors receive problems between households (involving pages of
more than one questionnaire).
A household which has been referred to a senior editor for handling will
not receive additional handling from a non-senior editor, even if the editing
tasks required have been defined as those within the responsibility of a
regular editor. Similarly, a household will not be sent to more than one
editor at the same time.
Completeness checks along with range checks are performed at the
editor's working station (PC), while more profound edit checks are
performed in the server. A failure in a batch check in the server may
create a new problem. A failure in an interactive check in the PC brings
back an untreated old problem. Since editing cannot be supported by a
following stage (it is the last one that has a human-machine interface),
control checks are activated within the stage. An editor may receive the same
household from the server for a second round of editing, while a senior
editor may receive an editing item (with similar or different problems) up to
three times.
3. The editing work environment enables:
simultaneous viewing of the image of the full page of the questionnaire,
images of all pages from the household being handled at that time as well
as images of any household in the enumeration area;
simultaneous viewing of what is written on the questionnaire, as recorded
by the enumerator and the respondents, with the ASCII values of those
same fields as they are found in the database at that time;
modification, confirmation or completion of fields of households which are
being handled;
separating and joining pages and questionnaires;
assigning a "canceled" status to individual records and to household
records;
interactive access to an external file, coding dictionaries and
process-information tables;
accessibility, usually one-way, to virtual boxes, through which problems can
be sent to experts, editing items can be sent to a waiting box to be treated
later, and entire households can be sent to a transfer box to be later allocated to
their appropriate enumeration area.
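
The assignment rules for editors and senior editors can be summarized in a small routing sketch. The data layout and function names are hypothetical, and what happens once a round limit is reached is not stated in the text; returning None in that case is an assumption.

```python
MAX_ROUNDS = {"editor": 2, "senior": 3}  # round limits described in the text


def route(item, state):
    """Decide who receives an editing item, or None if it cannot be sent now.

    item:  dict with 'household' and 'scope' ('within' or 'between').
    state: dict mapping household -> routing state, updated in place.
    """
    hh = state.setdefault(
        item["household"],
        {"senior": False, "busy": False, "rounds": {"editor": 0, "senior": 0}},
    )
    if hh["busy"]:
        return None                  # never with more than one editor at once
    if item["scope"] == "between":
        hh["senior"] = True          # problems between households go to senior
    role = "senior" if hh["senior"] else "editor"
    if hh["rounds"][role] >= MAX_ROUNDS[role]:
        return None                  # assumed behavior at the round limit
    hh["rounds"][role] += 1
    hh["busy"] = True
    return role


def release(household, state):
    """Mark a household as no longer being handled by anyone."""
    state[household]["busy"] = False


state = {}
print(route({"household": "H1", "scope": "within"}, state))   # regular editor
release("H1", state)
print(route({"household": "H1", "scope": "between"}, state))  # senior editor
release("H1", state)
print(route({"household": "H1", "scope": "within"}, state))   # stays senior
```

Once a household has been referred to a senior editor, later items for it are routed to the senior editor even when the problem itself lies within the household.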

Special Coding


Special coding is not performed on writing detected in an open category in a closed
question ("other") but rather on subjects asked in open questions and answered in
alphabetic characters, namely geocoding of addresses and economic coding of economic
branch and occupation.

Geocoding
The questionnaire has three address fields: the residential address, which is
received with a numeric code from the field, the address 5 years ago, and the address
of the place of employment, all of which undergo keying and automatic coding.
A coding failure signifies the inability to find suitable values in the dictionary or
the receipt of only a partial code. A logical failure in coding signifies that the
lowest address unit is not included in the next highest unit, i.e. the code for the
street that was entered does not exist in the specified locality, or the
particular street in the locality has no house number such as the one which was
entered. Each of these failures creates a coding item which is sent to the
geocoder.
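
The distinction between a coding failure and a logical failure can be sketched as a hierarchical lookup. The nested dictionary layout and the numeric codes are assumptions made for illustration; the real coding dictionaries are not described in that detail.

```python
def geocode_status(address, dictionary):
    """Classify a coded address as 'ok', 'coding_failure' or 'logical_failure'.

    dictionary: {locality_code: {street_code: set of house numbers}} (assumed).
    address:    dict with 'locality', 'street' and 'house'; None = not coded.
    """
    locality = address.get("locality")
    street = address.get("street")
    house = address.get("house")
    if locality is None or street is None or house is None:
        return "coding_failure"      # no code, or only a partial code
    if locality not in dictionary:
        return "coding_failure"      # no suitable value in the dictionary
    if street not in dictionary[locality]:
        return "logical_failure"     # street does not exist in this locality
    if house not in dictionary[locality][street]:
        return "logical_failure"     # no such house number on this street
    return "ok"


DICTIONARY = {3000: {101: {1, 2, 3}}}  # made-up codes

print(geocode_status({"locality": 3000, "street": 101, "house": 2}, DICTIONARY))
print(geocode_status({"locality": 3000, "street": 999, "house": 2}, DICTIONARY))
print(geocode_status({"locality": 3000, "street": None, "house": 2}, DICTIONARY))
```

Any result other than "ok" would correspond to a coding item being created for the geocoder.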

Economic Coding
Coding items for the economic branch coder and the occupation coder are created
when the optical character reader identifies written text in the relevant question
fields. The OCR was not developed to recognize Hebrew and Arabic characters,
but the existence of handwritten text is a trigger for creating an economic coding
item.
The coding process at the various coding stations is computer assisted. The coder
sees the image of the questionnaire page on which the text is written, defines a
query to the coding table using text or numeric code, and selects the appropriate
option. The economic sector and occupation fields are not keyed at all, and coding
is performed directly from the questionnaire images.
The economic branch coders (and the geocoder) have access to the employers'
file of Israel, which contains names, codes, and addresses of places of
employment. This file helps to code partial or unclear information given by the
respondents.

The work environment at the coding stations is user friendly and enables:
viewing the image of the questionnaire page that has relevant written text;
viewing the image of all pages of the household;
accessibility via queries to the coding tables;
assigning a numeric code;
continuous receiving of statistical data.


Control of economic coding is not included in the optical data capture system, but
does depend on its technology. Coding items are retrieved from the optical archive,
sampled and sent for second coding on stand-alone PCs. An expert receives a coding
item in cases where the first code given by the ODE coder does not match the
second code. The expert's code replaces the code generated by the ODE only
when it is different from the first code.
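
The reconciliation rule for sampled coding items may be sketched as follows. The function and argument names are hypothetical; the sampling step itself is omitted, and an unsampled item is represented by passing no second code.

```python
def final_code(first, second=None, expert=None):
    """Return the code kept after control of one coding item.

    first:  code assigned by the ODE coder.
    second: code from the independent second coding (None if not sampled).
    expert: the expert's code, supplied only when first and second disagree.
    """
    if second is None or second == first:
        return first                 # not sampled, or the two codes agree
    if expert is None:
        raise ValueError("codes disagree: an expert decision is required")
    # The expert's code replaces the first code only when it differs;
    # when the expert confirms the first code, the result is unchanged.
    return expert


print(final_code(12))            # not sampled
print(final_code(12, 15, 15))    # expert sides with the second coder
print(final_code(12, 15, 12))    # expert confirms the ODE code
```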

File to the Optical Archive


At the end of the data capture process, a file is prepared to be sent to the optical
archive. This file includes all the data from the questionnaires, administrative data,
all statistical reports produced and used in the process and all audit trails of each
field. This file can be retrieved and already serves us for evaluation purposes.

File to the Central Computer


The file that is sent to the database in the central computer does not include all the
information accumulated in the ODE. It includes the complete census information and
a "tail" for each field, which enables:
identification of the main characteristics of the data capture process, such
as the status of the linkage to the external file, or a "canceled" status of
the record;
precise and more sensitive macro editing of the data. For example, if the
respondent answered that he is the son-in-law as well as the brother-in-law
of the reference person, the first answer is entered in the field and the
second answer is found in the tail.
The file that is sent to the central computer is the one that is edited to become the
final census file.
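
The idea of a field value accompanied by a tail can be sketched as below. The actual layout of the tail is not specified in this paper, so the dictionary structure, key names and flag names are hypothetical.

```python
def field_with_tail(answers, flags=()):
    """Split the answers found in one field into a main value and a tail.

    answers: marked values in questionnaire order.
    flags:   process characteristics kept with the field (hypothetical names).
    """
    value = answers[0] if answers else None
    tail = {"extra_answers": list(answers[1:]), "flags": list(flags)}
    return {"value": value, "tail": tail}


relation = field_with_tail(["son-in-law", "brother-in-law"], flags=["npr_linked"])
print(relation["value"])                  # first answer goes to the field
print(relation["tail"]["extra_answers"])  # second answer is kept in the tail
```

Keeping the second answer in the tail rather than discarding it is what allows macro editing in the central computer to decide later which value to retain.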


Concluding Remarks


Continuous interaction between the census planners and the system developers
created a functional system for Israel's 1995 census of population and housing.
However, the optical data capture system is a modular system, which enables use of
some of the modules or use of existing modules in combination with alternative
modules that can be developed in accordance with specific needs. This feature
allowed for differential intervention in the modules during the course of
development, where limited resources and minimum requirements have led to
developmental priorities.

Computerization of tasks that had been performed manually in the past, and the
reduction of micro-editing tasks while taking maximum advantage of the technological
improvements, have contributed to the building of a swift and high-quality census data
capture system.



Copyright © 1997-1999 The State of Israel. All rights reserved.