Technology and Human
Engineering





Israel Givol
Chief Scientist Unit

Tel. 972-2-655 3251
FAX 972-2-655 3606
e-mail: stgivol@netvision.net.il





Introduction


The Optical Data Entry (ODE) system was specially built to transfer the data from the
questionnaires of the 1995 Census of Population and Housing in Israel into ASCII
code. Development of the system began in 1990, after the completion of the
operation of an optical system which was designed to replace conventional keying
methods for the census in Switzerland. Previous experience with technological
systems based on turning paper into an image followed by massive processing of the
image was not especially impressive. The Bureau held many discussions to clarify
the possibilities for entering data from the questionnaire, the principal ones being:
the traditional system, Direct Data Entry (DDE), which was well-known and
familiar;
an improved traditional system, where the keying is performed from a
displayed image of the questionnaire following scanning, while the rest of
the process is identical to the conventional procedure;
a system in which the computer, rather than a keying operator, identifies
the characters by means of Optical Character Recognition (OCR) systems;

and finally, beginning development of a new method called Optical Data Entry (ODE).

Development of the method was based on several new technical abilities in the field
of image processing, improved computer capabilities, and the ability of databases to
perform production line management tasks in addition to their traditional role of file
management. To reduce the risks of development, we utilized a development method
known as "evolutionary prototype." The Bureau managed the project while four
implementation centers were operated by companies who could carry out the
assignment according to the Request For Proposal (RFP). The task of development
of applications was given to a single supplier, Israel IBM Science & Technology
(IIS&T), which was able to perform all the programming tasks independently. The
tasks of supplying computer equipment, location of operations and hiring of
personnel were given to companies who were awarded the jobs through a public
tender.

Important milestones of the system:


In 1992, the Bureau conducted an experiment which was partially based on the
system used in Switzerland, to check if it could be used to carry out data capture
from the Israeli census questionnaires. The controlled experiment took place over
a period of half a year, and the key results were:
1.Special attention must be given to the construction of the questionnaire
and primarily: the actual paper, background colors, the black frames and
the way in which the questionnaire is to be filled out.
2.The scanning was the bottleneck of the system. It determined the speed
of the entire process. Also, massive loss of data could happen at this
step.
3.Workers' activities, performed while sitting in front of nothing but a
computer screen for hours on end without a break, create a hostile
environment and lead to a great many errors entering the system.
4.It is possible to design an advanced work environment, and to do so
we needed to develop several unique tools:
for keying, we developed three out of six methods that had been tested;
for editing, we developed Windows systems linked to a specific question
on the questionnaire;
for coding, a file system for quick queries.

In 1993, we began developing the system of data entry from the Census of
Population and Housing questionnaires, whose principal requirements were:
1.not to perform any manual actions on the questionnaires once they left
the responding household;
2.a scanning system which would not restrict the operator to the specific
page order, or to counting the pages of the questionnaires;
3.The keying would be performed using advanced methods;
4.Editing and coding would be performed on the basis of a (screen) image
consisting of five layers:
filling in,
the questionnaire,
the combination of the two,
the combination of the two together with the ASCII values,
linking windows (secondary lists, dialog boxes);
5.The census equipment would be determined through open tender, and
therefore the code would be written to an international standard;
6.The cost of the project could not exceed the budgetary framework
earmarked for the previous census (1983 Census).

In 1994, we performed a dress rehearsal with the system, and several issues
required special attention:
1.the performance issue (system production): it became clear that by
building separate modules that were linked together, we did not
reach the required pace (we reached about 25% of
what was planned).
2.the issue of accuracy: all points were handled surprisingly well, and
our achievements were impressive.
3.We implemented a drastic change to the system which included, inter alia:
replacing the hardware with computers that were four times more
powerful than planned;
changing the architecture of the system and giving the PCs additional
tasks at the expense of the network server (in the area of logical
checks and preventing editing errors);
creating a line management system which would enable us to perform three
important tasks: dividing tasks between day and night, setting
proper priorities for completion of processes, and guarding against
overloading the system to prevent a collapse;
improving the efficiency of the workers vis-à-vis the computers by
creating simple and convenient operating tools.
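
The three tasks of the line management system listed above (day / night
division, priorities, and overload protection) can be illustrated with a
minimal sketch. It is not the ODE implementation: the class and field
names, the shift hours and the capacity limit are all hypothetical and
were chosen only for illustration.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical values for illustration; the paper does not state them.
DAY_HOURS = range(7, 19)   # hours treated as the day shift
MAX_RUNNING = 3            # overload guard: cap on jobs running at once


@dataclass(order=True)
class Job:
    priority: int                        # lower number = more urgent
    name: str = field(compare=False)
    shift: str = field(compare=False)    # "day" or "night"


class LineManager:
    """Toy queue manager: shift-aware, priority-ordered, capacity-limited."""

    def __init__(self):
        self._queue = []     # heap of waiting jobs
        self.running = []    # jobs currently occupying capacity

    def submit(self, job):
        heapq.heappush(self._queue, job)

    def finish(self, job):
        # Compare by identity: Job equality only looks at the priority.
        self.running = [j for j in self.running if j is not job]

    def dispatch(self, hour):
        """Start eligible jobs in priority order, up to MAX_RUNNING."""
        shift = "day" if hour in DAY_HOURS else "night"
        started, deferred = [], []
        while self._queue and len(self.running) < MAX_RUNNING:
            job = heapq.heappop(self._queue)
            if job.shift == shift:
                self.running.append(job)
                started.append(job)
            else:
                deferred.append(job)     # wrong shift: put back afterwards
        for job in deferred:
            heapq.heappush(self._queue, job)
        return started
```

In this sketch a job submitted for the night shift simply waits in the
queue during the day, and no job is started once MAX_RUNNING jobs are
running, which is one simple way of guarding against overload.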

In 1995, we handled three large-scale projects:
1.Adjusting the system to the final questionnaire format and switching the
system over to use Compaq and Data General computer hardware;
2.setting up an ideal operations site, at a location that was carefully
selected and where special attention was paid to the operators' working
conditions;
3.recruitment, selection, training and placement of about 150 people for
system assignments:
system managers, editors, key operators, scanning operators,
computer operators (system administrators and system operators),
managers and administrators.


In 1996, with the system in operation, the main issues are:
1.the speed of the work: it is possible to achieve greater speed than was
planned.
2.accuracy: we achieved better accuracy than with any other method, but there are
still technical opportunities for improving accuracy.
3.cost: identical to what was planned (about $1 per respondent). Today it
is clear that the system cost can be reduced even further.

Structure of the System


The ODE system is composed of three technological components developed in
partnership with IIS&T (Israel IBM Science and Technology):
1.an image processing system: Optical Mark Recognition (OMR), Optical
Character Recognition (OCR), Form Drop-Out (FDO), file compression, cut
and paste and smart-key for operator effectiveness.
2.client-server ability and database management.
3.queue management and organization system that is highly capable of
controlling a production line.

The ODE application consists of six sub-systems, each of which stands on its own,
with a dedicated data flow system (including rate, scale and required accuracy of
work). An additional system handles data transfer from one stage to the next. The
six sub-systems are:

1. Scanning sub-system, through which the paper questionnaire turns into an
image. There are several processes in this sub-system:
1.scanning management system;
2.image processing system and identification of the form's unique number;
3.registration system of the questionnaire (24 different pages);
4.questionnaire adjusting and straightening system (stretching and
contracting) in order to implement FDO, OMR and OCR;
5.insertion of the data into the database.

2. Smart-keying sub-system, through which an operator improves the machine's
results:
1.verification of all characters, together with their OCR
identification status, by keying with the smart-key tools;
2.edit checks and record linkage to the National Population Register for
values verification;
3.Comparison of values in order to determine type of further handling
(keying regimes 2 and 3 or referral directly to the editing stage).
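
The comparison step in item 3 can be sketched as follows. The paper does
not give the actual decision rule, so the rule used here is an assumption:
a field goes to editing as soon as two independent readings (the OCR
result or an operator's keying) agree, and otherwise it is passed to the
next keying regime, up to three. The function name next_step is
hypothetical.

```python
from collections import Counter


def next_step(ocr_value, keyed_values):
    """Decide where a field goes next (assumed rule, not the ODE one).

    ocr_value    -- the string read by the OCR engine
    keyed_values -- values typed so far by operators, in keying order
    Returns (destination, agreed_value_or_None).
    """
    readings = [ocr_value, *keyed_values]
    value, votes = Counter(readings).most_common(1)[0]
    if keyed_values and votes >= 2:
        return ("editing", value)        # two readings agree
    if len(keyed_values) >= 3:
        return ("editing", None)         # unresolved: leave it to the editor
    return ("keying_regime_%d" % (len(keyed_values) + 1), None)
```

For example, next_step("1957", ["1951"]) returns ("keying_regime_2", None),
and next_step("1957", ["1951", "1951"]) returns ("editing", "1951").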

3. Editing and coding sub-system, in which the working unit is an item based on:
logical checks at several levels: the field, the question, the individual record, the
household record or the Enumeration Area (EA). The main activity of the editor /
coder concerns correcting data capture problems that remain following the keying
stage, fields that were not coded, linking each record with the National Population
Register, checking the data capture in fields that were not within range or were
contradictory to data in another field. The work method is based on:
1.images of all pages of the questionnaires belonging to the same
household and the ASCII values of their fields;
2.a secondary window in which flipping through the pages of the same
household, or the pages of another household in the EA, is possible;
3.dialog box and list box of the tables existing in the database;
4.tools given to the editors and coders that enable them to get information
on the EA, the household or the questionnaire they are handling. They
can also get various display possibilities (on the screen) of editing and
coding problems. Throughout the editing and coding steps, they could
utilize external auxiliary files.
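
The kinds of logical checks described for this sub-system (fields not
coded, values out of range, values contradicting another field) can be
illustrated at three of the levels mentioned: the field, the individual
record and the household record. This is a sketch only: the field names
(age, year_of_birth, relation), the age limit and the tolerance of one
year are hypothetical and do not come from the census questionnaire.

```python
CENSUS_YEAR = 1995


def field_checks(person):
    """Field level: is the value coded and within range?"""
    age = person.get("age")
    if age is None:
        return [("age", "not coded")]
    if not 0 <= age <= 120:              # assumed plausible range
        return [("age", "out of range")]
    return []


def record_checks(person):
    """Individual record level: do two fields contradict each other?"""
    age, born = person.get("age"), person.get("year_of_birth")
    if age is not None and born is not None:
        # allow one year of slack for birthdays after census day
        if abs((CENSUS_YEAR - born) - age) > 1:
            return [("age/year_of_birth", "contradictory")]
    return []


def household_checks(persons):
    """Household level: exactly one head of household is expected."""
    heads = sum(1 for p in persons if p.get("relation") == "head")
    return [] if heads == 1 else [("relation", "%d heads" % heads)]
```

Each function returns a list of (field, problem) pairs, which is the kind
of item an editor or coder would then be shown next to the image.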

4. ASCII file creation sub-system, which produces the file to be sent to the
mainframe (the ICBS central computer):
Three sets of information are created at this stage: the scanned images;
the statistical information created during the process; and the information and
data transferred to the central computer. This stage included, among other things:
1.extraction of the values of the fields from the database table, including
"flags" (the status) of each and every field;
2.construction of a hierarchical file at three levels: EA, household,
individual;
3.detailed statistical data for each and every EA, including administrative
data.
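
A hierarchical file at the three levels named above, with a flag attached
to every field, might look like the following sketch. The record layout
(E / H / P record types, the "|" and ";" separators) and the flag letters
are assumptions made for illustration; the paper does not describe the
actual file format.

```python
def build_ea_file(ea):
    """Return the ASCII text of one EA as EA / household / individual records."""
    lines = ["E|{}".format(ea["ea_no"])]
    for household in ea["households"]:
        lines.append("H|{}".format(household["dwelling_no"]))
        for person in household["persons"]:
            # every field is written as name=value:flag
            fields = ";".join(
                "{}={}:{}".format(name, value, flag)
                for name, (value, flag) in person["fields"].items()
            )
            lines.append("P|{}|{}".format(person["person_no"], fields))
    text = "\n".join(lines) + "\n"
    text.encode("ascii")   # raises UnicodeEncodeError if not pure ASCII
    return text


# Hypothetical flags: "O" = accepted from OCR, "K" = keyed by an operator.
example = {
    "ea_no": "0417",
    "households": [
        {"dwelling_no": "012",
         "persons": [{"person_no": 1,
                      "fields": {"age": ("38", "K"), "sex": ("2", "O")}}]},
    ],
}
print(build_ea_file(example), end="")
```

With the example above, the file consists of the three lines "E|0417",
"H|012" and "P|1|age=38:K;sex=2:O".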

5. Archive sub-system, in which all the information at the EA level is saved in
the following formats:
1.WORM: Write Once Read Many (for image and ASCII);
2.DAT cassette of the file that is sent to the central computer (ASCII
only);
3.CD/R disk (re-writable) containing the information (image and ASCII)
arranged in a way that facilitates quick retrieval by pre-determined
keys.

The archive sub-system is built so that the final status of a given activity
can be retrieved quickly. Three retrieval keys have been defined: the
individual identification number; the EA number together with the dwelling
number in the EA; and the questionnaire number. The existing system
allows all paper questionnaires to be sent for recycling, because all the
information is stored on only a few dozen CDs.
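
Retrieval by the three keys can be pictured as three look-up tables
pointing at the same archived item. The sketch below is an illustration
under assumptions: the class ArchiveIndex, its method names and the form
of a "location" (here a disk label and a position) are hypothetical and
are not taken from the ODE archive.

```python
from collections import defaultdict


class ArchiveIndex:
    """Three pre-determined retrieval keys over the same archived items."""

    def __init__(self):
        self._by_person = {}                     # individual ID -> location
        self._by_dwelling = defaultdict(list)    # (EA, dwelling) -> locations
        self._by_questionnaire = {}              # questionnaire no. -> location

    def add(self, location, questionnaire_no, ea_no, dwelling_no, person_ids):
        self._by_questionnaire[questionnaire_no] = location
        self._by_dwelling[(ea_no, dwelling_no)].append(location)
        for person_id in person_ids:
            self._by_person[person_id] = location

    def find_person(self, person_id):
        return self._by_person.get(person_id)

    def find_dwelling(self, ea_no, dwelling_no):
        return list(self._by_dwelling.get((ea_no, dwelling_no), []))

    def find_questionnaire(self, questionnaire_no):
        return self._by_questionnaire.get(questionnaire_no)
```

A dwelling may map to several locations because one household can fill in
more than one questionnaire, whereas a person or a questionnaire number
maps to a single location in this sketch.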

6. Command and line-control sub-system. We can define three control
components which operated automatically in the system:
1.inspections of hardware, basic software and communications software,
testing for the number of files created at the start and at the end of the
process, and examining the computers' work load. The data received
enabled detailed planning of the system's daily and weekly activities.
2.creation of a statistical mechanism for the data collected during
processing, thereby enabling analysis of the massive amount of data
created by the system (numbers of problems, general and relative work
times, automatic vs. manual record linkage, quality of keying and level of
accuracy of the OCR system).
3.The system manager received detailed information on all the activities
within the system, which included: the status of the flow
of questionnaires of the EAs handled at that time, the status of work at the
PCs, and the production information recorded when an activity ended. This
information enabled the manager to identify when an EA had
completed the process and could be transferred to the Bureau's
central computer.
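
The last point, identifying when an EA has completed the process, can be
sketched as a simple status report per EA. The stage names and the rule
that an EA is ready once every questionnaire has reached the last stage
are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical stage names, in processing order.
STAGES = ("scanned", "keyed", "edited")


def ea_report(stage_by_questionnaire):
    """Summarize one EA from a mapping {questionnaire number: current stage}."""
    counts = Counter(stage_by_questionnaire.values())
    total = len(stage_by_questionnaire)
    return {
        "counts": {stage: counts[stage] for stage in STAGES},
        # ready only if the EA is non-empty and everything is fully edited
        "ready_for_transfer": total > 0 and counts[STAGES[-1]] == total,
    }
```

For an EA with questionnaires at mixed stages the report shows how many
are at each stage and marks the EA as not yet ready for transfer.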

Lessons and conclusions


The technological lessons that can be drawn from the ODE system are in three
spheres:
1.in the scanning process: improvement in scanning quality and expanding
OCR capability.
2.the PC station: reduction of errors and technical problems and widening
the scale of tasks for implementation at the station.
3.In the sphere of control, integration of all control tasks (staff control,
process control and product control) is called for.

Organization and management lessons:
1.more efficient preparation of work vis-à-vis the companies and service
suppliers (implementation of trials and preliminary tests).
2.construction of a non-dedicated system that will permit rapid and cheap
conversion to data capture of other surveys.

The process of transferring information from paper to ASCII code, as was done
with the Israeli census, is the first stage towards a system which
should include the following additional components:
1.staff control (in addition to what already exists for keying).
2.process control (mainly for editing and coding).
3.completeness of data.
4.including CAPI and CATI at the beginning of the process, in addition to
the paper questionnaires.
5.automatic or semi-automatic coding.
6.improving accuracy.
7.improving speed.

The overall cost of $1 per respondent enables us to carry out the
appropriate development and high-quality operation, and to achieve three important
results:
1.accurate and reliable information.
2.in-house personnel who are professional and highly motivated to
perform additional tasks.
3.valuable, advanced computer equipment which will improve the general
performance of the CBS.

In conclusion, I propose that a discussion be held on the subject of the information
flow process (scheduler). It is necessary to decide on the desired system
characteristics in light of the following four variables:
1.the level of identification of the OCR system at five levels - 100% with
0% identification errors, 80% with 5% identification errors, 60% with 20%
identification errors, 40% with 50% identification errors, and 0% i.e.,
cannot be deciphered.
2.Identification by the OCR is also based on prior information on the
reasonable response, but the identification level of an entire field can
be checked.
3.Handling of information begins from the isolated position, but 15% of
the information includes erasures and cancellations or just plain
extraneous lines that were added in error. The question then is whether
to handle the information at the household level or the individual level
prior to handling it at the isolated position level.
4.Every field on the questionnaire has different accuracy requirements.
The question is whether to establish one process (for maximum accuracy)
or several processes according to the type of field and the control files
which exist for it.
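
To make the first, second and fourth variables concrete, the following
sketch estimates the reliability of an entire field from the levels of its
characters and compares it with the accuracy required for that field. It
rests on two assumptions that are not in the text: each of the five levels
is read as a class of characters with the stated error rate (with the 0%
level counted as always wrong), and errors in different characters are
treated as independent. The function names and the required-accuracy
values are hypothetical.

```python
import math

# Identification level (%) -> identification error rate, as listed above.
# Treating level 0 ("cannot be deciphered") as a certain error is an assumption.
ERROR_RATE = {100: 0.00, 80: 0.05, 60: 0.20, 40: 0.50, 0: 1.00}


def field_reliability(levels):
    """Probability that every character of the field was read correctly,
    assuming independent character errors."""
    return math.prod(1.0 - ERROR_RATE[level] for level in levels)


def choose_process(levels, required_accuracy):
    """Accept the OCR result only if the field meets its own requirement."""
    if field_reliability(levels) >= required_accuracy:
        return "accept"
    return "send to keying"
```

Under these assumptions a three-character field with levels 100, 80 and 80
has a reliability of 0.95 x 0.95 = 0.9025, so it would be accepted for a
field that requires 0.90 but sent to keying for a field that requires 0.99.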



Copyright © 1997-1999 The State of Israel. All rights reserved.