The World of Scanning






Oren Kagan
Israel IBM Science & Technology

Tel. 972-4-829 6420
FAX 972-2-829 6112
e-mail: okagan@vnet.ibm.com




The "world of scanning" includes all of the procedures and technologies for
transferring information written on paper into a computer-accessible medium, with
minimal loss of information and maximum utilization of the computer's resources.

Technologies of the "world of scanning" include:
scanning - "photographing" to create an image in the computer's memory.
compressing the data to enable efficient storage in the computer's memory.
identifying the image and decoding the written, textual data (optical character
recognition, or OCR).

In addition, other technologies to be borne in mind are:
data verification technologies - keying in (Smartkey), editing and coding.
communications, retrieval, and data display.
forms logic.
database.

Scanning

Scanning documents is a process of photographing the document so that an
image of it is created in the computer's memory.
The scanning process is a sampling process, in which the document is
sampled at a pre-defined density (resolution) and the dots are scanned in a
definite order.
The scanning order is from left to right, from top to bottom (raster).
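
As a minimal sketch of the raster order, the position of a dot in the scan
sequence can be computed from its row and column. The function names and the
line width of 6 dots are hypothetical, chosen only for illustration.

```python
def raster_index(row, col, width):
    """Position of dot (row, col) in the scan sequence, counting from 1."""
    return row * width + col + 1


def raster_position(index, width):
    """Inverse mapping: (row, col) of the dot scanned at 1-based position index."""
    return divmod(index - 1, width)


WIDTH = 6  # hypothetical number of dots per scan line

# Number the dots of the first three scan lines in raster order.
order = [[raster_index(r, c, WIDTH) for c in range(WIDTH)] for r in range(3)]
for line in order:
    print(line)
```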

[Figure: raster scanning order, with the dots numbered 1, 2, 3, ... 18, ... in
the order in which they are scanned.]

The scanning resolution is measured in dots per inch (dpi).
Accepted values for scanning documents are usually 200 dpi or 300 dpi.
(For the Israeli census, we used 240 dpi.)
A document on 11" x 17" paper (close to A3 in size) has:
11 x 17 x 240^2 = 10,771,200 dots or pixels.
The number of bits that represent the color of each pixel is expressed as
bits per pixel, or BPP.
A black and white scan uses 1 BPP (black = 1 and white = 0); a gray scale
scan uses more bits according to the number of gray levels, up to 8 BPP
(256 gray scale levels).
Therefore, the size of the scanned file we get at 240 dpi, 1 BPP, is
10,771,200/8 = 1,346,400 bytes (about 1.35 Mbytes), and at 8 BPP, we get
about 10.8 Mbytes.
This is why the data must be compressed.
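
The arithmetic above can be checked with a short sketch. The function name
scan_size is hypothetical; the page size, resolution and bit depths are the
ones quoted in the text.

```python
def scan_size(width_in, height_in, dpi, bpp):
    """Return (pixel count, uncompressed size in bytes) for one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels, pixels * bpp // 8


for bpp in (1, 8):
    pixels, size = scan_size(11, 17, 240, bpp)
    print(f"{bpp} BPP: {pixels:,} pixels, {size:,} bytes ({size / 1e6:.2f} Mbytes)")
```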
When we scan at 1 BPP (black and white), we need to define a threshold that
tells the scanner which gray scale levels will be scanned as all white and
which will be scanned as all black.
Sometimes the threshold is dynamic: it is set for each dot relative to the
gray levels of its surroundings.
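
A minimal sketch of the two kinds of threshold, assuming 8-bit gray levels
where 0 is black and 255 is white. The function names, the threshold of 128,
the neighbourhood radius and the offset are all hypothetical values chosen for
illustration, not the settings of any particular scanner.

```python
def global_threshold(gray, threshold=128):
    """Fixed threshold: a dot is black (1) if it is darker than the threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def dynamic_threshold(gray, radius=1, offset=10):
    """Dynamic threshold: a dot is black (1) if it is darker than the mean of
    its neighbourhood by more than offset gray levels."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neigh = [gray[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(neigh) / len(neigh)
            row.append(1 if gray[y][x] < mean - offset else 0)
        out.append(row)
    return out


# A light page (220) with one ink dot (40) and one faint pencil dot (170).
page = [
    [220, 220, 220, 220, 220],
    [220, 40, 220, 170, 220],
    [220, 220, 220, 220, 220],
]
fixed = global_threshold(page)
dynamic = dynamic_threshold(page)
print(fixed[1])    # the fixed threshold keeps only the ink dot
print(dynamic[1])  # the dynamic threshold also keeps the pencil dot
```

In this example the faint dot is lost with the fixed threshold but survives
with the dynamic one, because it is judged against its light surroundings.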

Compression

Most of the prevalent compression methods (the CCITT G3 and G4 standards and
their MR and MMR codings) are based on coding the runs of consecutive black
and white dots, and sometimes on gradually adapting to the regularities
within the document.
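
As a rough sketch of the underlying idea only, one scan line can be reduced to
the lengths of its alternating white and black runs. The real G3/G4 codings go
further (they assign variable-length codes to the run lengths, and MR/MMR also
code each line relative to the previous one); none of that is shown here. The
function names are hypothetical, and the scan line is assumed to be non-empty.

```python
from itertools import groupby


def run_lengths(row):
    """Return (first color, list of run lengths) for a non-empty line of 0/1 dots."""
    runs = [len(list(group)) for _, group in groupby(row)]
    return row[0], runs


def decode(first_color, runs):
    """Rebuild the scan line from its first color and its run lengths."""
    out, color = [], first_color
    for n in runs:
        out.extend([color] * n)
        color ^= 1  # runs alternate between white (0) and black (1)
    return out


row = [0] * 10 + [1] * 3 + [0] * 7   # 20 dots: white, a short black stroke, white
first, runs = run_lengths(row)
print(first, runs)                   # three numbers describe all 20 dots
```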
A totally black image, or a totally white image will require a minimum amount
of storage space. Similarly, an image that is full of dots (one white, one
black...) will also require the minimum if we use adaptive compression.
In contrast, white noise (random) cannot be compressed at all.
If we are talking about real images (rather than synthetic ones), a rule of
thumb to remember is: the less data (black) there is in the image the better
it will compress.
The "form dropout" method uses this principle to obtain maximum compression.
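
The claims above can be tried out with a small experiment. This sketch uses
Python's zlib (DEFLATE) as a stand-in for an adaptive compressor; it is not
MMR, and the test images are synthetic, so the sizes only illustrate the trend.
The image size and the speckle fractions are hypothetical.

```python
import random
import zlib

N = 100_000              # bytes per test image, i.e. 800,000 dots at 1 BPP
rng = random.Random(0)   # fixed seed so the experiment is repeatable


def speckled(fraction):
    """Mostly white image in which about `fraction` of the bytes hold one black dot."""
    data = bytearray(N)
    for i in range(N):
        if rng.random() < fraction:
            data[i] = 1 << rng.randrange(8)
    return bytes(data)


images = {
    "all white": bytes(N),
    "alternating dots": bytes([0b01010101]) * N,
    "sparse black dots": speckled(0.01),
    "dense black dots": speckled(0.30),
    "random noise": bytes(rng.getrandbits(8) for _ in range(N)),
}

sizes = {name: len(zlib.compress(data, 9)) for name, data in images.items()}
for name, size in sizes.items():
    print(f"{name:18s} {size:7,d} bytes compressed (from {N:,})")
```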

Form Dropout

The compression method using form dropout works well when there is a
known set of scanned forms.
The method works in the following steps:
identifying the type of form (from the pre-defined set), either by
identifying some recognized field stamped on the form
or by identifying the "form signature."
straightening out the scanned form against a template image, which includes:
shifting, deskewing and correction of linear (zoom) and non-linear
deviations (creases, mechanical instability of the scanner, etc.)
subtraction of the template image from the scanned form, to obtain the
resulting "dropped out" image.
conventional compression of the dropped out form (MMR).
The "form dropout" method improves the compression by a factor of about 10
relative to conventional compression, but naturally, this depends on the type
of form and the average amount of written fill-in relative to the amount of
printed permanent text.
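
A minimal sketch of the alignment and subtraction steps on a tiny binary image
(1 = black). It handles only a small shift, found by brute-force search;
deskewing, zoom and non-linear corrections are left out. All function names
and the test form are hypothetical.

```python
def shift(image, dx, dy):
    """Move a binary image by (dx, dy) dots, filling exposed dots with white (0)."""
    h, w = len(image), len(image[0])
    return [[image[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else 0
             for x in range(w)] for y in range(h)]


def overlap(a, b):
    """Number of dots that are black in both images."""
    return sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def register(scan, template, max_shift=2):
    """Try every small shift and keep the one that best matches the template."""
    shifts = [(dx, dy)
              for dy in range(-max_shift, max_shift + 1)
              for dx in range(-max_shift, max_shift + 1)]
    best = max(shifts, key=lambda s: overlap(shift(scan, *s), template))
    return shift(scan, *best), best


def dropout(aligned, template):
    """Keep only the dots that are black in the scan but not in the template."""
    return [[p & (1 - q) for p, q in zip(ra, rb)] for ra, rb in zip(aligned, template)]


# Template: a printed rectangular frame on a 7 x 9 dot form.
H, W = 7, 9
template = [[1 if 1 <= y <= 5 and 1 <= x <= 7 and (y in (1, 5) or x in (1, 7)) else 0
             for x in range(W)] for y in range(H)]

# Filled-in form: the frame plus two handwritten dots, scanned one dot off.
filled = [row[:] for row in template]
filled[3][3] = filled[3][4] = 1
scan = shift(filled, 1, 1)

aligned, best = register(scan, template)
dropped = dropout(aligned, template)
print("correction applied:", best)
for row in dropped:
    print("".join("#" if p else "." for p in row))
```

Only the handwritten dots remain in the dropped out image, which is what makes
the conventional compression that follows so much more effective.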
Additional advantages are:
When the form is identified, you can tell what type it is.
When the form is straightened, you can anticipate the precise location of the
identification fields.
If dropout fails, this is a sign that the scanning was probably particularly
poor.
Disadvantages:
over-subtraction: areas where the printed text and the written fill-in
overlap are dropped out.
sensitivity to noise.
problems when a form is not identified, or if dropout has failed.
Reconstruction is executed by combining the dropped out image of the
written fill-in with the image of the form template.
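
Reconstruction can be sketched as a dot-by-dot OR of the dropped out image
with the template, assuming the two are perfectly aligned. The example also
shows the over-subtraction mentioned above: the dot where the handwriting
crosses the printed line is missing from the dropped out image, but it comes
back on reconstruction because the template is black there. Function names and
images are hypothetical.

```python
def dropout(scan, template):
    """Dots that are black in the scan but white in the template."""
    return [[p & (1 - q) for p, q in zip(rs, rt)] for rs, rt in zip(scan, template)]


def reconstruct(dropped, template):
    """Combine the dropped out fill-in with the template image."""
    return [[p | q for p, q in zip(rd, rt)] for rd, rt in zip(dropped, template)]


template = [[1, 1, 1, 1, 1],     # a printed line
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0]]
fill_in = [[0, 0, 1, 0, 0],      # a handwritten stroke crossing the line
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
scan = [[p | q for p, q in zip(rt, rf)] for rt, rf in zip(template, fill_in)]

dropped = dropout(scan, template)
restored = reconstruct(dropped, template)
print("dropped out image equals the fill-in:", dropped == fill_in)
print("reconstruction equals the scan:", restored == scan)
```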

Scanning difficulties
double feed of forms
completeness of scanning packets
adjustment of the scanner in terms of black level threshold
    automatic adjustment of the scanner + feedback check.
data entry of fill-in answers that were written weakly (in pencil, for example)
    adjustment, scanning at gray scale levels
    physical reinforcement of the written fill-in, by going over the
    questionnaires that have weak writing.
mechanical wear and tear, the bulbs get weaker over time, mechanical
problems primarily with the feeder
    ongoing maintenance, cleaning and care.
identification failure of a scanned questionnaire
    repeat the scanning

Planning the questionnaire
If we could design the ideal questionnaire for the purpose of automatic computerized
processing and scanning, the design would have to take the following points into
consideration:
the medium: paper
    the size of the paper
    the thickness of the paper - to prevent double feed and show-through of
    text from one side to the other
    texture - how the ink is absorbed, transparency
    method of attachment (physical) / folding
the medium: ink
    color of the printing (level of black), intensity of the printing
    background color
    colors in general - contrast - noise
    dropout ink method for the scanner
graphics / design
    distance between fill-in fields and printed fields - spacing
    clear guides for the respondent filling in the form to separate the
    letters, without any colored squares that will be seen by the scanner
    a form that is simple to identify and to deskew - printed lines and frames
    distance of the text from the margins

Deciphering the data (OCR)
Identifying the fill-in on the questionnaire:
pre-printed areas on the questionnaires
bar code labels
numeric fill-in fields (numerals)
X-choice fields
textual fill-in fields (alpha)
erased fields, erased questions, or erased pages
The OCR technology includes:
character separation
detection of character features
measuring the "distance" between a scanned character and characters in a
database - mostly by using neural nets.
feedback and finding the global optimum over the various character
separation options.
feedback and finding the global optimum using logical constraints.
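
The last step can be illustrated with a small sketch: each character position
has several candidate readings, each with a "distance" (lower is better), and
the field as a whole must satisfy a logical constraint. Here the field is
assumed to be a two-digit month, and the distances are made-up numbers standing
in for the output of a classifier; the function name is hypothetical.

```python
from itertools import product


def best_reading(candidates, is_valid):
    """Return (text, total distance) of the valid reading with the smallest
    total distance, or None if no combination satisfies the constraint.

    candidates: one dict per character position, mapping character -> distance.
    """
    best = None
    for combo in product(*(c.items() for c in candidates)):
        text = "".join(ch for ch, _ in combo)
        total = sum(d for _, d in combo)
        if is_valid(text) and (best is None or total < best[1]):
            best = (text, total)
    return best


# Hypothetical distances for a month field written as two digits.
month_candidates = [
    {"7": 30, "1": 35},   # first digit looks slightly more like a 7 than a 1
    {"2": 10, "7": 50},   # second digit is almost certainly a 2
]


def is_month(text):
    return text.isdigit() and 1 <= int(text) <= 12


greedy = "".join(min(c, key=c.get) for c in month_candidates)
print("best character by character:", greedy)
print("best valid field:", best_reading(month_candidates, is_month))
```

Taken character by character the field reads "72", which is not a month; the
constraint moves the result to the slightly more distant but valid "12".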

Neural nets
Advantages:
training of the system according to the real data, adapting the algorithm to
environmental conditions: scanner, resolution, scanning distortion, style of
writing (throughout the country / region).
quick adaptation, quick addition of supplementary nets for special characters,
foreign languages or other formats of writing.
broad theoretical support, a clear and precise model
can learn and adapt themselves during the deciphering process.
Disadvantages:
difficulty following the behavior of the model, identifying bottlenecks.
difficulty adding improvements, beyond training.
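
To make "training according to the real data" concrete, here is a toy sketch of
the simplest possible net, a single perceptron, learning to separate two
classes from labelled samples. It is far smaller than any net used for OCR, and
the two features (say, relative width and ink density of a character) and the
sample values are hypothetical.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted sum of the features is positive, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


def train(samples, epochs=1000, rate=0.1):
    """Perceptron rule: adjust the weights only when a sample is misclassified."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in samples:
            delta = label - predict(weights, bias, x)
            if delta:
                errors += 1
                weights = [w + rate * delta * v for w, v in zip(weights, x)]
                bias += rate * delta
        if errors == 0:   # every training sample is now classified correctly
            break
    return weights, bias


# Hypothetical training data: (features, label), label 1 = "one", 0 = "zero".
samples = [
    ((0.20, 0.10), 1), ((0.30, 0.20), 1), ((0.25, 0.15), 1),
    ((0.80, 0.70), 0), ((0.70, 0.90), 0), ((0.90, 0.80), 0),
]
weights, bias = train(samples)
print("learned weights:", weights, "bias:", bias)
print("all training samples correct:",
      all(predict(weights, bias, x) == label for x, label in samples))
```

Adapting to a different scanner or writing style then means repeating the
training with samples taken under the new conditions. The sketch also hints at
the disadvantage noted above: the learned weights say little about why a given
character was misread.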


Challenges for the near future:
improving the OCR technology
technology based on 8 BPP / Gray scale scanning
    compression
    form dropout and reconstruction
    OCR (for deciphering handwriting)
as above, developing technologies based on color scanning
variable scanning resolution in different areas.
handling general forms, identifying form layout.
handling a permanent set of forms where each form in the set has a large
number of variations (mutants).
integrated hardware for scanning, compression and deciphering.
automatic definition of forms (automatic analysis of the form).
integration for full application, combined within an expert system.



Copyright © 1997-1999 The State of Israel. All rights reserved.