Optical Character Recognition (OCR) as a Digitization Technology

by Susan Haigh
Network Notes #37
ISSN 1201-4338
Information Technology Services
National Library of Canada

November 15, 1996

1. Introduction

This Network Notes provides an overview of Optical Character Recognition as it pertains to library digitization activities. The technologies and processes surrounding imaging and OCR are outlined. It is suggested that the decision to run OCR should be based on the projected use of the document and the conduciveness of the original to highly accurate OCR results. Factors that affect accuracy and throughput rates within an OCR operation are provided as a framework for determining the cost-effectiveness of this process compared with its alternatives, which include retaining the document as an image and rekeying the original into electronic form.

2. An Overview of Imaging and OCR

In the library context, digitization usually refers to the process of converting a paper- or film-based document into electronic form. 1 The electronic conversion is accomplished through imaging, a process whereby a document is scanned and an electronic representation of the original, in the form of a bitmap image, is produced. Optical Character Recognition is a subsequent, and optional, process that transforms a bitmapped image of printed text into text code, thereby making it machine-readable.

    2.1 The Imaging Process

    The imaging process involves recording changes in light intensity reflected from the document as a matrix of dots (sometimes called "rasters"). The light/colour value(s) of each dot is stored in binary digits. One bit would be required for each dot in a binary (black/white) scan; up to 32 bits could be required per dot for a colour scan.

    The resolution, or number of dots per inch (dpi) produced by the scanner, determines how closely the reproduced image corresponds to the original. The higher the resolution, the more information is recorded and, therefore, the greater the file size and its resultant storage and transmission bandwidth requirements. For example, 300 dpi achieves optimal OCR accuracy rates; 600 dpi is considered archival reproduction quality for an image, and could also be required if OCR is undertaken on extremely small-font text.
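    The relationship between resolution, bits per dot and file size can be sketched with simple arithmetic. The example below is illustrative only: the 8.5 x 11 inch page size and the function name are assumptions, and the figures are for uncompressed data, so a compressed image file would be smaller.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_dot):
    """Return the uncompressed size in bytes of a scanned page."""
    dots_wide = round(width_in * dpi)
    dots_high = round(height_in * dpi)
    total_bits = dots_wide * dots_high * bits_per_dot
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: 8.5 x 11 inches (not stated in the text).
for dpi, bits, label in [(300, 1, "binary"), (600, 1, "binary"), (300, 32, "colour")]:
    size = uncompressed_scan_bytes(8.5, 11, dpi, bits)
    print(f"{dpi} dpi, {bits}-bit {label}: {size / 1_000_000:.1f} MB")
```

    Doubling the resolution from 300 to 600 dpi quadruples the number of dots, which is why archival-quality scans carry a much heavier storage and bandwidth cost.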

    The image file format is typically a TIFF (Tagged-Image File Format) file.

    2.2 The OCR Process

    OCR involves five discrete processes subsequent to image capture:

    1. Identification of text and image blocks in the image: Most software uses white space to try to recognize the text in appropriate order. But complex formatting such as cross-column headings or tables must be manually delineated by "zoning" (identifying and numbering text blocks) prior to OCR. Images interspersed throughout the text will usually be ignored by the OCR software; they will be dropped from simple output formats such as ASCII.

    2. Character recognition: The most common method of character recognition, called "feature extraction", identifies a character by analyzing its shape and comparing its features against a set of rules that distinguishes each character/font.

    3. Word identification/recognition: Character strings are then compared with dictionaries appropriate to the language of the original.

    4. Correction: The OCR output is stored in a proprietary file format specific to the OCR software used (and the TIFF image is then usually discarded). The software highlights non-recognized characters or suspicious strings, and an operator inputs corrections.

    5. Formatting output: This post-OCR process converts the file into one or more of the output formats offered by the software, including ASCII, Word, RTF, other file formats, and in the case of Adobe Capture, PDF. These output formats will be discussed and assessed in a forthcoming Network Notes.
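    Steps 3 and 4 above can be illustrated with a very small sketch: words that do not appear in a dictionary are flagged for an operator to review. This is not how any particular OCR product works; the tiny lexicon, the sample text and the function name are all hypothetical, and real software also weighs character-level confidence.

```python
import re

# Hypothetical miniature lexicon; real OCR dictionaries are far larger.
LEXICON = {"the", "library", "scans", "printed", "text", "and", "stores", "it"}


def flag_suspicious_words(ocr_text, lexicon):
    """Return the words in ocr_text that are not found in lexicon."""
    # Treat any run of letters or digits as a word, so "1ibrary" stays whole.
    words = re.findall(r"[^\W_]+", ocr_text)
    return [w for w in words if w.lower() not in lexicon]


# Sample OCR output with two typical misreadings ("l" as "1", "e" as "c").
sample = "The 1ibrary scans printed tcxt and stores it"
print(flag_suspicious_words(sample, LEXICON))
```

    A dictionary check of this kind only catches errors that produce non-words; a misreading that yields another valid word would pass unflagged.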

    Selection criteria for OCR software include the OCR recognition method, recognition speed, font glossaries, output formats supported, languages supported, sophistication of dictionaries, advanced features (e.g., spell-checkers, WYSIWYG editors), and price. Two commonly used mid-range OCR software products suitable for smaller-scale library digitization and Web publishing activities are OmniPage Pro and WordScan. In recent in-house experimentation, Adobe Capture out-performed both of these for complex-formatted documents, because of the formatting retention features of the Adobe PDF output format. 2 Potential uses of this format will be the subject of a future Network Notes.

3. Weighing the Need for Machine Readability

A document image can be displayed onscreen (with clearly legible text) or printed. To the computer, however, the image is not "readable" -- it cannot be "understood" as anything other than dots. OCR will make the text machine-readable and should therefore be considered if:

  • the text is to be reused, edited or reformatted;

  • the text should be available for full-text information retrieval (e.g., by Internet search engines);

  • the text is to be coded in HTML or SGML;

  • the text should be available to adaptive equipment for the visually impaired;

  • file size is of concern (in terms of storage or bandwidth to transmit);

  • resources are available to perform OCR and correct the output.

In short, the decision to run OCR on a document should be based on the projected use of the document. The following table compares image and OCR characteristics:

Characteristic | Bitmapped image | OCR text
Size of file | Large | Fraction (e.g., <5%) of size of bitmapped image
Original formatting | Retained | Lost if ASCII; retained in part or entirely with some other output formats
Editing and reformatting | Not possible | Possible
Information retrieval (indexing) | Not possible | Possible
Reproduction accuracy | Provides a facsimile of the original. Although it can be manipulated (e.g., sharpened), it provides a true representation of the appearance of the original. | Not a true representation of the original. Processing can affect both text and format accuracy.
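
The file size comparison in the table can be illustrated with rough arithmetic. Every figure below is an assumption made for the sake of the example rather than a value from this document: an 8.5 x 11 inch page scanned in binary mode at 300 dpi, stored uncompressed, and holding about 3,000 characters of ASCII text at one byte each.

```python
# Assumptions (not from the text): an 8.5 x 11 inch page, about 3,000
# characters per page, one byte per ASCII character, no image compression.
DPI = 300
image_bytes = int(8.5 * DPI) * (11 * DPI) // 8   # 1 bit per dot
text_bytes = 3000

ratio_percent = 100 * text_bytes / image_bytes
print(f"image: {image_bytes} bytes, text: {text_bytes} bytes, ratio: {ratio_percent:.2f}%")
```

Under these assumptions the text is well under 5% of the image size; a compressed image would narrow the gap.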

4. Weighing OCR Against Rekeying

If uses that require machine-readability have been identified for the document, a second consideration is whether the material in question is conducive to the OCR process. Both OCR accuracy rate and the throughput rate for the entire process should be compared with the alternative, which is regeneration of the document by manually rekeying the text into the computer.

    4.1 Factors Affecting OCR Accuracy

    An accuracy rate exceeding 98% is often cited as necessary for document conversion (OCR) to be more efficient than rekeying. The accuracy rate is derived from the number of edits required (insertions, deletions, substitutions): this count is expressed as a percentage of the number of characters in the image and subtracted from 100%. High accuracy rates have proven perennially difficult to achieve for certain types of library material, including catalogue cards, multi-language texts, and historical items with faded type or unusual fonts. Accuracy can be affected by a number of factors:

    • Hardware and software variables such as: scanner quality; recognition method and algorithm; number and sophistication of font and word glossaries.

    • Scan resolution: The number of dots per inch can affect the clarity of the image and accuracy of OCR. Recent tests found that reducing from 300 dpi to 200 dpi increased the OCR error rate for a complex document by 75%; on the other hand, increasing from 300 to 400 dpi had negligible impact on OCR accuracy. 3

    • Generation of original: Second generation scans, such as from photocopies or microforms, will reflect quality factors that affected the first-generation copy as well as the second-generation copy. These factors may include resolution, condition, accuracy, completeness and legibility.

    • Binding of original: Inadequate gutter margins will distort text on a typical flatbed scanner. Book cradle-scanners ensure better image capture while preserving bindings.

    • Paper quality/typeface clarity:

      Broken characters resulting from pale type, and filled or touching characters stemming from excess ink or paper degradation, may not be recognized.

      Stains or marks on the paper will be captured on the bitmapped image, and OCR may try to interpret these as characters. For example, specks may be mistaken for accents.

      Inadequate contrast between text and background, such as with shaded or coloured backgrounds, may hamper recognition.

    • Typographical and formatting complexities:

      Variations in typeface (e.g., bold, italics) or font size may be lost or introduced, or result in "misunderstood" characters.

      Unusual fonts or characters, such as mathematical symbols and sub- and superscripts, may not be recognized by the software's repertoire of fonts.

      Handwriting is unrecognizable by standard OCR software. 4

      Cross-column headings, tables, indented text, footnotes, headers, text wrapped around images, and margin notes can all present problems to the presentation of the resulting text unless the scanned image has been zoned (i.e., text blocks and order delineated manually) before OCR occurs.

    • Linguistic complexities:

      Lexicons may be misapplied or character sets mixed, e.g., when more than one language dictionary is loaded.

      The character sets of certain languages might not be supported.
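
    The accuracy measure described at the start of this section, based on a count of insertions, deletions and substitutions, can be sketched as follows. The sketch assumes that a correct reference transcription is available for comparison and uses the standard Levenshtein edit distance to count edits; the function names and sample strings are illustrative only.

```python
def edit_distance(reference, ocr_output):
    """Count the minimum insertions, deletions and substitutions (Levenshtein)."""
    previous = list(range(len(ocr_output) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, out_char in enumerate(ocr_output, start=1):
            cost = 0 if ref_char == out_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def accuracy_rate(reference, ocr_output):
    """Return accuracy as a percentage of the characters in the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    edits = edit_distance(reference, ocr_output)
    return 100.0 * (1 - edits / len(reference))


reference = "Optical Character Recognition"
ocr_output = "Optica1 Charactcr Recognition"
rate = accuracy_rate(reference, ocr_output)
print(f"{edit_distance(reference, ocr_output)} edits, accuracy {rate:.1f}%")
print("meets 98% threshold" if rate > 98 else "below 98% threshold")
```

    Two wrong characters in a 29-character string already pull the rate down to about 93%, which illustrates how demanding the 98% threshold is.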

    4.2 Factors Affecting Workflow Throughput

    During scanning:

    • Bound vs. unbound texts: A bound item must be scanned one or two pages at a time, which translates into roughly 100 pages/hour. This is too labour-intensive for anything other than short articles. If the item can be unbound, a document feeder can be used. Alternatively, crisp, clear photocopies of bound originals can be scanned and document feeders can be used. (In some cases the copies could have captured two pages of original text at once.)

    • Dimensions of original: Small pages (e.g., paperbacks) and catalogue cards may be too small for standard automatic document feeders. Specialized feeders may be rented or purchased.

    • Type and resolution of scan: The more information being recorded, the longer the scan takes (a high-resolution colour scan of a complex document may take one to three minutes; a binary scan of straight, black text takes seconds).

    • Large amounts of black: As with fax machines, black areas on photocopies, or black or dark pages take slightly more time to scan.

    During OCR processing:

    • Recognition speed: Higher recognition speeds result in higher throughput, but also in potentially higher error rates.

    • Zoning: Images, complex formatting such as tables, inadequate gutter margins, etc. may require text blocks to be delineated and ordered by the operator so that OCR can interpret the arrangement of text properly.

    • Text-intensive pages: The more characters, the longer the OCR takes to run. (A telephone book page, for example, takes a couple of minutes to scan.)

    During post-OCR processing and output:

    • OCR accuracy: A three-step process may be required to identify and correct errors:

      1. The software identifies unrecognized characters or character strings, which the operator corrects manually. It has recently been found that this feature allows an operator to correct 20-45% of errors while examining only 0.5% of the text.

      2. Text is run through a standard spelling checker.

      3. Painstaking visual comparison can be made against the original.

    • Need for additional manipulation: The OCR software may not provide the destination format desired. For example, HTML coding, or parsing for entering into a database may be required.
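
    A rough estimate of the effort involved can be built from the figures cited in this section. In the sketch below, the rate of 100 pages/hour for bound items and the 20-45% share of errors corrected through flagged characters come from the text above; the job size (200 pages of about 2,500 characters each), the 98% accuracy rate and the function names are hypothetical.

```python
def scanning_hours(pages, pages_per_hour):
    """Hours needed to scan a document at a given throughput."""
    return pages / pages_per_hour


def expected_errors(total_chars, accuracy_percent):
    """Characters expected to need correction at a given accuracy rate."""
    return total_chars * (100 - accuracy_percent) / 100


# Hypothetical job: 200 bound pages of about 2,500 characters each at 98% accuracy.
pages, chars_per_page, accuracy = 200, 2500, 98.0

hours = scanning_hours(pages, pages_per_hour=100)  # bound-item rate cited above
errors = expected_errors(pages * chars_per_page, accuracy)
low, high = 0.20 * errors, 0.45 * errors           # share fixed via flagged characters

print(f"Scanning: {hours:.1f} hours")
print(f"Expected errors: {errors:.0f}")
print(f"Corrected in step 1: {low:.0f} to {high:.0f}")
```

    Even at 98% accuracy, a job of this size leaves thousands of errors for the spelling check and visual comparison steps, which is why correction time weighs heavily in any comparison with rekeying.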

5. Conclusion

While OCR reliability is continually improving, and prices for all ranges of OCR software are decreasing, it is important to evaluate materials for OCR case by case. Each material type, and each instance of a given material type, has its own set of idiosyncratic problems for OCR. A single factor, such as a tight binding, can potentially eliminate scanning/OCR as a digitization option.

On the other hand, the effect of many of the OCR-hampering factors cited above can be minimized by ensuring appropriate system set-up, such as loading appropriate language dictionaries, adjusting brightness settings, dpi, etc., and by choosing appropriate originals and employing the hardware necessary to ensure optimum throughput (e.g., document feeders). Recent in-house experimentation suggested that OCR can likely be performed cost-effectively on clear English, French or bilingual print documents, including some material presented in columns, employing a variety of standard fonts, or printed on poor-quality paper. Nevertheless, documents that present any one of the factors noted above should be carefully analyzed to ensure that OCR is the appropriate and cost-effective digitization solution.

1 "Digitization" can also refer to the process of transforming analog audio or video recordings into digital recordings.

2 However, a French-language dictionary will not be included in Adobe Capture until its next release in early 1997.

3 Nartker, Thomas A., Stephen V. Rice, and Frank R. Jenkins, "OCR Accuracy: UNLV's 4th Annual Test", Inform, Vol. 9, no. 7, July 1995, pp. 42-45.

4 A more recent technology is Intelligent Character Recognition, which employs sophisticated Artificial Intelligence-based recognition algorithms that can "learn" to recognize non-standard fonts and character styles. This technology is achieving increasing success but is focused on handprint recognition applications such as forms digitization. Handwriting recognition remains largely unsuccessful.

Canada Copyright. The National Library of Canada. (Revised: 1997-07-31).