Optical Character Recognition (OCR) as a Digitization Technology
by Susan Haigh
Network Notes #37
ISSN 1201-4338
Information Technology Services
National Library of Canada
November 15, 1996
1. Introduction
This Network Notes provides an overview of Optical Character Recognition as it pertains to library digitization activities. The technologies and processes surrounding imaging and OCR are outlined. It is suggested that the decision to run OCR should be based on the projected use of the document and the conduciveness of the original to highly accurate OCR results. Factors that affect accuracy and throughput rates within an OCR operation are provided as a framework for determining the cost-effectiveness of this process compared with its alternatives, which include retaining the document as an image and rekeying the original into electronic form.
2. An Overview of Imaging and OCR
In the library context, digitization usually refers to the process of converting a paper- or film-based document into electronic form (see note 1). The electronic conversion is accomplished through imaging, a process whereby a document is scanned and an electronic representation of the original, in the form of a bitmap image, is produced. Optical Character Recognition is a subsequent, and optional, process that transforms a bitmapped image of printed text into text code, thereby making it machine-readable.
2.1 The Imaging Process
The imaging process involves recording changes in light intensity reflected from the document as a matrix of dots (sometimes called "rasters"). The light/colour value(s) of each dot is stored in binary digits. One bit would be required for each dot in a binary (black/white) scan; up to 32 bits could be required per dot for a colour scan. The resolution, or number of dots per inch (dpi) produced by the scanner, determines how closely the reproduced image corresponds to the original. The higher the resolution, the more information is recorded and, therefore, the greater the file size and its resultant storage and transmission bandwidth requirements. For example, 300 dpi achieves optimal OCR accuracy rates; 600 dpi is considered archival reproduction quality for an image, and could also be required if OCR is undertaken on extremely small-font text. The image file format is typically a TIFF (Tagged-Image File Format) file.
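The relationship between resolution, bits per dot, and file size can be illustrated with a short calculation. The sketch below is only an estimate under stated assumptions: the page is taken to be 8.5 x 11 inches (a size not specified in this note), the image is uncompressed, and file-format overhead such as TIFF headers or row padding is ignored. The function name is hypothetical.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_dot):
    """Estimate the uncompressed size, in bytes, of one scanned page.

    Assumes a simple dot matrix with no compression, header, or row padding.
    """
    dots_wide = round(width_in * dpi)
    dots_high = round(height_in * dpi)
    total_bits = dots_wide * dots_high * bits_per_dot
    return (total_bits + 7) // 8  # round up to a whole number of bytes


# Assumed example page: 8.5 x 11 inches.
for dpi, bits in [(300, 1), (600, 1), (300, 24)]:
    size = uncompressed_scan_bytes(8.5, 11, dpi, bits)
    print(f"{dpi} dpi, {bits} bit(s) per dot: {size / 1_000_000:.1f} MB")
```

Under these assumptions, doubling the resolution from 300 to 600 dpi quadruples the number of dots, so a binary scan grows from roughly 1.1 MB to roughly 4.2 MB, while a 24-bit colour scan at 300 dpi is roughly 25.2 MB. Compression, which is common in practice, would reduce these figures.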
2.2 The OCR Process
OCR involves five discrete processes subsequent to image capture:

3. Weighing the Need for Machine Readability
A document image can be displayed onscreen (with clearly legible text) or printed. To the computer, however, the image is not "readable" -- it cannot be "understood" as anything other than dots. OCR will make the text machine-readable and should therefore be considered if:
4. Weighing OCR Against Rekeying
If uses that require machine-readability have been identified for the document, a second consideration is whether the material in question is conducive to the OCR process. Both OCR accuracy rate and the throughput rate for the entire process should be compared with the alternative, which is regeneration of the document by manually rekeying the text into the computer.
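One way to measure an OCR accuracy rate on a sample page is to count the character edits (insertions, deletions, substitutions) needed to turn the OCR output into a manually verified transcription, and to express that count against the number of characters in the original. The sketch below assumes that accuracy is reported as one minus the edit count divided by the character count; the function names and the sample strings are hypothetical, and the edit count is computed with the standard Levenshtein distance.

```python
def edit_distance(reference, ocr_output):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn ocr_output into reference (Levenshtein)."""
    previous = list(range(len(ocr_output) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr_output, start=1):
            substitution_cost = 0 if ref_char == ocr_char else 1
            current.append(min(
                previous[j] + 1,                      # character missing from OCR output
                current[j - 1] + 1,                   # extra character in OCR output
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Accuracy as a fraction, assuming accuracy = 1 - edits / characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: one substituted character and one dropped character.
truth = "National Library of Canada"
ocr = "Nationa1 Libary of Canada"
print(f"{character_accuracy(truth, ocr):.1%}")  # 92.3%
```

A sample measured this way can be compared against whatever threshold the project adopts before committing a whole collection to OCR rather than rekeying.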
4.1 Factors Affecting OCR Accuracy
An accuracy rate exceeding 98% is often cited as necessary for document conversion (OCR) to be more efficient than rekeying. The accuracy rate is determined by the number of edits required (insertions, deletions, substitutions) expressed as a percentage of the number of characters in the image. High accuracy rates have proven perennially difficult to achieve for certain types of library material including catalogue cards, multi-language texts, and historical items with faded type or unusual fonts. Accuracy can be affected by a number of factors:

4.2 Factors Affecting Workflow Throughput
During scanning:

5. Conclusion
While OCR reliability is continually improving, and prices for all ranges of OCR software are decreasing, it is important to evaluate materials for OCR case by case. Each material type, and each instance of a given material type, has its own set of idiosyncratic problems for OCR. A single factor, such as a tight binding, can potentially eliminate scanning/OCR as a digitization option. On the other hand, the effect of many of the OCR-hampering factors cited above can be minimized by ensuring appropriate system set-up, such as loading appropriate language dictionaries, adjusting brightness settings, dpi, etc., and by choosing appropriate originals and employing the hardware necessary to ensure optimum throughput (e.g., document feeders). Recent in-house experimentation suggested that OCR can likely be performed cost-effectively on clear English, French or bilingual print documents, including some material presented in columns, employing a variety of standard fonts, or printed on poor-quality paper. Nevertheless, documents that present any one of the factors noted above should be carefully analyzed to ensure that OCR is the appropriate and cost-effective digitization solution.

1 "Digitization" can also refer to the process of transforming analog audio or video recordings into digital recordings.
2 However, a French-language dictionary will not be included in Adobe Capture until its next release in early 1997.
3 Nartker, Thomas A., Stephen V. Rice, and Frank R. Jenkins, "OCR Accuracy: UNLV's 4th Annual Test", Inform, Vol. 9, no. 7, July 1995, p. 42-45.
4 More recent technology is Intelligent Character Recognition, which employs sophisticated Artificial Intelligence-based recognition algorithms that can "learn" to recognize non-standard fonts and character styles. This technology is achieving increasing success but is focused on handprint recognition applications such as forms digitization. Handwriting recognition remains largely unsuccessful.