An Introduction to Digitization Technologies and Issues

by Terry Kuny

Network Notes #14
ISSN 1201-4338
Information Technology Services
National Library of Canada

October 1, 1995

Starting Points

Digitization is one of the hot topics in librarianship today. To build a "digital library" requires that the content of a collection be available electronically. The rhetoric of the Information Highway has provided the impetus to convert many existing paper-based (or sound and video) collections into new digital media. The assumption is that digital collections will be more accessible to a broader range of users, presumably through networking technologies, and that there are new efficiencies to be gained in resource-sharing and for preservation.

It should be noted that libraries have already undertaken some of the most significant digitization initiatives in the past 25 years -- most notably, the creation of searchable OPACs. There are also many digitization projects and initiatives which provide digital information in a variety of formats, such as CD-ROM and locally-mounted databases. There are many private digital collections already being maintained by DIALOG, Mead Data, and Micromedia.

This background document examines some of the basic digitization concepts and issues. The focus will be on the digitization of paper-based artifacts and associated data-capture technologies.

Defining Digitization

Digitization refers to the process of translating a piece of information such as a book, sound recording, picture or video, into bits. Bits are the fundamental units of information in a computer system. Turning information into these binary digits is called digitization. This digitization process can be accomplished through a variety of existing technologies.
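
As a small illustration of what "translating into bits" means for text, the following sketch converts a short string into binary digits and back again. It assumes an ASCII encoding, which is a choice made for this example only; the function names are hypothetical.

```python
def text_to_bits(text: str, encoding: str = "ascii") -> str:
    """Return the bits of `text` as groups of 8 binary digits, one group per byte."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


def bits_to_text(bits: str, encoding: str = "ascii") -> str:
    """Reverse text_to_bits: rebuild the original text from the bit groups."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


if __name__ == "__main__":
    bits = text_to_bits("Book")
    print(bits)                # four groups of eight bits, one per character
    print(bits_to_text(bits))  # the original word, recovered from the bits
```

Images, sound and video are reduced to bits in the same spirit, although the encodings involved are considerably more elaborate.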

Technologies

In general, many of the technological challenges to the digitization of paper-based materials have been met. Microform digitization still poses a number of challenges. Digitization technologies for library-based applications, rather than business applications, are still relatively new and much of the hardware and software should be considered as first generation products. [Bro 93] In all cases, the effective retrieval of digitized data remains a significant challenge.

The Digitization Process

The concepts and technologies connected with digitization are complex. There is a basic process which involves different sets of hardware and software technologies at each step. Determining the appropriate technology is directly linked to the anticipated use and purpose of the material being digitized.

Document -> Data Capture -> Data Processing -> Storage -> Retrieval and Display

Documents: examples include text, line art, photographs, colour images
Data Capture: manual data entry (word processing), optical character recognition (OCR) or imaging
Data Processing: text may require conversion of diacritics or special characters; images may need enhancement, amplification or compression
Storage: examples include hard disk, magnetic tape or optical CD-ROM
Retrieval/Display: a myriad of technologies for viewing and displaying, including concerns about network deliverability

The widespread use of word processors and desktop publishing software means that virtually all text creation is now digital. The challenges arise if the text is in a non-digital medium such as paper or microform. This paper deals primarily with the technologies of converting text or images into a digital format. The technologies and issues pertaining to the digitization of media such as sound or video are similar to those for the digitization of text or images.

Data Capture -- Manual Data Entry

The simplest method of converting an image of a page (or the real page of text) into digital text is to enter it manually. Manual entry is unnecessary if the original electronic text is available. However, in most digitization exercises, the original document exists only on paper or as an image and is not in a computer-accessible format.

Manually entered digital text has the advantage of greater accuracy, particularly for some types of data (directories, numerical datasets) that are not amenable to automated means of digitization. However, manual data entry is time-consuming and labour-intensive -- and very expensive.

Data Capture -- The Scanning Process

The scanning process uses hardware similar to photocopiers (scanners) to take digital pictures of objects. Scanners can be simple desktop machines or very large and complex systems that process thousands of documents. The physical form of the object can have a great impact on the type of scanning equipment that can be used. Many of the current scanning systems have been designed for business applications where documents are often single sheets or within a small range of sizes which makes them amenable for automatic scanning. The fragility, odd sizes, and bound volumes of some library materials pose greater difficulties for scanning.

Optical Character Recognition (OCR)

Another simple digitization process is that of scanning printed pages to build a digital database of text. This process uses OCR (Optical Character Recognition) software which takes a picture of the page and then turns it into digital text, which can be edited or fully indexed.

OCR software must distinguish between black and white areas of text. A problem that frequently arises is with texts that have variable or little contrast. Similarly, OCR software can often have difficulty accurately digitizing text with typographic variation, unusual or foreign characters, or antiquated forms of lettering. Historic documents, newspapers and manuscripts have frequently proven impervious to effective OCR scanning.

As a result, OCR accuracy is typically only 95% to 98%. This means that between 2% and 5% of the conversion of pictures of words into text will be inaccurate. If textual accuracy is required, OCR processed texts must be manually and closely edited, increasing the expense of the digitization process considerably. If the text is inaccurate, then any indexes that are built using the text will also be inaccurate.
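
To give a sense of scale, the sketch below works through the arithmetic for a single page. It rests on assumptions that are not in the figures above: that accuracy is measured per character, that a page holds 2,000 characters (a hypothetical round number), and that errors fall independently of one another. Under those assumptions a page carries roughly 40 to 100 errors, and the chance that a six-letter word comes through entirely intact is well below the per-character accuracy.

```python
def expected_errors_per_page(chars_per_page: int, accuracy: float) -> float:
    """Expected number of wrong characters on a page at a given per-character accuracy."""
    return chars_per_page * (1.0 - accuracy)


def word_correct_probability(word_length: int, char_accuracy: float) -> float:
    """Probability that every character of a word is right, assuming independent errors."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    CHARS_PER_PAGE = 2000  # hypothetical page size, for illustration only
    for accuracy in (0.95, 0.98):
        errors = expected_errors_per_page(CHARS_PER_PAGE, accuracy)
        intact = word_correct_probability(6, accuracy)
        print(f"accuracy {accuracy:.0%}: about {errors:.0f} errors per page, "
              f"six-letter word intact with probability {intact:.2f}")
```

Since an index is built from words rather than characters, the second figure is the one that matters most for retrieval.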

Excalibur Technologies and Pattern Recognition Technologies

The next generation of OCR is represented by PixTex, a product being developed by Excalibur Technologies. This software uses a technology called Adaptive Pattern Recognition, which attempts to mimic aspects of the neural patterns of the brain. The software can be taught to recognize variations and relationships in pattern, such as patterns of text rather than readable text. The retrieval of search terms uses what Excalibur calls "fuzzy matching".
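
The general idea of fuzzy matching can be sketched with ordinary string similarity. The example below is not Excalibur's method, which is proprietary and pattern-based; it substitutes the similarity ratio from Python's standard difflib module, and the sample tokens and the 0.8 cutoff are hypothetical. It shows only how a search term can still find words that were recognized slightly wrongly.

```python
import difflib


def fuzzy_find(term: str, tokens: list[str], cutoff: float = 0.8) -> list[str]:
    """Return the tokens whose similarity to `term` is at least `cutoff` (0.0 to 1.0)."""
    # n caps the number of results; allow every token to be returned if it qualifies.
    return difflib.get_close_matches(term, tokens, n=max(1, len(tokens)), cutoff=cutoff)


if __name__ == "__main__":
    # Hypothetical OCR output: three misread forms of "library" and two unrelated words.
    ocr_tokens = ["1ibrary", "librarv", "lihrary", "binary", "digital"]
    print(fuzzy_find("library", ocr_tokens))
```

An exact-match search for "library" would find none of these tokens; the fuzzy search recovers the three misread forms while leaving out the unrelated words. A lower cutoff finds more variants at the cost of more false matches.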

Some capabilities of this technology are being tested at the British Library, but preliminary results seem to indicate that the technology met its Waterloo when trying to deal with variations in typefaces, poorly printed documents, or documents where the contrast between text and paper is diminished due to age -- the same problems faced by current OCR software. [BL 94]

Document Imaging

A simple method of capturing text is to take an electronic picture of each page of text with the same type of scanner as one would use for OCR. However, the difference is that the images are stored as graphics files rather than text files. A similar technology is used for fax transmission. Each page is stored as one picture. The text on the page cannot be edited or indexed.

Many digitization projects are of this nature and there are some advantages to this approach. It is very good for such image-based collections as photographic collections or art gallery catalogues. A frequently-used example of this type of digitization project is the Library of Congress American Memory Project. There are, unfortunately, some significant drawbacks to storing large texts as images. There is a page problem because each image must be loaded and viewed separately. Only the bibliographic record of the document (not the actual text) is searchable. The text cannot be edited. The images can be large, there may be many images to load, and they may be difficult to access even on fast networks. The display quality, whether on-screen or in print versions, can often be poor. The page problem and display issue are well-illustrated in some of the following digitization projects:

Project Open Book

Sample Dissertations (Cornell University)
http://oitnext.cit.cornell.edu:80/

Tobacco Control Archives
http://galen.library.ucsf.edu:80/tobacco/

Some of the important technological decisions to be made before building document image archives include:

Display Technology: Can refer to either print or screen. What is appropriate for display on a computer screen is not always the best quality for printing. If network access is a concern, determining which is likely to be the preferred viewing format may be important.
Dynamic Range: If colour or contrast is important, the number of colours or shades of gray to be used must be determined. Larger numbers of colours or grays increase the size of the image. Dynamic range is usually indicated in bits, such as 8-bit (256 colours) or 24-bit (16 million colours).
Resolution of Image: Refers to the clarity or amount of detail available in an image. Resolution is usually measured in picture elements (pixels) per unit of length, such as dots per inch (dpi).
Compression: The size of image files can be reduced by a variety of methods. Common compression methods include CCITT Group III or Group IV (used by most fax machines), JPEG, JBIG, and LZW. Compression is important in managing the size of image collections. Methods which remove information from an image are called lossy compression; JPEG is an example. Lossless compression yields, on decompression, an image identical to the one that existed before compression.
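
The way these decisions interact can be sketched with some simple arithmetic. The example below estimates the uncompressed size of one scanned page from its dimensions, resolution and dynamic range, and then demonstrates the defining property of lossless compression. The page size (8.5 by 11 inches) and the 300 dpi resolution are assumptions chosen for illustration, and zlib (the DEFLATE method) stands in for the compression methods named above because it ships with Python; it is not one of them.

```python
import zlib


def colours(bit_depth: int) -> int:
    """Number of distinct colours or gray levels available at a given bit depth."""
    return 2 ** bit_depth


def uncompressed_bytes(width_in: float, height_in: float, dpi: int, bit_depth: int) -> int:
    """Approximate size in bytes of an uncompressed scan of one page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bit_depth + 7) // 8  # round up to whole bytes


def lossless_round_trip(data: bytes) -> tuple[int, bool]:
    """Compress and decompress `data`; return the compressed size and whether the copy is identical."""
    compressed = zlib.compress(data)
    return len(compressed), zlib.decompress(compressed) == data


if __name__ == "__main__":
    for depth in (1, 8, 24):
        size = uncompressed_bytes(8.5, 11, 300, depth)
        print(f"{depth}-bit ({colours(depth)} levels): {size:,} bytes per page")

    # A run of identical bytes, like the white space on a page, compresses very well.
    row = b"\xff" * 1000 + b"\x00" * 20
    size, identical = lossless_round_trip(row)
    print(f"{len(row)} bytes -> {size} bytes, identical after decompression: {identical}")
```

Moving from 1-bit to 24-bit capture multiplies the storage required by 24, which is why dynamic range and resolution should be settled before scanning begins.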

Access Concerns

Despite the current interest in "being digital", there is still considerable complexity involved with providing access to digital data. Being digital does not mean being accessible.

Standards

There are national and international standards for image-file formats and compression methods. These exist to ensure that data will be interchangeable among systems. However, there are also many proprietary methods of capturing data. There are few widely-accepted standards governing how data should be stored or accessed.

There are many different standards for capturing, retrieving and displaying digital text and electronic documents. Documents captured and stored in one format may not be available in the future. Translation software that converts data from one format into another may be available. However, this solution is, at best, a makeshift one as the software may not keep up with changes in formats, might have difficulties interacting with proprietary formats, may provide inadequate display results, and will involve substantive conversion costs in terms of software, hardware and labour. These are particular concerns when considering the digitization of large document collections.

Digital information often requires a confluence of hardware and software technologies which are essential to the storage, retrieval and display of the data. These technologies are changing quite rapidly. For example, digital information stored on obsolete media, such as 8-track tape or punch cards, is for all intents and purposes lost, because it depends on hardware platforms which are no longer supported, or requires retrieval software that is outdated.

Information Retrieval

Effective information retrieval depends upon being able to access an accurate, indexed database. Some of the challenges include:

inadequate indexing terms
inappropriate or inaccurate indexing
difficulties in maintaining timeliness of indexing
challenges of reindexing large document collections that may change over time.

The indexes used to access the document collection can also impose significant system costs. Depending on the type of indexing, the index database can be anywhere from 50% to 200% larger than the original data collection.
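
The dependence of retrieval on accurate text can be illustrated with a very small index. The sketch below builds a simple word-to-page index of the kind a full-text system might use; it is a minimal illustration, not a description of any particular product, and the sample pages (including the misread word "1ibrary") are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each lower-cased word to the set of page numbers on which it appears."""
    index: dict[str, set[int]] = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return dict(index)


if __name__ == "__main__":
    pages = {
        1: "The library holds maps",
        2: "The 1ibrary holds books",  # hypothetical OCR error: "l" read as "1"
    }
    index = build_index(pages)
    print(sorted(index.get("library", set())))  # page 2 is missing from the result
    print(sorted(index.get("holds", set())))
```

A search for "library" finds only the first page, although the word appears on both. Note also that even this tiny index stores every distinct word together with its locations, which hints at why index databases can rival the collection itself in size.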

Usability

Recent projects illustrate the diversity of content that can be digitized. However, there is some reason to be skeptical of claims about the superiority or usefulness of digital collections, as little comparative research about the use of different media has been undertaken. Defining the intended users and uses of a digital collection remains an important task for planners of digitization initiatives.

A case in point is the well-known American Memory Project, which has been widely hailed as a model by all those involved with its development. However, user studies of the actual collection have been less enthusiastic, and the reasons should be examined closely before similar mistakes are made. All the signals about public use and expectations of digitized materials indicate that a clear requirements analysis should be undertaken.

When pages of text are stored as images, the content is meaningful to people but not to computers. The image of a page cannot be indexed unless it is turned back into digital text. This is an OCR-based or manual data-entry process which is very expensive. Consequently, most digital collections are pictures of pages -- fundamentally no different from microfilm or microfiche. These image collections must use indexes, often the same as those for microform collections, to provide access to the electronic collection.

In addition, documents which are imaged may have great variation in display and print resolution. High-quality images incur high costs in terms of storage. In many cases, the document images are unrealistically large for practical research across a network.
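
The network cost of large images can be estimated with simple arithmetic. The sketch below computes the time needed to move a file across a link of a given speed. The figures used are assumptions for illustration only: a 28,800 bits-per-second dial-up link, a page image of about one megabyte, and a page of plain text of about 2,000 bytes. Protocol overhead and compression are ignored.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Idealized time to send `size_bytes` over a link, ignoring overhead and compression."""
    return size_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    MODEM_BPS = 28_800            # assumed dial-up speed, for illustration
    PAGE_IMAGE_BYTES = 1_051_875  # assumed uncompressed 1-bit page scan
    PAGE_TEXT_BYTES = 2_000       # assumed plain-text page

    image_time = transfer_seconds(PAGE_IMAGE_BYTES, MODEM_BPS)
    text_time = transfer_seconds(PAGE_TEXT_BYTES, MODEM_BPS)
    print(f"page image: about {image_time / 60:.1f} minutes")
    print(f"page text:  about {text_time:.1f} seconds")
```

Under these assumptions a single page image takes minutes to arrive while the same page as text takes under a second, and a researcher browsing a long document pays that cost for every page viewed.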

Complex Technologies, Complex Uses

The choices of digitization technologies and data formats depend upon the intended audience and how the digitized information is to be used. For example, intended use will help determine the dynamic range and resolution of the digitization process. If it is important that the text be effectively accessible as a searchable archive or database, this will have an impact on which combination of technologies will be appropriate. An example of this type of information is a telephone directory. Telephone directories are generally not suitable for image-based archives since the entries on each page cannot be indexed. Consequently, either manual data entry or edited OCR is necessary.

Where users are located may determine the types of technologies to be used. Internet users will not tolerate large image collections but may use text available as ASCII or hypertext markup (HTML). At this point, the Internet does not adequately support the requirements raised by large image archives.

Conclusion

This backgrounder only touches on some of the many technologies and issues involved in the digitization process. Many of the technologies are new. Vendors and products are changing rapidly. Some material that was digitized in the recent past is already unavailable due to changing formats, obsolete hardware and software, and the expense of data conversion. Many materials that are being digitized in libraries and organizations around the world have very specific audiences and locations.

At this time, there are no standards, "best" practices, or exemplars to support the belief that digitization can be relied upon as a preservation medium.

"Notwithstanding our enthusiasm for what digital library services promise, we feel that glib calls to replace conventional publication entirely must be regarded skeptically. Preserving the cultural heritage ... has been better served by paper than digital means currently promise, and there is little funded work towards remedying this."
-- Gladney, Henry M. et al. Digital Library: Gross Structure and Requirements (Report from a Workshop). IBM Research Report RJ 9840, May 1994.

There exists the real possibility that, in the rush to digitize collections, access to the digital information may be more restrictive than its analog, paper- or microform-based equivalents. An example would be a digital collection accessible only through specialized workstations using proprietary software. A collection of digitized microforms that did not substantially improve the indexing of the contents would be functionally equivalent to the microfilm, less transportable and less useful as a preservation medium.

Although digitization is an important first step in making materials available, it should be ascertained that the need for digitization exists within a user community and that the digitization efforts will actually be able to serve that community. Surveying existing technologies and practices of digitization can only lead to the conclusion that prudence and a certain amount of conservatism in choosing projects and technologies should be encouraged.

Notes

[Bes 95]. Howard Besser and Jennifer Trant, Introduction to Imaging: Issues in Constructing an Image Database. The Getty Art History Information Program Imaging Initiative. http://www.ahip.getty.edu/intro_imaging/

[BL 94]. British Library. Initiatives for Access News. December 1994, Issue 2.

[Bro 93]. Roger Broadhurst. "The digitisation of library material." Information Management and Technology. May 1993, 29(3): 128-132.

[Cul 92]. J. Culshaw, "American Memory: Taking the Library of Congress to the Masses." CD-ROM Librarian. October 1992, 7(9): 14-21.

[Gla 94]. Henry M. Gladney, et al. "Digital Library: Gross Structure and Requirements (Report from a Workshop)." IBM Research Report. RJ 9840, May 1994.

[Pol 92]. J. A. Polly and E. Lyon, "Out of the Archives and into the Streets: American Memory in American Libraries." Online. September 1992, 16(5): 51-57.

[Wat 92]. Donald Waters. Electronic Technologies and Preservation. Commission on Preservation and Access, June 1992.

Internet Sources of Information

Text Archives

A number of text archives and digital text projects are being made available through the Internet. It is illustrative to look at the variation in how they are being developed and at the technologies being used. Examples of some projects have already been given. Other interesting large-scale text digitization initiatives include:

Project Gutenberg
URL: http://jg.cso.uiuc.edu/PG/welcome.html

Project Bartleby: The Public Library of the Internet
URL: http://www.cc.columbia.edu/acis/bartleby/

Image Collections

Clearinghouse of Image Databases
URL: http://dizzy.library.arizona.edu/images/clearinghouse.html

Mailing list

IMAGELIB mailing list
imagelib@arizvm1.ccit.arizona.edu

Organizations

Commission on Preservation and Access
URL: http://www.cpa.org

Task Force on Digital Archiving
URL: http://www.oclc.org:5046/~weibel/archtf.html

General Resources

IFLANET Digital Libraries Research and Projects
URL: http://www.nlc-bnc.ca/ifla/services/diglib.htm

Introduction to Imaging: Issues in Constructing an Image Database
URL: http://www.ahip.getty.edu/intro_imaging/

Digitization - the issues, projects and technology: a selective bibliography
URL: http://www.nlc-bnc.ca/services/dig-bib.htm