The Text Encoding Initiative (TEI)

by Sheila Comeau, consultant
Network Notes #59
ISSN 1201-4338
Information Technology Services
National Library of Canada

January 6, 1999

1.0 Introduction

In 1987, a group of humanities scholars convened at Vassar College to identify a common, non-proprietary electronic text encoding scheme which could be consistently applied to scholarly texts. This encoding standard would facilitate machine interpretation of humanities source materials such as verse, letters, and dictionaries. The Vassar group identified certain guiding principles for the scheme, namely that it should represent the textual features needed for research, allow for sophisticated and efficient text processing, be clear and easy for researchers to use, and be compliant with existing or emergent standards where appropriate.

In May 1994, the first official set of guidelines, which formed the basis of the Text Encoding Initiative (TEI) standard, was published. The TEI Guidelines provided an extensible and standardized framework for the preparation of textual materials in an electronic form that could be interchanged across multiple platforms, applications, and networks. The standard was sponsored by several groups, including the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC), and the Social Sciences and Humanities Research Council of Canada.

2.0 TEI Fundamentals

The rules and recommendations of the TEI encoding scheme are based on SGML (Standard Generalized Markup Language). SGML is often described as the "grammar" imposed on the markup of a common set of documents. Every SGML document follows rules that govern which elements can be tagged in the document and how those elements relate to one another. These rules are written using the SGML "grammar" and form what is called an SGML Document Type Definition (DTD).

The DTD serves as a template that defines the content and structure of the elements appearing in a particular set of similarly structured SGML documents. Many DTDs have been written to handle different types of documents (e.g. manuals, memos, minutes, archival records). The TEI Guidelines define the TEI DTD using various sets of elements and their corresponding markup tags, beginning with the core, or common, tagset, which can be applied to most documents. The Guidelines also provide separate element sets for different document types (e.g. verse, prose, drama, dictionaries). These are identified in the base tagsets. The TEI DTD is thus made up of the core tagset and an identified base tagset, and can also include additional or auxiliary tagsets.
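The kind of rule a DTD expresses can be illustrated with a small sketch. The content model below is hypothetical and heavily simplified; it is not the TEI DTD, and the sample is written as well-formed XML so that Python's standard XML parser can stand in for an SGML parser. The names ALLOWED_CHILDREN and find_violations are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified content model: parent element -> permitted children.
# This only illustrates the sort of rule a DTD states; it is NOT the TEI DTD.
ALLOWED_CHILDREN = {
    "text": {"body"},
    "body": {"div"},
    "div": {"head", "p", "lg"},
    "lg": {"l"},
    "head": set(),
    "p": set(),
    "l": set(),
}


def find_violations(element):
    """Return (parent, child) pairs that the toy content model does not allow."""
    violations = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            violations.append((element.tag, child.tag))
        # Check the child's own contents as well.
        violations.extend(find_violations(child))
    return violations


# A paragraph placed inside a line group breaks the toy rules above.
SAMPLE = """<text><body><div>
  <head>Song</head>
  <lg><l>First line of verse</l><p>Misplaced paragraph</p></lg>
</div></body></text>"""

violations = find_violations(ET.fromstring(SAMPLE))
print(violations)
```

A real SGML parser does considerably more (element order, required elements, attributes), but the principle of checking a document against declared structural rules is the same.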

3.0 The TEI Header

The TEI Header makes up part of the core tagset of the DTD. The Header provides bibliographic data about the TEI document, including the file description (title, statement of responsibility, source), encoding description (approaches used in transcribing or encoding the text, description of the encoding project), text profile (information relating to the document subject matter, classification, and languages contained in the document) and a document revision history.
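As a rough illustration of the four parts described above, the following sketch reads a sample header and summarizes each part. The element names teiHeader, fileDesc, encodingDesc, profileDesc and revisionDesc come from the TEI Guidelines; the sample content is invented, the inner markup is simplified, and the header is written as well-formed XML (an assumption) so that Python's standard parser can read it. The function summarize_header is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample header written as well-formed XML.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Sample Poem: an electronic transcription</title>
      <author>Jane Example</author>
    </titleStmt>
    <publicationStmt><p>Distributed by a hypothetical library, 1999.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a hypothetical printed edition.</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Line breaks of the source are not preserved.</p></encodingDesc>
  <profileDesc><langUsage><language>English</language></langUsage></profileDesc>
  <revisionDesc><change>1999-01-06: file created.</change></revisionDesc>
</teiHeader>"""

HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")


def summarize_header(header_xml):
    """Map each major header part to its text content (None if the part is absent)."""
    header = ET.fromstring(header_xml)
    summary = {}
    for part in HEADER_PARTS:
        node = header.find(part)
        if node is None:
            summary[part] = None
        else:
            # Collect all nested text and collapse runs of whitespace.
            summary[part] = " ".join("".join(node.itertext()).split())
    return summary


summary = summarize_header(SAMPLE_HEADER)
for part, text in summary.items():
    print(part, "->", text)
```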

4.0 TEI vs. TEI Lite vs. Bare Bones TEI

The TEI Guidelines represent some 1,300 pages of tagging options for a variety of document types. For some, this proves a rather daunting introduction to electronic text encoding. A simplified version of the standard, called TEI Lite, was developed to provide a more manageable subset of the extensive set of SGML elements in the full scheme. TEI Lite includes most of the TEI "core" tagset, handles a wide variety of texts, is usable with a wide range of existing SGML software, and is derived from the full TEI DTD using the extension mechanisms described in the TEI Guidelines. TEI Lite has been adopted by the Oxford Text Archive, the Electronic Text Center at the University of Virginia, and the University of Michigan.

An even smaller subset of the full scheme, called Bare Bones TEI, was published in August 1994. It includes about the same number of tags as in the original version of HTML, and is considered too limited for any serious encoding efforts. Bare Bones TEI was developed primarily as a learning tool.

5.0 Delivering TEI Documents

To be viewed in their native SGML format, TEI documents must be opened using an SGML browser or viewer. The SGML viewer interprets the markup following the parameters of the TEI DTD and other support files that outline formatting and content rules. A standard Web browser can be configured to launch the SGML viewer when a TEI file is selected on a Web site.

Because many end-users are not equipped with SGML viewer software, several TEI projects provide an HTML version of their SGML TEI documents. Although a significant amount of textual richness and flexibility is lost in the translation from SGML to HTML, many of the "value-added" features of the SGML markup, such as sophisticated indexing, bibliographic control, field-level search and retrieval, and document interchange, can still be harnessed by backend systems.
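A conversion of this kind can be sketched as a simple mapping from TEI element names to HTML element names. The mapping below is an assumption made for illustration, not a recommended or standard one, and the input is again written as well-formed XML so that Python's standard parser can read it. The names TAG_MAP and tei_to_html are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI element names to HTML element names.
TAG_MAP = {
    "div": "div",
    "head": "h2",
    "p": "p",
    "lg": "blockquote",  # line group (stanza)
    "l": "div",          # verse line
    "hi": "em",          # highlighted phrase
}


def tei_to_html(element):
    """Build an HTML element tree from a TEI element tree.

    The original element name is kept in a class attribute; every other
    TEI attribute is dropped, which is where encoded detail is lost.
    """
    html = ET.Element(TAG_MAP.get(element.tag, "span"), {"class": "tei-" + element.tag})
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(tei_to_html(child))
    return html


SAMPLE = ('<div type="poem"><head>Song</head>'
          '<lg><l>First line</l><l>Second <hi>line</hi></l></lg></div>')

html_text = ET.tostring(tei_to_html(ET.fromstring(SAMPLE)), encoding="unicode", method="html")
print(html_text)
```

In this sketch the type attribute on the poem's div disappears from the HTML output, while the SGML source retains it for indexing and retrieval on the backend.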

6.0 Investing in TEI

Undertaking the development of a TEI collection involves substantial investment. An infrastructure of SGML tools is required to support a TEI project, including utilities for digitization, encoding, parsing, converting, indexing, and search and retrieval. Various freeware, shareware, and commercially supported tools are available, and it is likely that a mosaic of programs will be required to prepare and publish the SGML collection. New hardware, such as scanning technologies or dedicated servers, may also be required. Implied in the investment in new tools is an investment in staff training. Workshops and written documentation on text preparation, publication procedures, preservation issues, and workflow modifications must be developed and kept up to date.

Dedicated SGML search and retrieval systems may require integration with existing institutional resources, such as Web sites and online catalogues, in order to improve the accessibility and visibility of the collection. Issues relating to indexing and the use of controlled access terms across multiple systems should be examined to streamline search and retrieval operations wherever possible. Automated mechanisms may be developed to generate MARC records from TEI headers for inclusion in the OPAC.
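A mechanism for deriving catalogue data from a header might look something like the following sketch. It assumes a header written as well-formed XML, maps only the author and title to MARC tags 100 (main entry, personal name) and 245 (title statement), and returns simple tag and value pairs rather than a true MARC record with a leader, indicators, and full subfield coding. The names FIELD_PATHS and header_to_marc_fields, and the sample data, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: MARC tag -> location of the value within the TEI header.
FIELD_PATHS = {
    "100": "fileDesc/titleStmt/author",  # main entry, personal name
    "245": "fileDesc/titleStmt/title",   # title statement
}


def header_to_marc_fields(header_xml):
    """Return (MARC tag, value) pairs for the header fields that are present."""
    header = ET.fromstring(header_xml)
    fields = []
    for tag, path in sorted(FIELD_PATHS.items()):
        node = header.find(path)
        if node is not None and node.text:
            # "$a" marks the first subfield in this simplified display form.
            fields.append((tag, "$a " + node.text.strip()))
    return fields


# Invented sample header for illustration.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Sample Poem</title>
      <author>Jane Example</author>
    </titleStmt>
  </fileDesc>
</teiHeader>"""

fields = header_to_marc_fields(SAMPLE_HEADER)
for tag, value in fields:
    print(tag, value)
```

A production mapping would need local cataloguing decisions on which header elements feed which MARC fields, so the table above should be read only as a starting point.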

Policies should be established regarding file maintenance (documenting errors, updates, and revisions). New end-user requirements regarding the printing, saving and sending of TEI files may require investigation. A multitude of other local issues may emerge which require capital resources, training, systems development, or workflow and policy modifications.

A TEI project should not be undertaken without a thorough analysis of the target users’ requirements. Research may reveal that a majority of electronic textual research requirements can be met with less sophisticated digitization efforts. The cost involved in developing an SGML infrastructure makes TEI endeavours unsuitable for special one-time funding sources. As with any SGML project, successful TEI encoding initiatives depend on sustained funding arrangements.

7.0 Conclusion

Without the careful resolution of integration and support issues, a TEI investment can easily become an electronic white elephant. The undertaking involves a high level of commitment in a variety of service and resource areas, and there should be a clear understanding of the benefits and costs involved. Ongoing investment is required to maintain, improve and promote the service for library clientele.

8.0 More Information

TEI Homepage
http://www-tei.uic.edu/orgs/tei/

IFLA Metadata page
http://www.ifla.org/II/metadata.htm#tei