A Glossary of Digital Library Standards, Protocols and Formats
by Susan Haigh
Network Notes #54
ISSN 1201-4338
Information Technology Services
National Library of Canada
May 6, 1998
Introduction
This document provides a structured overview of over 90 selected technology-based standards, protocols, and document formats that are pertinent to digital library activities. Each entry provides the acronym, full name and a succinct description or commentary as appropriate. While most library staff, even those directly involved in digital library initiatives, do not need to know the details of all these standards, one hears or reads about them frequently. This primer is intended as a quick reference tool for library staff who have little time, and perhaps even less interest, to learn the technical details of technology-based standards and protocols pertinent to the library world.

Some Initial Definitions
Standards are sets of rules or specifications for the design or operation of a computing device. There are proprietary standards, which are developed and promulgated by companies in the hope of assuring or increasing their market share, and open standards, which are published and available for use by anyone. Either type may become a de facto standard, a set of rules or specifications that comes into such widespread use in the marketplace that it becomes normative; or a de jure standard, a standard given the endorsement of an official standards body such as the International Organization for Standardization (ISO).
Protocols are standard sets of rules that govern network communications functions by describing both the format that a message must take and the way in which messages are exchanged between computers.
Formats are the various ways in which information is stored. A file format specifies how the data is encoded, together with any information about the data (e.g. structure, layout, compression algorithm). Hundreds of different file formats exist, but only a few are essential to digital library activities.

Some Caveats
ENABLING INTERNET & COMMUNICATION STANDARDS

OSI: Open Systems Interconnection
Has been gradually eclipsed by TCP/IP, a comparable suite of protocols. OSI-based standards (e.g. X.400 mail, X.500 directories, Z39.50, ILL protocol) now run over TCP/IP networks. The OSI architecture is split into seven layers, each of which uses the layer immediately below it and provides a service to the layer above. Outgoing data goes down the stack, across the network, then up the stack at its destination. The layers are, from lowest to highest: physical, data link, network, transport, session, presentation and application.
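The way data travels down one stack and up the other can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the layer names are the conventional ones for the OSI reference model, and the bracketed "envelope" notation and the function names `send` and `receive` are invented here for illustration, not part of any standard.

```python
# Conventional names of the seven OSI layers, lowest first.
OSI_LAYERS = [
    "physical",
    "data link",
    "network",
    "transport",
    "session",
    "presentation",
    "application",
]


def send(payload: str) -> str:
    """Pass data down the stack: each layer wraps what the layer above gave it."""
    frame = payload
    for layer in reversed(OSI_LAYERS):  # application first, physical last
        frame = f"[{layer}|{frame}]"    # hypothetical envelope notation
    return frame


def receive(frame: str) -> str:
    """Pass data up the stack: each layer removes its own envelope."""
    for layer in OSI_LAYERS:            # physical first, application last
        prefix = f"[{layer}|"
        if not (frame.startswith(prefix) and frame.endswith("]")):
            raise ValueError(f"malformed frame at the {layer} layer")
        frame = frame[len(prefix):-1]
    return frame


if __name__ == "__main__":
    wire = send("search request")
    print(wire)           # outermost envelope belongs to the physical layer
    print(receive(wire))  # the original payload comes back out
```

Real protocol stacks add binary headers and trailers rather than text brackets, but the nesting order is the point of the sketch: the lowest layer's envelope is outermost on the wire.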
TCP/IP: Transmission Control Protocol over Internet Protocol
Specifically denotes the network and transport layer protocols within the Internet 5-layer protocol stack, although the term is often used to refer to the entire stack. The Internet protocol suite generally corresponds to the datalink, network, transport, presentation and application layers as described above. The Internet non-proprietary suite of protocols allows communications among more hosts and more types of data formats than any other protocol suite.

IPv6: Internet Protocol, version 6 (also IPng, "I-Ping", for Internet Protocol, next generation)
IPv6 is a new system, currently under development, to be used for assigning Internet Protocol addresses in the future. A consensus of the Internet Engineering Task Force (IETF) has determined that IPv6 will be the next generation system for IP addressing. IPv6 addresses, which consist of eight sections of 16-bit numbers (128 bits in all), will eventually replace the current Internet Protocol addressing scheme, known as IPv4, whose addresses consist of four sections of 8-bit numbers (32 bits in all).

SMTP: Simple Mail Transfer Protocol
A TCP/IP application protocol that is the most widely employed e-mail standard. Does not support transfer of non-text messages or message parts such as images, audio or video, nor word processing or spreadsheet files, because their proprietary format coding is non-text. Supports only the standard ASCII character set (i.e. no diacritics).

MIME: Multipurpose Internet Mail Extensions
Adds multi-part (e.g. WWW hypertext documents, word processing file attachments, etc.) and multi-media (non-text such as audio, graphics) messaging support to SMTP. Also supports message encryption.

X.400: Message Oriented Text Interchange System (ISO 10021/CCITT X.400 1988)
Supports encryption, so is more secure than SMTP without MIME, and supports non-text message parts. The only protocol from the OSI suite currently in common use, mainly in governments.
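To make the SMTP and MIME entries above more concrete, here is a minimal Python sketch using the standard library's `email` package. It builds a message whose text body contains diacritics and whose second part is a binary attachment, then shows that MIME encodes both so that the message as transmitted is plain ASCII. The addresses, subject line, file name and the helper name `build_message` are all assumptions made for illustration, and the "image" is placeholder bytes rather than a real GIF.

```python
from email.message import EmailMessage


def build_message(body: str, image_bytes: bytes) -> EmailMessage:
    """Build a two-part MIME message: a text body plus a binary attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"   # hypothetical addresses
    msg["To"] = "reader@example.org"
    msg["Subject"] = "Scanned title page"
    # Text part: quoted-printable keeps accented characters ASCII-safe.
    msg.set_content(body, cte="quoted-printable")
    # Non-text part: converts the message to multipart/mixed and
    # base64-encodes the bytes.
    msg.add_attachment(
        image_bytes,
        maintype="image",
        subtype="gif",
        filename="title-page.gif",       # hypothetical file name
    )
    return msg


if __name__ == "__main__":
    message = build_message("Résumé attached.", b"GIF89a-placeholder")
    print(message.get_content_type())    # multipart/mixed
    for part in message.iter_parts():
        print(part.get_content_type(), part["Content-Transfer-Encoding"])
    # The serialized message contains only ASCII bytes.
    print(message.as_bytes().isascii())
```

The design point is that MIME does not change SMTP itself: it wraps non-ASCII text and binary data in ASCII-safe encodings and labels each part with a content type so the receiving mail program can reverse the process.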
NNTP: Network News Transfer Protocol
A TCP/IP protocol that enables newsgroup articles to move smoothly through the Internet. Most newsgroups exist on the USENET network.

FTP: File Transfer Protocol
A TCP/IP application protocol used for on-demand transfer of text, non-text (e.g. audio) and software files from one system to another. Anonymous FTP allows a user to retrieve selected files from a remote system without authorization.

FTAM: File Transfer, Access and Management
OSI-based file transfer protocol, which has been largely eclipsed by FTP.

HTTP: Hypertext Transfer Protocol (Version 1.1)
Supports the hypertext linkages among multimedia documents that characterize the World Wide Web, a collection of HTTP servers. HTTP 1.1 is twice as fast as the initial version, 1.0.

Gopher
Uses hierarchical menu structures (no hypertext linkages) and a character-based (non-graphical) interface to access multimedia documents. Has been rapidly eclipsed by HTTP/World Wide Web.

Telnet
A TCP/IP application that allows a remote user to log in to applications such as library catalogues. Being superseded by Web-based search interfaces that employ a CGI (Common Gateway Interface) program to query a database and return results.

DATA INTERCHANGE STANDARDS AND FORMATS

ASCII (American Standard Code for Information Interchange) (ISO 646)
The most widely used character set encoding. 7 bits per character; limited to 128 mainly English-language characters.

ISO Latin 1 (ISO 8859-1:1987 part 1)
A common extension of and replacement for ASCII, "Latin 1" is the most commonly used character set of ISO 8859-n (n=1-9), a series of nine 8-bit, 256-character alphabets, for mainly European languages. Used by Windows.

UNICODE (ISO 10646-1 Universal Character Set)
16-bit character code set intended to cover all the world’s writing systems. Not widely implemented at present, although XML (see below) will support it.

ASCII (ISO 646)
Lowest common denominator: unstructured ASCII text code (see above).
Fast becoming obsolete as the more sophisticated document formats become more widely adopted.

MSWord, WordPerfect, etc. (word processing applications)
WYSIWYG approach to format coding (coding is hidden). Format coding is proprietary, so there can be problems of portability among packages or even between different versions of the same package. Commonly used for revisable texts. Can incorporate multimedia file parts.

PageMaker, Ventura, etc. (desktop publishing applications)
Proprietary format coding is input as specific commands, and effected only upon output.

RTF: Rich Text Format
A portable output/input format, developed by Microsoft, for many word processing packages.

Page description formats describe shapes on the page, i.e. the layout of a document but not the content/structure. They are used for read-only presentation of formatted, final page images for output on any printer or other output device, and are increasingly used for electronic publications.

PostScript (".ps")
Adobe-developed programming language of 420 format command operators (Level-2) which control printing (but not screen display). Allows formatted printing on any printer from any platform (Windows, UNIX, etc.). Encapsulated PostScript (".eps") files are subroutines included in PostScript files, usually used for images produced with a non-PostScript package.

PDF: Portable Document Format (".pdf")
A further development is Adobe’s proprietary PDF format, which is created, edited and viewed with the Acrobat suite of software products. Is printing device-independent, and supports e-publishing using sophisticated formatting and graphics including embedded links, annotations, thumbnails of pages, and chapter outlines for direct access. Adobe has indicated the intent to incorporate structure as well as layout into PDF by extending it to encompass SGML.

Compared to page description languages, structured information mark-up languages describe the information (content) and structure, not the layout.
They are device and processing platform independent and facilitate automatic indexing by describing headings, chapters, paragraphs, footnotes, etc.

SGML: Standard Generalized Mark-up Language (ISO 8879:1986)
A standard meta-language, or syntax, for the specification of an unlimited number of mark-up languages. An SGML document has three elements: the Declaration (describes the processing environment needed); the Document Type Definition (DTD) (a defined tag set that forms a template for describing the structure and content of a specific type of document); and the Document stream itself. SGML is independent of any system, device, language or application, and, because it separates document content definition from presentation, it allows information to be accessed or presented in ways not predicted at the time of mark-up. SGML viewing software (e.g. Panorama) parses/interprets the SGML document content according to its DTD instructions. SGML is anticipated to be a key standard in digital library development. Some relevant DTDs are:
EAD (Encoded Archival Description) - a DTD for archival material.
US MARC DTD (see under Metadata-Description)
TEI (Text Encoding Initiative) - a DTD for a wide range of scholarly resources, initially developed for the humanities.
XML (see below)
HTML is the most common DTD (see below).

DSSSL: Document Style Semantics and Specification Language (ISO 10179)
A standard associated with SGML that specifies the rules for a non-proprietary language to govern the appearance and style of the logical components (e.g. chapter headings) defined by SGML.

XML: Extensible Mark-up Language (".xml")
A simple, reduced subset of SGML designed (in 1996) for ease of implementation and interoperability with both full SGML and HTML. Currently a draft meta-language application profile, it is simpler than SGML (reducing a 500-page reference to 26 pages).
Unlike HTML, XML supports (optionally) user-defined tags and attributes, allows nesting within documents to any degree of complexity, and can contain an optional description of its grammar for use by applications that need to perform structural validation. Every valid XML document will be a conformant SGML document. Not backward compatible with HTML documents, although those conforming to HTML 3.2 can easily be converted. Not intended to supplant HTML but to complement it. The XML character set is Unicode. XML is being widely discussed currently, and future releases of the MS Internet Explorer and Netscape browsers may be XML-enabled.

HTML: Hypertext Mark-up Language (".htm" ".html")
A reduced tag set version of an SGML DTD that provides a set of platform-independent styles (defined by tags) used to define the components of a Web document. HTML 2.0 is an IETF standard; 3.0 was an IETF draft (such drafts have 6-month lifespans); HTML 3.2 was announced in May 1996 to supplant 2.0 as the lowest common denominator. Version 3.2 incorporates all of 2.0 and popular features of 3.0 such as tables, but not frames. Version 4.0 was released as a draft in July 1997. While HTML tags are primarily structure-related, there are increasingly accepted tags for specifying presentation and layout.

DHTML: Dynamic HTML
Denotes recent developments by both Netscape and Microsoft that use a combination of Cascading Style Sheets (see below) and a scripting language such as Visual Basic Script or JavaScript to merge the HTML document with the style sheet. Supports greater creative control over the visual presentation of an HTML page and allows the page to respond dynamically, without a call to the server, to user-generated events.

Cascading Style Sheets (".css")
A new approach to increasing control over the visual formatting of HTML documents (e.g. spacing, colours, backgrounds, choice of fonts, drop shadows, layering, relative and absolute positioning, on/off visibility of options, choice of media such as print, display, braille, aural). CSS style rules are held in a separate document or a separate part of the document (rather than being embedded in the text as with traditional HTML), so they can be changed and updated across multiple documents quickly. Cascading style sheets can be cached locally and reused, so their deployment can result in bandwidth/response time gains.

OpenDoc
Aims to enable embedding of features from different application programs into a single working document.

OLE: Object Linking and Embedding
Microsoft’s proprietary distributed object system that allows an application to manage part of its contents in another application. For example, an Excel spreadsheet of changing data could be invoked in its up-to-date version from a word processing document.

Raster (bitmap) image formats store information about individual pixels or dots. They are generally storage-intensive, so tend to be used for single images.

GIF: Graphics Interchange Format (".gif")
Widely used image format that displays well on most computer systems, but is limited to 256 colours. Uses a lossless compression technique. Results in relatively small files available for immediate display alongside text in Web documents, so commonly used for toolbars, icons and inline images. Can be "interlaced" (the whole image displays with sharpening clarity rather than as a sequential line-by-line clear display). One colour can be transparent (good for floating images/icons on backgrounds). Better than JPEG for sharp line, black-and-white, and gray-scale images.

JPEG: Joint Photographic Experts Group (".jpg")
A lossy compression format. Over 16 million colour hues available. Better than GIF for real-world images such as colour photographs.

TIFF: Tagged-Image File Format (".tif" or "tiff")
Stores a very large amount of information about an image. Supports different types of compression (lossy and lossless).
Widely used, but mostly as an intermediary format between scanners and desktop publishing programs.

PNG: Portable Network Graphics (".png"; pronounced "ping")
Intended to replace GIF, with improvements in error detection and interlacing speed and greater compression rates. An emerging format, but not yet widely used.

Photo CD
Photo CD is Kodak’s proprietary format for the digital storage of high resolution images on CD. The images can be viewed at a range of resolutions and manipulated using image processing software.

CCITT Group 4 Fax
CCITT has developed a series of compression mechanisms for transmitting black and white images across telephone lines by fax machines. The standards are officially known as CCITT Recommendations T.4 and T.6 but are more commonly known as Group 3 and Group 4 compression respectively. Group 4 Fax is in common use.

Vector image formats store information (mathematical algorithms) about the lines and curves making up the image. They are used for compound documents (e.g. documents combining complex formatted text, images, etc.) and scale well to display at various degrees of magnification. PostScript and PDF use vector imaging (see Page Description Formats above).

CGM: Computer Graphics Metafile
Standard for the storage and exchange of 2D graphical data. Initially a vector format, it has recently been extended to include raster storage capabilities. Four internationally standardized profiles have been developed which specify how CGM will be used within MIME-compliant e-mail and on the Web.

There is a recent proliferation of proprietary Internet audio products/formats. These are the most widely employed.

AIFF: Audio Interchange File Format (".aif" or "aiff")
Macintosh audio file format.

RIFF WAVE (".wav")
Originally Microsoft Windows’ audio file format, now extended to other platforms. Stereophonic.

µ-law (".au")
Another common Internet audio file format, from Sun Microsystems. Works on all platforms, but of lower quality. Stereophonic.
RealAudio (".ra" or ".raf")
Progressive Networks’ very popular proprietary audio product. Uses "stream" delivery, which means the audio starts to play as soon as the first bits are received by the user’s computer. The sound document is not saved on the client. Stereophonic with version 3.0 (previous versions were monophonic only).

There is likewise a recent proliferation of proprietary Internet video products/formats. The following are the most widely employed.
QuickTime Movies (".mov")
AVI: Audio-Video Interleaved (".avi")
MPEG: Moving Picture Experts Group (".mpg")
RealVideo
GIF 89a: Graphics Interchange Format 89a ("animated GIF")
ShockWave
Java, ActiveX
QTVR: QuickTime Virtual Reality
VRML: Virtual Reality Modeling Language (".wrl")
RTSP: Real Time Streaming Protocol
NetShow Standard Metadata
RDF: Resource Description Framework
URI: Uniform Resource Identifier
URN: Uniform Resource Name
ISBN, ISSN, ISMN
SICI: Serial Item and Contribution Identifier (ANSI/NISO Z39.56-1996 Vers. 2)
DOI: Digital Object Identifier
URL: Uniform Resource Locator
PURL: Persistent Uniform Resource Locator
URC: Uniform Resource Citation, or Uniform Resource Characteristics
Dublin Core
GILS: Government Information Locator Service
TEI Headers: Text Encoding Initiative headers
EAD: Encoded Archival Description
ISBD, AACR2, LC, DDC, LCSH, MARC, etc.
MCF: Meta Content Framework
SOIF: Summary Object Interchange Format
PICS: Platform for Internet Content Selection

INFORMATION SEARCH AND RETRIEVAL
Web browsers
Web search engines

Z39.50: ANSI/NISO Information Retrieval Standard Z39.50-1995
As of late 1996, also adopted as:
Specifies the rules and procedures for two systems communicating for the purposes of database searching and information retrieval. There are two parts to the standard: the "origin" portion supports the querying of remote systems; the "target" portion translates queries to the logic of the target database system and returns records or result sets. From a searcher’s perspective, the standard enables the searching of different systems through use of one familiar user interface.

SQL: ISO/IEC 9075:1992 Information Technology --- Database Languages --- SQL, also ANSI X3.135-1992 Database Language SQL (Structured Query Language)
SQL is a popular standard interactive and programming language for getting information from and updating a relational database. It allows DBMS products from different vendors to interoperate. SQL defines common data structures (tables, columns, views) and provides a data manipulation language to update and query those structures.

Z39.59: ANSI/NISO Common Command Language Standard
ILL Protocol: ISO 10160 and 10161

Directories manage distributed collections of information about people or resources. A directory is typically used to hold addressing information, but it can also be used to hold information on capabilities, accounting, or other attributes of the object being described.
X.500: CCITT X.500/ISO 9594 Directory Standard
LDAPv3: Lightweight Directory Access Protocol, Version 3
WHOIS++

INFORMATION STORAGE
CD-DA: Compact Disc-Digital Audio, or CD-Audio
CD-ROM: Compact Disc-Read Only Memory
CD-i: Compact Disc Interactive
CD-R: Compact Disc-Recordable
CD-RW: Compact Disc-Rewritable
Photo CD, Video CD
DVD Audio, DVD-ROM, DVD-R, DVD-RAM, DVD Video

Magnetic storage media such as magnetic tape, diskettes, and cartridges are prolific and largely proprietary, and thus were excluded from this paper.

Selected Sources
Alschuler, Liora. ABCD...SGML: A user’s guide to structured information. International Thomson Computer Press, 1995.
Cleveland, Gary. Electronic Document Delivery: Converging standards and technologies. IFLA UDT series on data communication technologies and standards for libraries, 1991.
Dempsey, Lorcan, et al. eLib Standards Guidelines. Version 1.0, February 26, 1996. http://ukoln.bath.ac.uk/elib/wk_papers/stand2.html
Dictionary of PC Hardware and Data Communications Terms. http://www.ora.com/reference/dictionary/
EWOS Guide to Open Systems Specifications (GOSS). http://www.ewos.be/dir/gtop.htm
Free On-line Dictionary of Computing. http://wombat.doc.ic.ac.uk/foldoc/index.html
Guenette, David R. and Dana J. Parker. "CD, CD-ROM, CD-R, CD-RW, DVD, DVD-R, DVD-RAM: The Family Album." E-media Professional. Vol. 10, no. 4, April 1997, pp. 31-52.
Hodges, Jeff, et al. An LDAP Roadmap & FAQ. http://www.kingsmountain.com/ldapRoadmap.shtml
Info2000 Directory Services. http://www2.echo.lu/oii/en/directory.html
Internet Users' Glossary. http://ds.internic.net/rfc/rfc1983.txt
InterNIC. 15-minute series. http://rs.internic.net/nic-support/15min/
National Institute of Standards and Technology. http://www.nist.gov/
NetLingo: A dictionary of the Internet Language. http://www.netlingo.com/
Network Notes. National Library of Canada. 1995- . http://www.nlc-bnc.ca/pubs/netnotes/netnotes.htm
The Open Information Interchange Initiative. http://www2.echo.lu/oii/en/oiistand.html
Pfaffenberger, Bryan. Internet in Plain English. MIS Press, 1994.
TechWeb Tech Encyclopedia. http://www.techweb.com/encyclopedia/defineterm.cgi
U-Geek Glossary. http://www.ugeek.com/glossary/glossary_search.htm
UKOLN Directory Services. http://www.bath.ac.uk/~ccsap/Directory/
Weibel, Stuart and Juha Hakala. "DC-5: The Helsinki Metadata Workshop". D-Lib Magazine, Feb. 1998. http://www.dlib.org/dlib/february98/02weibel.html
Welz, Gary. "Multimedia comes of age," Internet World. Vol. 8, no. 2, pp. 44-49.
W3C. Naming and Addressing: URI's. http://www.w3.org/pub/WWW/Addressing/Addressing.html
W3C. Resource Description Framework (RDF). http://www.w3.org/RDF/
Whatis.com, Inc. http://whatis.com/

Acknowledgements
I would like to thank my colleagues in Information Analysis and Standards at the National Library of Canada (namely, Gary Cleveland, Terry Kuny, Chris Robertson, Barbara Shuh, Leigh Swain, Fay Turner, and Michael Williamson) for reviewing and suggesting revisions to sections of this paper.

Copyright. The National Library of Canada. (Revised: 1998-06-23).