Beyond MARC

International Conference on the Principles and Future Development of AACR
Toronto, Canada, October 23-25, 1997

Mick Ridley
Department of Computing
University of Bradford
Bradford, UK

My Perspective (or Prejudices)

I'm not a librarian but a database specialist, doing research in a computing department in a team that includes librarians, with a British (sometimes European) view of the world. This research has been funded by the British Library R&D Department (now RIC), the Document Supply Centre, and the European Union's Telematics for Libraries programme.

This means that I am a receiver or user of MARC records rather than a creator (or modifier or enhancer), and my comments about MARC should be seen in that light. I will therefore tend to talk about MARC records as I have seen them, which is often not the same as how they would be if they had been catalogued in the way you (or I) would do it, or in the way the standards suggest.

What we have been doing (or trying to do) in a number of recent projects is to compare MARC records, find which are for the same items, and find which are the best. More recently we have been trying to extract the maximum information from the records to produce more useful online catalogues. The difficulties of doing this have motivated the observations that follow and led me to believe that we need a new kind of catalogue record.

Questions

The principal questions that I am addressing are:

  1. Is MARC simply an embodiment of AACR?
  2. Should the structure we use to exchange records also be the structure in which we store and display them?

And on the way I would also like to ask a few others that I don't know the answers to.

Standards are a good thing but

MARC has been amazingly useful in enabling the exchange of information (records) and provided a starting point for extensive automation. But, and it's a very big but, its very success and all-pervasiveness have meant that it has influenced, for ill in my opinion, the storage and organization of catalogue information. My particular concern is that there is a strong tendency to store either MARC records, or unitary records based on them, that contain all the information about an item: the card catalogue moved inside the computer. There is of course a body of work that has been questioning this approach; Michael Heaney's "Object-Oriented Cataloging" seems to sum up many of the issues [1]. MARC wasn't meant to do that: it is (explicitly declared to be, in the case of UKMARC) "a communications format...for the exchange of records between systems" [2].

It is also explicitly a bibliographic record standard, "defined for or extended to books, serials, cartographic materials, music and audio-visual materials" [2].

There has also been a proliferation of MARCs (this sometimes seems to be almost a matter of national pride, although that is unfair, since some at least have tackled character and language issues). This may be less noticeable in North America, but in a networked world we need not just standards but international ones, i.e. ones that stretch as far as the networks. We can convert between MARCs, but it can be a pain to do, and some differences can result in a loss of information (e.g. systems that only have one 500 tag for notes can collapse information from many tags into it but can't automatically restore the original).

I think we need to consider "How do you have good standards?" and, more specifically from my point of view, "How do you have good standards for automated systems?" From attempting to work with existing MARC records, it seems to me that there are problems in AACR2, and hence in MARC, over things that are optional or that allow scope for (mis)interpretation; such things are anathema to automated systems.

The status of uniform title as optional is one (to me disastrous) example. Its absence from most records makes trying to cluster records for the same item together very difficult.

An example where decisions require consulting external sources was found in work I did on Greek material for Project Helen, where Cavafy may or may not be the most common transliterated form of the Greek poet's name, depending on what sources you consult. Rule 22.31 states "...choose the form corresponding to the language of most of the works", which seems to have led the British Library to use Kavaphes and the Library of Congress to use Cavafy. Presumably in both cases a correct choice was made, but the different decisions reflect the different existing holdings of those libraries [3].

Physical format is one area (which I discuss in more detail below) where a greater degree of clarity is needed about what we are cataloging and why. Here there seems to be little distinction between the medium and the equipment needed to use the item, and in practice this opens the door for misinterpretation.

It would also be sensible to avoid default assumptions, such as no language meaning English and no format meaning a book. This may be OK in a localised environment, but it is storing up trouble for the networked world, where querying an overseas picture library is little harder than querying a local branch library.

To produce records that can be easily used in automated systems we also need to avoid two things:

  1. the ever-increasing list that starts as a) to e) with nice clean distinctions but grows towards z), with very fuzzy edges to each group;
  2. the black hole of "other", into which is put all sorts of material with all sorts of explanatory notes.

Having made a stand against the "other" category and the bottomless pit that is Notes, I'm very unwilling to ban them, since I know there are occasions when they are supremely useful and that there really are some things that don't fit into any other category. But in practice they are often abused, that is, used when the information should clearly be provided elsewhere. Anyone who has sought out information on large print editions will have suffered this.

There are clearly Notes that fulfill a useful function, but are these best gathered together at the (conceptual) end of the MARC record? Language information may appear in 008, 041 and 546, but as a pragmatic consideration, is it not more likely that better, more consistent records will be created if all language-related material is together? It is of course possible to display fields in any order one chooses, but I feel that it would be beneficial to associate notes with the area they annotate. This would also have the benefit of letting us collapse edition information (including edition notes) if the subdivisions were not wanted, rather than the present situation where all notes (on edition, physical format, language, etc.) are collapsed together. It is also worth noting that some Notes may refer to a work, such as summary notes; others to a particular manifestation, such as a publication note; and still others, such as a note that a book includes the author's annotations, may refer to the individual copy.
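
As a sketch of the pragmatic point (in Python, with an invented dict standing in for a record, though the tags are the ones just mentioned: language sits in bytes 35-37 of the 008 fixed field, as codes in 041 and as a free-text note in 546), gathering the scattered language information into one place is straightforward once the record is structured data:

    # A minimal sketch, assuming a toy record layout (a dict of tag -> value).
    def gather_language(record: dict) -> dict:
        langs = {}
        fixed = record.get("008", "")
        if len(fixed) >= 38:
            langs["coded"] = fixed[35:38]      # bytes 35-37 of the fixed field
        if "041" in record:
            langs["codes"] = record["041"].get("a", [])
        if "546" in record:
            langs["note"] = record["546"]
        return langs

    record = {
        "008": " " * 35 + "eng" + "  ",        # only the language bytes filled
        "041": {"a": ["eng", "fre"]},
        "546": "Parallel text in English and French.",
    }
    print(gather_language(record))
    # {'coded': 'eng', 'codes': ['eng', 'fre'], 'note': 'Parallel text in English and French.'}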

We need categories that are taxonomic, that is, cover the full range of possibilities, yet still allow for future growth. (I think it is worth mentioning here an aspect of Barbara Tillett's work [4] that hasn't been commented on as much as some: her work on relationships wasn't just a list of relationships but an attempt at a taxonomy, a complete categorisation.) Returning to physical format, we want to specify the medium at one level, in a comprehensive set of categories, e.g. text, sound, image. (I leave it to others to decide the details of such fundamentals as whether moving images are a base category or a subset of image.) We need to allow for material that is multimedia (or in some other new medium) not by the ad hoc creation of new criteria but by being able to repeat the medium category.

It seems to me that one solution here is to allow for a more hierarchical structure of information in records. This would also allow users to disregard a level of detail they do not want. It would allow us to record at one level that software was on a disk, at another whether the disk was 3 1/2", 5 1/4" or CD, and at yet another the file format or machine-level information.
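
A minimal sketch of the kind of hierarchy I have in mind follows; the category names and layout are illustrative assumptions, not proposed standard values:

    # A sketch of a hierarchical physical-format description. Users (or OPAC
    # software) can stop at whatever level of detail they want.
    item_format = {
        "medium": "software",
        "carrier": {
            "type": "disk",
            "detail": {
                "size": "CD-ROM",
                "file_format": "proprietary",
                "platform": ["MS-DOS", "Macintosh"],
            },
        },
    }

    # An OPAC grouping by the top levels only can ignore the deeper detail:
    print(item_format["medium"])            # software
    print(item_format["carrier"]["type"])   # disk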

Records structured like this could be used to group software together for OPAC displays and, for example, to separate software on CD-ROM from text on CD-ROM. At present the pertinent information in this area may be scattered through the MARC record: in coded form in 008 (Material designation) and 037, and in more textual form in 300 and 530 in UKMARC.

Things that surprise me about Cataloging and Libraries

I'm still amazed by the amount of discussion of main entry and access points. For all that relational databases may not be ideal for bibliographic uses, they do teach us that we should be able to query anything, and that sorting and display criteria can be controlled by users. Indexes are just a device for making queries go faster, and are hidden from just about everyone.

Isn't it reasonable to be able to query a catalogue to find out how many large print books (or books in French, or CDs) it has? If I've found a good introduction in one of the volumes of Marx's writings published by Penguin, shouldn't I be able to find what other Penguin editions of Marx are available? This, I know, raises issues of authority control over publisher information and of what sort of series information is useful.
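
As a toy demonstration (the schema is invented for illustration; any real catalogue would be far more elaborate), both questions become ordinary queries once the information is held in queryable fields rather than buried in notes:

    # A toy catalogue table, queried the way the text suggests users should
    # be able to query. The column names and data are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (title TEXT, author TEXT, publisher TEXT, print_type TEXT)")
    con.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
        ("Bleak House", "Dickens", "Penguin", "large print"),
        ("Capital", "Marx", "Penguin", "normal"),
        ("Early Writings", "Marx", "Penguin", "normal"),
    ])

    # How many large print books do we have?
    print(con.execute("SELECT COUNT(*) FROM items WHERE print_type = 'large print'").fetchone()[0])

    # What other Penguin editions of Marx are available?
    for (title,) in con.execute("SELECT title FROM items WHERE author = 'Marx' AND publisher = 'Penguin'"):
        print(title)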

What things (objects) are these catalogue records of?

I think that in plans for the future we need to avoid the primacy of print. Books aren't going to disappear, but they will change (are changing) and have to live alongside a variety of other media, including networked resources that may have some very different behaviours from print (and from materials like CDs, videos and even software on disc). Here I'm thinking of issues such as the notion of edition for information that is being continually updated. In saying this I don't want to fall for the "Books are dead, long live the Internet" hype, but I do want to recognize that even a lot of printed material, such as serials, has in general been poorly served by the OPAC and cataloging rules at present. We need a model that represents the book and its web site and software, and the work that starts online and then becomes a book (e.g. Krol's "The Whole Internet User's Guide and Catalog" or many Java texts). The model must fairly represent the journal that has online and print versions, e.g. the Journal of Artificial Intelligence Research, which starts online and has a printed version later, and the situation we have at Bradford (certainly not uniquely) where we have some journals in the library but also versions available on the Web, and for some of the journals available under that scheme we have no printed versions at all.

I feel that looking at the Internet, as a new medium, is a useful technique for questioning old assumptions and seeing if the old ideas still fit. I also think that it represents a major challenge for cataloging.

Having said that, I'd like to look in more detail at a more traditional area, that of physical form, since we have been investigating how it is used at present in a current research project. We have been looking at it as part of our work on BOPAC2 [5], where we are attempting to offer different ways of sorting and displaying large retrievals. One feature that we wanted to offer, and that was suggested by users, was the ability to choose only books, or only videos, etc. (Here I should add that we are working on data retrieved via Z39.50, where we haven't been able to specify that restriction in the search.) Looking at this area in an attempt to do something with the MARC records the search retrieves has opened up more questions and not given us any answers yet. What exactly is the physical format information telling us? And here I would emphasize that I'm referring to the information we get on MARC records delivered as the result of a search, AACR2 as it has been implemented rather than the ethos or true spirit of AACR2.

Format seems to be a mixture of two things: the content, i.e. pictures, sound recording, moving pictures; and the technology needed to use the item, i.e. whether it is a microfiche or microfilm, a video recording or a movie film. There are two issues at play here and neither is being well served. If I want the motion picture of "Wuthering Heights" I may care about whether it is on film or video, but probably at a different level from the distinction between text versions and "recordings of a dramatisation" of it. Similarly I might like to group text, microfiche and microfilm together as different formats of the text, in which case issues of normal text and large print, information which is often catalogued elsewhere, may be relevant.

The situation gets worse when we move on to something like a CD-ROM. Its physical form may be a CD-ROM, so we need a CD player, but what are the content/software/hardware issues? What is the CD-ROM really? It may be simple text that is held on this medium, or a set of pictures. If it is text, is it held in one of a number of different formats, such as HTML, PostScript, PDF or images of pages? If it were the latter it could be in the same file format, e.g. JPEG files, as a CD of photographic images. It would then need similar software to the CD of photos, but we might wish to categorise it with other texts rather than with the photo CD. Or the CD may use a proprietary format only easily intelligible to specialised software. What machines does the CD run on? And if we can answer these questions for CD-ROM, can we answer them for a DVD or whatever other disk format the future may bring?

What we need to catalogue is the intellectual content (the work), the forms (manifestations) in which it is found, and the links and relationships between works and manifestations. And we need to do this with a clear distinction between these different functions.

And who will create records for them?

If a lot of information is being created (avoiding the question of what's worth cataloging), how much can be done automatically? Is this the lesson of the Internet search engine? Is the counterpart that the producers of information, rather than national bodies, do more? If so, how much do we trust them? This assumes (not unreasonably, I think) that most material is in electronic form, but it also suggests that something like a serial should have all its parts accessible.

I know there are numerous efforts to provide access to serial contents via abstracts and contents listings, both printed and online. But these are separate from the normal catalogue and often involve re-keying or scanning. It should surely be possible to create records from the electronic copy that provide us with information on an article in a journal and that can be put into a catalogue. This should allow us to access that article in the same way that we would expect to find Bleak House within Dickens' Collected Works in a catalogue.

Here I would like to differ to some extent from Pat Oddy, who in her book "Future Libraries Future Catalogues" [6] suggested that there would only be a limited number of electronic texts. I believe that there are a number of issues here, and it is important to recognise that increasingly texts exist in electronic forms, originated either by their authors or by publishers. This is not to say that this is automatically the primary means of transfer and presentation for those texts. They do nonetheless exist as electronic texts, and we may wish to consider how they are likely to be used, archived and accessed over a long period of time. Do we see a long-term future for microfiche theses and reports? Will there be an electronic life after death for out-of-print books?

And how do you get the records and how much do they tell you?

In a networked world you can access things you couldn't in the past, or at least get them more easily, and there are a number of protocols, like http, which are of necessity very open; that is, you can see the source of Web pages. There are similar issues of access with, for instance, Z39.50 [7], which commonly works by the transfer of MARC records. Will people come up with another standard (lower in content than MARC) for this exchange, or charge as you would if the MARC record were being supplied to build a catalogue? If you don't allow access to your MARC records, what structure do you provide in the records that is still useful?

It is possible to imagine a simple structured record that had tagged author, title, publication, physical description, ISBN, subject and note fields, each of these tags being an aggregation or selection from a number of MARC fields. The result would be similar to the detail shown in a full record on most OPACs. Exactly how much detail should be shown (or would be acceptable to suppliers) is not clear. Wool [8] surveys current practice in this area.
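
Such a record might look like the following sketch; the field names and values are illustrative assumptions, not a proposal:

    # A sketch of the simple structured record imagined above. Each tag
    # aggregates or selects from fuller MARC data.
    simple_record = {
        "author": "Dickens, Charles",
        "title": "Bleak House",
        "publication": "Penguin",     # publisher; fuller detail sits in MARC 260
        "physical_description": "1 v.",
        "isbn": None,                 # supplied from the full record if allowed
        "subject": [],
        "note": [],
    }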

It's not clear what financial models work in a networked world; we can see publishers (information sources or providers) coming up with a variety of different models and testing the water to see what works for them.

If libraries have a policy of access rather than holding, how do they demonstrate that in their OPAC? How do you differentiate between things you have, things you have online here, things that you know are online elsewhere, and things that are elsewhere but can be brought to you (ILL, document delivery and eventually electronic delivery)?

What we should be aiming for

I think we need a work-based system, with a three-level structure:

  1. the work: the intellectual content;
  2. the manifestation: a particular form in which the work is found;
  3. the copy: an individual item held, with its holdings information.

This sort of structure has previously been suggested, e.g. by Heaney [1] and the Multiple Versions Forum [9], and was the basis of the work we did on BOPAC1 [10]. It can then be used for the differing needs of database storage, record exchange and OPAC display.

Database

This three-level structure suggests a demarcation between parts of the catalogue record.

With this, everything has a uniform title, the title of the work, and within a database many items can in fact be lists, so there is no limit to how many variant versions you might have, no issue over where an author comes in a list, and no problem over repeating information, since it doesn't need physically repeating: it just exists as a link to the original. The record never exists as a whole; it is always the sum of a number of parts that can be accessed in any way and put together by the links between parts of the record.

Some information may also be part of more than one work or manifestation. A work that is "Dickens' Collected Novels", for example, may be seen as containing both the work "Bleak House" and a particular manifestation of that work. All the information pertaining to this version of Bleak House may be the same as would be held for an individual volume of the same edition, and hence the Bleak House information within the Collected Novels may be only links to existing work and manifestation information. The Collected Novels would have its own copy or holdings information, of course. Just as a "Collected Works" contains a number of works, an edited collection or issue of a serial may contain a number of articles (or works). A work-level record should exist for each of these articles in the same way that it would for Bleak House within the Collected Novels. The records for the collection and the individual article would of course link to the same manifestation.

Separating out and standardising some of these features would also make links to related works simpler. A critical work would then be linked to the work it was about, and a translation to both the work and possibly the particular manifestation it was based upon, if for example it was known to be a translation of a second edition.
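
The Bleak House example might be sketched as follows, with plain dictionaries standing in for database records and string keys as the links; the layout is an illustrative assumption, not a schema proposal:

    # A sketch of the linked three-level structure. Nothing is physically
    # repeated: records reach shared information by following links.
    works = {
        "w1": {"uniform_title": "Bleak House", "author": "Dickens, Charles"},
        "w2": {"uniform_title": "Collected Novels", "author": "Dickens, Charles",
               "contains": ["w1"]},           # a work can contain other works
    }
    manifestations = {
        "m1": {"work": "w1", "format": "text", "publisher": "Penguin"},
        "m2": {"work": "w2", "format": "text", "publisher": None},
    }
    copies = {
        "c1": {"manifestation": "m1", "location": "Main library"},
    }

    # The author of manifestation m1 is reached by a link, not duplicated:
    print(works[manifestations["m1"]["work"]]["author"])   # Dickens, Charles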

Record Exchange

A record for transfer can always be constructed by putting together the work and manifestation parts to make a self-contained record. If the receivers already know the work, they discard that part and link the manifestation part to their existing work record. If not, they can save the work part and the manifestation part separately and create the copy part of the record. Work records would of course have to be able to contain other work records nested within them.
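
A sketch of the exchange step, under the same illustrative layout as above (the receive function and its record layout are invented for this example):

    # The sender ships work + manifestation together; the receiver keeps
    # only the parts it does not already have.
    def receive(transfer: dict, works: dict, manifestations: dict) -> None:
        work_id = transfer["work"]["id"]
        if work_id not in works:                   # an unknown work: keep it
            works[work_id] = transfer["work"]
        manifestation = transfer["manifestation"]  # always new to the receiver
        manifestation["work"] = work_id            # link it to the work record
        manifestations[manifestation["id"]] = manifestation

    works = {"w1": {"id": "w1", "uniform_title": "Bleak House"}}
    manifestations = {}
    receive({"work": {"id": "w1", "uniform_title": "Bleak House"},
             "manifestation": {"id": "m9", "format": "microfilm"}},
            works, manifestations)
    print(manifestations["m9"]["work"])   # w1 - linked to the existing work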

OPAC display

A different-looking complete record is assembled for OPAC display; in particular, following Bradford OPAC1, it may not be a set of complete work/manifestation/copy records but a tree of works with the relevant manifestation and copy records beneath them.

Authority

I feel we need to avoid the term authority file, because it suggests the wrong thing. (Just as some other terms need to be renamed to break the link with card catalogs and printed bibliographies.) What we need is not a separate stand-alone list, which often seems hard to use or to link into other systems; the UK names authority file, for example, has lots of very brief records that are little more than a personal name. What you really want is to be able to query the work part of a large database (your national library's), which should hopefully be available online. At the OPAC you don't want to be told to "see also"; you want the system to do that for you (and possibly explain what it's done).

How do we get where we want to be from where we are now?

Can the transition be made? A new system seems like a big shock, but how many systems are the same as they were, say, 10 years ago? How many libraries have changed systems entirely, or changed the hardware, the operating system, etc., without complete collapse of the system?

What we did in Bradford OPAC1 showed that we can create the new structures from existing MARC records; there may be problems, but it's a start. Some records can be grouped together and the work information separated out automatically. Other records, because of existing cataloging anomalies, need individual attention. And once you have a system like this, you can see how it is easier to add new records, how they relate to other records, and how anomalies show up more easily. From a system like Bradford OPAC1 you could still produce old-style MARC records for export, so users of old systems are not cut off; they could migrate at some (not fixed) time.

Standards

The old standards have been library ones; there is a need to embrace wider standards, particularly in the area of how records are marked up (or structured, or tagged), where we need to look at SGML, and in the area of character sets, where we need to look at Unicode [11]. There is a lot of general software (i.e. not tied to single application areas) widely available to process these. It would be a good thing if library systems were less of a specialist market, and using common standards will help this. And, as I said earlier, we need to be looking at international standards that cover all the areas that networks do. To this end, although we must acknowledge the "Anglo-American" of AACR, we will, I hope, want to look as widely as possible.

Mark-up, SGML, HTML and XML

I have nothing in particular against the actual file format of MARC, but if a widely used structure can give us the same results and allow us to use more general-purpose software, then I think we need to consider it very seriously. The popularity of the Web has made HTML (which is an SGML application) common, so SGML would seem to be the mark-up of choice. There may well be additional benefits in this approach, since publishers have been significant users of SGML, so much material that we want to catalogue may already be in this form. At a more personal level, word processing and text formatting packages are offering HTML or SGML as output formats. There are other developments, such as XML, the Extensible Markup Language [12] (sometimes called SGML-Lite), which we need to be aware of, since they are providing generalised tools that may be useful to libraries.

One feature in MARC's favour has been the use of numeric tags, which has helped to free it from a bias in favour of English, unlike HTML, whose tags have their origins in English terms. Do we need to come up with a compromise that makes the tagging structure more easily readable by the non-specialist? I think probably not, since it should always be possible to create records from data that is tagged and structured in some other way. For example, we should be able to parse the TITLE tag of an HTML document to create a 245 (or whatever it is to be), as in the sketch below. On the other hand there may be a strong case for renaming (and renumbering) to emphasise a break from old traditions. And similarly we can always turn numeric tags into the display terms or styles that we need. Here there is perhaps a lesson to be learnt from Web browsers and the notion that, for example, a heading or blockquote can be shown in a number of different ways: the stylistic choices may be the browser's, the author's or the reader's.
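
A minimal (and deliberately naive) sketch of that idea; the record layout it produces is an illustrative assumption:

    # Pull the TITLE element out of an HTML document and wrap it as a
    # hypothetical 245-like title field. Not a robust HTML parser.
    import re

    def html_title_to_245(html: str) -> dict:
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        if match is None:
            raise ValueError("no TITLE element found")
        title = " ".join(match.group(1).split())   # normalise whitespace
        return {"tag": "245", "subfields": {"a": title}}

    field = html_title_to_245("<html><head><title>Beyond MARC</title></head></html>")
    print(field)   # {'tag': '245', 'subfields': {'a': 'Beyond MARC'}}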

Character Sets and Unicode

I hope I can interest people in the issues of character sets, since they are very important, although I do recognise that they may not seem interesting at first. I would like to give some background on this area, and I hope those that are familiar with it will bear with me while I attempt to evangelise on the issue to others. I will also try to avoid too much technicality.

The problem with most character sets in use on computers is their limited size. They are restricted to 128 or 256 different characters, since they are based on each character being stored in one byte of the computer's storage. Not all of this range is available for printed characters, since some positions are used for control codes such as end-of-file and others for tab and new-line. This sort of range may be OK for modern English, but it is insufficient for many uses, for example most bibliographic situations. Single-byte character sets can also be used for other alphabets, such as Greek, but a file that uses a Greek character set may well appear to be rubbish if read expecting the Roman alphabet. Even among English speakers, transferring files from Windows to Macintosh can produce odd results for characters outside the A-Z range. Most MARCs have a character set that is satisfactory for the language of their country of origin but can cause problems if you want anything a little out of the ordinary. Even the use of cedilla or tilde in Spanish may be problematic, let alone whole different alphabets such as Cyrillic.
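
A small demonstration of the problem (the Greek example follows the Cavafy case above; Python 3 assumed):

    # The same bytes mean different characters in different single-byte sets.
    greek = "Καβάφης"                    # the poet's name in Greek script
    raw = greek.encode("iso-8859-7")     # one byte per character, Greek set
    print(raw.decode("iso-8859-7"))      # Καβάφης - read with the right set
    print(raw.decode("latin-1"))         # ÊáâÜöçò - the same bytes misread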

One solution to this problem has been the use of escape sequences to notify software that the character set is going to change and that the position that was an accented O is now a capital Sigma. This technique has had limited use and seems likely to be superseded by the use of multi-byte character sets. Here we allow more than one byte per character and hence increase the range of characters that we can represent. This means that one character set (Unicode) will let us represent all the different languages and alphabets of the world.
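
A short illustration using UTF-8, one encoding of Unicode, in which a character may take one byte or several, so a single character set covers Roman, Greek, Cyrillic and other scripts at once:

    # Latin O takes one byte; Greek Sigma and Cyrillic Ya take two each.
    for ch in "OΣЯ":
        print(ch, ch.encode("utf-8"))
    # O b'O'
    # Σ b'\xce\xa3'
    # Я b'\xd0\xaf'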

One effect of using Unicode will be to permit the storage of records in their correct script; if transliterated versions are needed, these can be produced automatically, on demand, by software. This will end one situation where the computerized catalogue has often been a step back from the card catalogue: up to now many computer systems would only allow the storage and display of the transliterated record, and the original script would not be present at all, whereas the card catalogue could at least have had a transcription of the real original text as well as a transliterated version.
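
As a toy sketch of the idea (the mapping table is illustrative only; a real system would follow a published transliteration scheme):

    # Store the record in its original script; derive a romanised form on
    # demand by table lookup.
    GREEK_TO_ROMAN = {"Κ": "K", "α": "a", "β": "v", "ά": "a",
                      "φ": "ph", "η": "e", "ς": "s"}

    def transliterate(text: str) -> str:
        return "".join(GREEK_TO_ROMAN.get(ch, ch) for ch in text)

    print(transliterate("Καβάφης"))   # Kavaphes - close to the BL form above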

The library community has a good record of involvement in the development of Unicode, and some good work has already been done in using Unicode for MARC records in projects such as the EU-funded CHASE project [13]. Having addressed the topic from an introductory point of view, I would also like to make a few comments at the more complex end. Those who already know something of this area may have wondered why I have emphasised Unicode rather than ISO 10646. The reason is that I would like to stress again the benefits of being able to use standard commercial (non-library-specific) products where possible, and here the developments are for Unicode rather than ISO 10646, although I do recognise that bibliographic use is one that may well need the greater resources of ISO 10646 for such things as records of documents in non-living languages. Unicode is significant because it is the underlying character set for Java, which I believe will be very important, and Unicode (rather than ISO 10646) support is being built into new computer operating systems such as Plan 9 and Windows NT.

This isn't to say that there are not still a lot of issues over Unicode, as anyone familiar with the debates over Korean and Han unification will know, but I feel it is coming.

If Unicode and more complex records seem to be a problem, remember that it is software that is increasingly expensive, not hardware, especially storage. Anyway, how big is a catalogue record? How big are your Word documents, or the pages that you download from the Web, especially those that include images? And of course there is no need for you to store your records using the Unicode character set; you merely need to be able to translate to and from it. What you do internally is your own business; again, MARC is only the transfer medium.

I accept that at the moment there is a gap between these two ideas, and one of the big problems of HTML has been its poor character set support, but change is on the way, both in terms of character set support in standards and in commercial support (e.g. Marc Andreessen's Web article for Netscape on future support for Unicode [14]).

Conclusion

Returning to my initial questions: I don't believe that MARC is simply an embodiment of AACR. A transfer medium for catalogue records should be just that, a transfer medium, and therefore may have a different structure from the records we store and display. The same set of cataloging rules may apply, but to different structures: structures that are suited to particular purposes, with rules for conversion between the different forms.

References

[1] Michael Heaney, Object-Oriented Cataloging, Information Technology and Libraries, 14 (3), Sept 1995, p135-153.
[2] UKMARC Manual, 3rd ed., Part 1, British Library Board, 1989, ISBN 0712310525, 1/1.
[3] Evelyn Cornell, Amelia Hatjievgeniadu, Michael J. Ridley, Searching for non-Roman script terms, in ELVIRA 2, ed. Mel Collier and Kathryn Arnold, Aslib, 1995, ISBN 085142354X
and
E. Cornell, Project Helen Name Preliminary Report 2.1, Univ of Bradford, Dept of Computing, Apr 1994.
[4] Barbara Tillett, A Taxonomy of Bibliographic Relationships, LRTS, 35 (2), p150-158.
[5] BOPAC2, http://www.comp.brad.ac.uk/research/database/bopac2.html
[6] Pat Oddy, Future Libraries Future Catalogues, Library Association Publishing, 1996, ISBN 1856041611.
[7] Z39.50, http://lcweb.loc.gov/z3950/agency/
[8] Gregory Wool, The Many Faces of a Catalog Record: A Snapshot of Bibliographic Display Practices for Monographs in Online Catalogs, Information Technology and Libraries, 15 (3), Sept 1996, p173-195.
[9] Multiple Versions Forum, Library of Congress, 084445965.
[10] F.H. Ayres, L.P.S. Nielsen, M.J. Ridley, Bibliographic Management: A New Approach using the Manifestations Concept and the Bradford OPAC, Cataloging & Classification Quarterly, 22 (1), 1996, p3-28
and
F.H. Ayres, L.P.S. Nielsen, M.J. Ridley, Design and display issues for a manifestation based catalogue at Bradford, Program, 31 (2), April 1997, p95-113.
[11] Unicode, http://www.stonehand.com/unicode.html
[12] XML, http://www.ucc.ie/xml/
[13] PROLIB/COBRA-CHASE 10169, Character Set Standardisation: Migration Strategies to UNICODE for National Bibliographic Databases, report to appear.
[14] Marc Andreessen, The *World Wide* Web, http://www.netscape.com:80/comprod/columns/techvision/international.html

Acknowledgments

I'd like to thank Fred Ayres and Lars Nielsen, the other members of our team at Bradford.