As of 22 April 2009 this website is 'frozen' in time — see the current IFLA websites
This old website and all of its content will stay on as archive – http://archive.ifla.org
SGML (Standard Generalized Markup Language) is a generic language for defining a logical data structure and for marking up instance documents that follow that structure. It can be applied to any document type, such as a book, a journal article, aircraft technical documentation, a dictionary and, of course, bibliographic records.
SGML documents are coded in a platform-independent, standard format, which guarantees the permanence and reusability of the information over a very long time. This is very important in the library environment.
SGML is currently used for data creation, exchange and storage, for indexing, searching and retrieval, for printing and for on-screen viewing. Commercial, professional-grade tools are available covering the whole range of possible applications.
SGML is being implemented in large corporations and publishing companies for data production and storage, and sometimes for data delivery. Library projects building digital document programs also consider it as a generic format that can cover both the logical structure of the electronic text and the bibliographic information associated with that document.
The Text Encoding Initiative (TEI) is an international application of SGML in the humanities and the language industries. It allows researchers to extend the format for their own use, using the SGML language and coding to mark up the primary textual document.
As SGML is often cited as a potential competitor or successor to the MARC formats, it is important to understand how it works, the different levels at which it can be used for bibliographic information, and what each level offers to libraries that are rethinking the cataloging process in light of analyses of the functional requirements of bibliographic information in the electronic environment.
The DTD (Document Type Definition) defines the logical elements that may occur in a document type, together with their hierarchical structure. Elements can be qualified by attributes. External entities, such as text, images or video, can be anchored at a specific place in the document, which makes it possible to handle multimedia. SGML also allows links to be attached inside a document or between documents, at different levels of the structure, providing a basis for hypertext navigation.
In a document coded in SGML, portions of text are marked up with the logical start and end tags defined in the corresponding DTD. To use the document with an SGML-compliant tool you need the DTD. It can be sent with the document if it is a proprietary one, or simply referred to if it is known to the partners of the exchange. Some DTDs are publicly available or standardized, and are offered as basic tools by the suppliers of SGML editors.
One of the main differences between the MARC structure and SGML is that structures expressed in SGML can go very deep in the hierarchy tree, whereas MARC data has a flat structure. SGML can describe the structure of a collection of documents and then the internal structure of each component item, and basic structural blocks can be reused at different levels of the hierarchy. In theory, the same tools can be used to encode the format and to exploit the information at any level of the hierarchy or in any embedded block. Specific users can add their own coding for specific purposes: the bibliographic community can add bibliographic metadata (data about data), the indexing community can add subject metadata, and researchers in linguistics can add linguistic metadata.
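The contrast between a flat record and a deep hierarchy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple-based tree and the COLLECTION element are hypothetical, the other element names are borrowed from the article sample on this page, and no real MARC or SGML library is involved.

```python
# Flat, MARC-like record: a list of fields, each with one level of subfields.
flat_record = [
    ("100", {"a": "Broady, Donald"}),
    ("245", {"a": "Flexible Management of SGML-encoded Documents"}),
]

# Nested, SGML-like tree: each node is (element name, list of children),
# where a child is either a text string or another node.
# COLLECTION is a hypothetical wrapper element used only for this sketch.
collection = ("COLLECTION", [
    ("ARTICLE", [
        ("FM", [
            ("TIG", [("ATL", ["Flexible Management of SGML-encoded Documents"])]),
            ("AU", [("FNM", ["Donald"]), ("SNM", ["Broady"])]),
        ]),
    ]),
])


def depth(node):
    """Return the number of element levels below and including this node."""
    if isinstance(node, str):
        return 0
    _name, children = node
    return 1 + max((depth(child) for child in children), default=0)


def find_all(node, name):
    """Collect every element called `name`, at any level of the tree."""
    if isinstance(node, str):
        return []
    node_name, children = node
    found = [node] if node_name == name else []
    for child in children:
        found.extend(find_all(child, name))
    return found


# The same function works on the whole collection or on an embedded block.
article = collection[1][0]
print(depth(collection))
print(find_all(collection, "SNM"))
print(find_all(article, "ATL"))
```

The point of the sketch is that `find_all` does not care whether it is handed the collection or a single article: one tool serves every level of the hierarchy, which the two-level field and subfield layout of the flat record cannot offer.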
Examples :
Example :
&eacute; for the letter e with an acute accent is displayed as é
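A small Python sketch can show this kind of character entity being resolved. It assumes that Python's standard `html` module is an acceptable stand-in: it is not an SGML parser, but it recognizes the same `eacute` entity name, which HTML inherited from the ISO character entity sets used with SGML.

```python
import html

# An entity reference is written as an ampersand, a name and a semicolon.
encoded = "caf&eacute;"

# html.unescape replaces known named character references by the character.
decoded = html.unescape(encoded)

print(encoded)
print(decoded)
```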
The SGML document and the marking up process
The SGML document is basically a text in which start and end tags mark up the logical elements, together with their attributes and with references to entities, which can be textual information, images or video. An article coded in SGML looks like this:
<!DOCTYPE ARTICLE PUBLIC "-//USA/AAP//DTD ART-1//EN" "article.dtd" [
  <!ENTITY Darc CDATA "SoftQuad Explorer(tm)">
  <!ENTITY nbsp CDATA " ">
]>
<ARTICLE>
  <FM>
    <TIG>
      <ATL>Flexible Management of SGML-encoded Documents</ATL>
      <SBT>Design Principles in &Darc;</SBT>
    </TIG>
    <AU>
      <FNM>Donald</FNM>
      <SNM>Broady</SNM>
      <DEG>Ph.D.</DEG>
      <AFF>
        <ONM>University of Stockholm</ONM>
        <ODV>Project manager</ODV>
        <EAD>broady@nada.kth.se</EAD>
      </AFF>
    </AU>
    <AU>
      <FNM>Hasse</FNM>
      <SNM>Haitto</SNM>
      <DEG>M.Sc.</DEG>
      <AFF>
        <ONM>Royal Institute of Technology</ONM>
        <ODV>Project coordinator</ODV>
        <EAD>haitto@nada.kth.se</EAD>
      </AFF>
    </AU>
    <ABS>
      <P>&Darc; is a multi-user, cross-platform (PC/Windows 3.1 and Sun SPARC/X11) database and information retrieval application designed primarily for documents marked-up with SGML. Among its features is a full-text document browser, in which markup-based hypertext linking is complemented by interactive, on-line linking and annotation facilities through concurrent webs. Cooperative work is supported through a novel hierarchical user group mechanism</P>
    </ABS>
  </FM>
  <BDY>
    <SEC>
      <ST>Keywords</ST>
      <L1>
        <LI><P>SGML</P></LI>
        <LI><P>Hypertext</P></LI>
        <LI><P>Databases</P></LI>
        <LI><P>Information Retrieval</P></LI>
      </L1>
    </SEC>
You can enter this information, or insert the tags into an existing ASCII file, either with an ordinary text processing system or with an SGML tool offering WYSIWYG display, interactive structure checking and contextual help listing the tags allowed at a given place in the structure.
SGML tools also let you view the SGML document with the tags hidden, using a style sheet for the layout. Different style sheets can of course produce different layouts for the same SGML document.
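The separation between logical markup and layout can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the document is reduced to a list of (tag, text) pairs, and the two "style sheets" are plain dictionaries of format strings with invented names, not the style sheet language of any actual SGML tool.

```python
# A tiny tagged document: logical elements only, no layout information.
document = [
    ("ATL", "Flexible Management of SGML-encoded Documents"),
    ("SNM", "Broady"),
    ("P", "SGML"),
]

# Two hypothetical style sheets mapping each element to a layout template.
catalog_card = {"ATL": "TITLE: {}", "SNM": "AUTHOR: {}", "P": "  - {}"}
reading_view = {"ATL": "== {} ==", "SNM": "by {}", "P": "{}"}


def render(doc, style):
    """Lay out the document with the tags hidden, one element per line."""
    # Elements the style sheet does not mention are shown as plain text.
    return "\n".join(style.get(tag, "{}").format(text) for tag, text in doc)


print(render(document, catalog_card))
print()
print(render(document, reading_view))
```

The tagged document is never modified; only the dictionary passed to `render` changes, which is why the same source can serve both a catalogue-style display and a reading display.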
Currently SGML is regarded either as an exchange format or as a creation and handling format, depending on how the cataloguing process and its objectives are seen, and also on the type of document.
Several USMARC DTDs are already available. The most detailed one was produced by the Library of Congress, and an alpha test version is available at the web site of the Library of Congress Network Development and MARC Standards Office. There is one DTD for bibliographic data and one DTD for authority data. The objective of this project is to create a standard SGML DTD that supports the conversion of cataloging data from the ISO 2709/USMARC data structure to SGML (and back) without loss of data. The project also includes the development of software utilities capable of converting between the two encoding standards. Here is the general structure of the bibliographic DTD down to the subfield level.
The U.C. Berkeley University Library also offers a less sophisticated ISO 2709/USMARC DTD, together with conversion tools. This DTD is designed for use in an on-line catalog at the U.C. Berkeley campus that employs SGML as its underlying record format.
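The idea of converting a MARC record to markup and back without loss can be sketched as follows. This is a toy illustration only: the `record`, `field` and `sub` element names are invented for the example and are not those of the Library of Congress or Berkeley DTDs, indicators and the record leader are left out, and HTML-style escaping from Python's standard library stands in for proper SGML handling.

```python
import re
from html import escape, unescape

# A MARC-like record: a list of (field tag, [(subfield code, value), ...]).
record = [
    ("100", [("a", "Broady, Donald")]),
    ("245", [("a", "Flexible Management of SGML-encoded Documents"),
             ("b", "Design Principles in R&D <draft>")]),
]


def to_markup(rec):
    """Serialize the record as tagged text (hypothetical element names)."""
    parts = ["<record>"]
    for tag, subfields in rec:
        parts.append(f'<field tag="{tag}">')
        for code, value in subfields:
            # Escape markup characters so the value cannot be read as a tag.
            parts.append(f'<sub code="{code}">{escape(value)}</sub>')
        parts.append("</field>")
    parts.append("</record>")
    return "".join(parts)


FIELD_RE = re.compile(r'<field tag="([^"]*)">(.*?)</field>', re.S)
SUB_RE = re.compile(r'<sub code="([^"]*)">(.*?)</sub>', re.S)


def from_markup(markup):
    """Rebuild the record structure from the tagged text."""
    rec = []
    for tag, body in FIELD_RE.findall(markup):
        subfields = [(code, unescape(value))
                     for code, value in SUB_RE.findall(body)]
        rec.append((tag, subfields))
    return rec


markup = to_markup(record)
print(markup)
print(from_markup(markup) == record)
```

The final comparison is the property that matters: whatever goes into the markup must come back out unchanged, including field order, repeated subfields and characters that look like markup.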
I think that the main advantages of using SGML rather than the traditional format are:
In the second case, all the descriptive information about the textual document, which is transcribed directly from the source document, can be used in place in the document itself if it is properly marked up by the publisher or during the cataloguing process.
Several approaches are under development. One is the Text Encoding Initiative (TEI), in which a header is added to the document itself to carry the bibliographic information. Another is the set of metadata initiatives that grew out of the needs of HTML on the Web.
But they all result from the same analysis: the descriptive data and the organizing data (access points) need to be carried along with the electronic document.
ftp://library.berkeley.edu/pub/sgml/marcdtd
ftp://ftp.loc.gov/pub/marcdtd