   
UDT Occasional Paper # 9
SGML: An Overview and Criteria for Use
Gary Cleveland
UDT Core Programme
E-mail:
July, 1998.
1. Introduction
SGML, or Standard Generalized Markup Language, is a tool that can aid in the creation, management, and delivery of documents. Developed in the publishing sector 25 years ago, it has since evolved into a general tool for document management. Specifically, it can help integrate the work of several authors in multi-author documents, facilitate document revision and information reuse, and streamline the output of information in multiple formats (e.g., paper, the Web, and CDROM). The benefits of SGML are many, ranging from increased information integrity, to reduced document production times, to improved document longevity.
Documents suited to SGML encoding tend to be large and complex, with multiple authors. They typically undergo frequent revision, and have components that are reused in other documents. SGML is used successfully in many sectors to manage documents, most notably in the defense, aviation, and pharmaceutical sectors, where documents are massive and involve many parties at both the input and output stages.
SGML is not, however, an appropriate tool for all documents and all organizations. The costs of converting existing publication processes to SGML are high. Shifting to SGML is not like simply buying an improved word processor or desktop publishing package; it imposes widespread changes on an organization. Document processes are re-engineered and workflow changes--extending from authors, through editors, to layout specialists, database administrators, and print houses. Thus, the decision to use SGML must be part of an overall information management strategy, rather than a piecemeal solution to isolated problems.
Because such large investments and changes in business practices are required for an SGML implementation, the key is to determine whether investments in SGML make sense for a set of target documents. Each document process must be matched against a set of criteria that help determine its suitability to SGML encoding. Furthermore, the document must be analyzed in the context of the originating organization, taking into account its current practices, its technical infrastructure, and its overall goals for information management. It may be that, given the context, an alternative technology or the status quo is more cost effective than retooling for SGML.
This paper provides a brief introduction to SGML and lists several criteria that can help determine whether SGML is suitable for specific document sets.
2. What is SGML?
SGML emerged 25 years ago out of the publishing industry as a method of easily exchanging text among typesetting systems. Since that time, SGML has grown beyond the publishing sector to become a widely-used method of managing textual information. Essentially, SGML provides a framework for describing the logical structure of documents. It sets out a grammar and method for: a) explicitly identifying the structural, or content, elements of documents; and b) codifying the hierarchical relationships among the elements.
Content elements of a document are explicitly identified through embedded tags. As in the familiar HTML (which is a specific use of SGML), content elements are "wrapped" with tags that indicate the nature of the text string contained within. For example, a string of text that is a title would be surrounded by title tags, as in: <title>SGML: An Overview and Criteria for Use</title>
Other types of content elements could include, for example, headings, chapters, paragraphs, and appendices. However, SGML allows for the creation of any number of custom-defined tags depending upon the logical structure of a given document.
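The idea of wrapping content in descriptive tags can be sketched in a few lines of Python. This is only an illustration, not an SGML toolchain: Python's standard library has no SGML parser, so the fragment is written as well-formed XML (the SGML subset discussed in section 6) and read with xml.etree.ElementTree. The tag names (report, title, heading, para) and the function list_elements are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment; the tag names are made up for illustration.
FRAGMENT = """
<report>
  <title>SGML: An Overview and Criteria for Use</title>
  <heading>Introduction</heading>
  <para>SGML is a tool for creating and managing documents.</para>
  <para>It describes structure rather than appearance.</para>
</report>
"""


def list_elements(markup):
    """Return (tag, text) pairs for each child of the root, in document order."""
    root = ET.fromstring(markup)
    return [(child.tag, (child.text or "").strip()) for child in root]


# Each piece of text now carries a label saying what kind of content it is.
for tag, text in list_elements(FRAGMENT):
    print(f"{tag}: {text}")
```

Because every string is labelled, a program can tell the title from a paragraph without guessing from fonts or position on the page.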
The SGML standard also contains general rules for describing specific types of documents. These descriptions, called Document Type Definitions (DTD), set out the elements allowed in a given document type as well as the relationships among the elements. For example, a journal article can be defined as having a title, followed by an author name, followed by an abstract, followed by an indeterminate number of headings and paragraphs in the body, followed by one or more references. In short, the DTD is an abstract model of the document. As an analogy, a DTD document model is something like the data model that describes entities and relationships in a database application. Both the DTD document model and the data model are structured, both describe the relationships within the structure, but instead of capturing data in tables and rows as in a database application, SGML allows for the structure to be embedded within the document itself through the use of tags.
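As a rough sketch of what a document model enforces, the journal article described above can be written as a content model and checked mechanically. The DTD declaration appears only as a comment; the check itself is a simplified stand-in that expresses the same model as a regular expression over the names of an article's child elements. It is not a real DTD validator, the element names are assumptions, and the sample documents are written as well-formed XML so that Python's standard library can parse them.

```python
import re
import xml.etree.ElementTree as ET

# The article model in DTD notation, for reference only (not parsed here):
#   <!ELEMENT article (title, author, abstract, (heading | para)*, reference+)>
# The same model as a regular expression over space-terminated child tag names.
ARTICLE_MODEL = re.compile(
    r"title author abstract (?:(?:heading|para) )*(?:reference )+"
)


def conforms(markup):
    """Return True if the article's children appear in the order the model allows."""
    root = ET.fromstring(markup)
    if root.tag != "article":
        return False
    sequence = "".join(child.tag + " " for child in root)
    return ARTICLE_MODEL.fullmatch(sequence) is not None


VALID = (
    "<article><title>T</title><author>A</author><abstract>S</abstract>"
    "<heading>H</heading><para>P</para><reference>R</reference></article>"
)
MISSING_ABSTRACT = (
    "<article><title>T</title><author>A</author>"
    "<para>P</para><reference>R</reference></article>"
)

print(conforms(VALID))             # True
print(conforms(MISSING_ABSTRACT))  # False
```

A validating SGML parser performs this kind of check for every element in the DTD, not just the top-level one shown here.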
DTDs can be created for an unlimited number of document types. In fact, each new document type requires its own DTD, created through a process of document analysis. Document analysis involves examining a representative sample of like documents (e.g., technical manuals for a car) and abstracting the structure inherent within them. The inherent structure is then codified formally in the DTD, creating a document model. Thereafter, all documents of that type are created according to the model. If the document type evolves over time, changes must be made to the DTD to reflect that evolution. Large industrial sectors will often collaborate on defining a common DTD for their particular sector to facilitate document consistency, exchange, and reuse. The Text Encoding Initiative (TEI) and Encoded Archival Description (EAD) DTDs are examples of DTDs of interest to librarians created by specific communities for specific purposes. (1)
With SGML, information in documents becomes more than a long string of undifferentiated characters. Instead, each element becomes a uniquely identifiable "object" that can be independently stored, accessed, inserted, deleted, revised, hidden, protected, searched, and reused. This feature of SGML is what makes it so powerful in managing large and complex documents.
3. Benefits of SGML
When used in an appropriate application, the benefits of SGML encoding are significant. They include:
- information reuse. Because SGML-encoded documents contain uniquely identifiable components (paragraphs, sections, etc.), they can be easily reused in other documents. For example, "boilerplate" text that appears in many documents need only to be stored once, then inserted automatically when a particular document instance--a fixed instantiation of a document--is produced. When changes need to be made to the boilerplate text, it is made to the master version only, avoiding the necessity of keeping the same text in multiple documents in synch. In this way, blocks of text can be repurposed in as many documents as needed.
- consistency and information integrity. The DTD forces compliance to a document model. When producing documents, authors must follow the rules set out in the DTD--certain types of information must appear, and in a specified order. Such formality improves consistency across like documents because the DTD ensures that all the necessary components are included and are in the proper place. Information integrity is improved because text that is common across documents can be inserted from a single authoritative source. Further, both consistency and integrity are preserved when moving documents from system to system, because the risk of information loss and/or corruption when converting from proprietary format to proprietary format is eliminated.
- impose varying levels of security on information components. Because the components of an SGML-encoded document are uniquely identifiable, different levels of security can be attached to each. The resulting document can then be programmed to have different "views" depending upon who the reader is, with associated read/write authorizations. Certain types of documents can contain information that not all individuals need have access to. For example, there may be layers of detail within a document that are only of interest to one set of individuals, while irrelevant to another. Documents containing sensitive information can have classified sections restricted from those without authorization, while allowing those with authorization to see and modify them.
- collaboration in multi-author environments. SGML supports features that streamline collaboration among multiple authors. First, the document model provides a "blueprint" of the document which authors must follow. Each author can be assigned a separate part of the document; although the authors work independently, the parts necessarily fit together because they must conform to the DTD. Second, assigned security levels can prevent unauthorized changes to document components.
- ability to customize information products. The same functionality that supports different "views" of a document, discussed above, allows information products to be tailored for a particular target market. Manually creating new information products from one information base can be time consuming, requiring the selection of relevant information and its layout in a new format. SGML encoding simplifies the task of pulling information pertinent to a particular target market from an information base and formatting it for the customer's preferred medium (e.g., online, paper, CDROM).
- ability to produce multiple output formats from the same information base without reformatting. SGML tags specify only the content of a document, and not its layout. In proprietary desktop publishing (DTP) applications, content and layout are typically merged within a document. That is, the codes for a paper's appearance are contained within the file along with the content. This conflation causes problems when the document is to be produced in a different format (e.g., the Web), because the original layout instructions must be removed and replaced with instructions appropriate to the new medium. Because SGML separates content from layout, alternative layouts can be easily applied to the same structured content.
- increased longevity of information. The longevity of documents maintained in proprietary formats suffers from technological obsolescence. As the software and hardware supporting the format become out-of-date, the information in the documents can be lost because it is no longer readable. SGML, on the other hand, is a mature, well-accepted, cross-platform, device-independent, vendor-independent standard. As such, it provides a degree of protection from technological obsolescence. As operating systems, platforms, and applications change over time, SGML-encoded documents will still be usable in new SGML-compliant systems. In this way, the probability of longevity is increased.
- improved information retrieval precision. Because SGML tags explicitly identify content elements, information retrieval tools can target specific elements thereby improving retrieval precision over traditional full text techniques.
- shortened time to market. Production cycles that have documentation as a dependency for completion can be shortened through the increased automation and processing efficiency brought by SGML. Examples include applications for new drugs and technical manuals for aircraft. In both cases, producing documentation efficiently is critical to shortening time to market.
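Two of the benefits listed above, multiple output formats from one source and element-targeted retrieval, can be sketched briefly. The example is a simplified illustration under stated assumptions: the sample manual, its tag names, the HTML_STYLE mapping, and the functions to_html, to_text, and search are all invented here, and well-formed XML stands in for SGML because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured source: content only, no layout instructions.
SOURCE = """
<manual>
  <title>Pump Maintenance</title>
  <section>
    <heading>Safety</heading>
    <para>Disconnect power before servicing the pump.</para>
  </section>
  <section>
    <heading>Lubrication</heading>
    <para>Apply grease to the bearing every 500 hours.</para>
  </section>
</manual>
"""

# Layout lives outside the document: a mapping from content tags to HTML tags.
HTML_STYLE = {"title": "h1", "heading": "h2", "para": "p"}


def to_html(root):
    """Render the content elements as HTML, one element per line."""
    lines = []
    for el in root.iter():
        if el.tag in HTML_STYLE:
            tag = HTML_STYLE[el.tag]
            lines.append(f"<{tag}>{escape((el.text or '').strip())}</{tag}>")
    return "\n".join(lines)


def to_text(root):
    """Render the same content as plain text with underlined headings."""
    lines = []
    for el in root.iter():
        text = (el.text or "").strip()
        if el.tag == "title":
            lines.append(text.upper())
        elif el.tag == "heading":
            lines.append(f"\n{text}\n{'-' * len(text)}")
        elif el.tag == "para":
            lines.append(text)
    return "\n".join(lines)


def search(root, tag, word):
    """Return the text of every <tag> element that contains word (case-insensitive)."""
    return [
        el.text.strip()
        for el in root.iter(tag)
        if word.lower() in (el.text or "").lower()
    ]


root = ET.fromstring(SOURCE)
print(to_html(root))
print(to_text(root))
# "pump" occurs in the title and in a paragraph, but not in any heading.
print(search(root, "heading", "pump"))
print(search(root, "para", "pump"))
```

The source text is never edited to change the output medium; only the rendering function changes. Likewise, the search can be restricted to one kind of element, which a plain full-text search cannot do.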
4. Costs of SGML
Despite the benefits that can be gained from using SGML, the costs of implementation are not trivial. As one SGML consultant states, "an SGML implementation often includes large investments in new software and hardware, training and consulting fees, legacy data conversion, as well as changes in workflow, mindsets and the entire business process" (Cox, 1997). The primary areas where costs occur include:
- skills development and training. SGML is a complex tool and, therefore, requires a large investment in skills development. SGML skills are needed on many levels, including a thorough understanding of the standard; document analysis and DTD design; training of personnel in the workings of the re-engineered document process, as well as in any specialized SGML software tools they may use; and the selection, setup, and maintenance of an SGML document-processing system. SGML experts state that such training is required "even if part or all of the design, development, and implementation work is given to specialized consultants" (Marcoux and Sévigny, 1997:588).
- SGML technical infrastructure. The technical infrastructure includes all of those components required to create an SGML document-processing system. Components include some combination of data design and validation tools, authoring tools, legacy data conversion tools, information (database or document) management tools, formatting tools, and system integration tools. SGML tools need to be researched, selected, purchased, and implemented, and there are significant costs associated with each stage. A more thorough discussion of the technical infrastructure needed to create an SGML document-processing system is contained in UDT Occasional Paper #10, SGML Technical Infrastructure Overview.
- re-engineering document processes. The direct monetary costs of re-engineering a document process for SGML are training costs. However, there are other, indirect costs associated with re-engineering and redesigning workflow. They include disruptions to publications schedules while personnel adjust to new processes; disruptions while bugs are worked out of the SGML document-processing system; and fear, rejection, and/or resistance to change to long-established processes by personnel.
- DTD design and maintenance. For every different document type to be SGML-encoded (e.g., newsletters, directories, manuals), a separate DTD will have to be created, or an industry-wide standard DTD modified. DTD development or modification requires careful document analysis, often measured in thousands of dollars per DTD. If a document type evolves over time, the DTD must be changed to reflect that evolution. Modifying the DTD, and training staff in the changes, will be an added cost factor.
- reduced flexibility. While the use of SGML improves the consistency of documents, it can also reduce flexibility in changing their structure and layout. The requirement to conform to a DTD can restrict the incremental changes in a document's structure that happen naturally over time. Such changes will have to be put off until enough have accumulated to warrant DTD re-design and retraining of staff. The same is true of a document's layout. Changes to a document's layout may have to be delayed until there are enough to justify modifying the code that specifies how an SGML document appears. (2)
- display of SGML. SGML cannot be easily displayed as native SGML-encoded text over the Internet. Typically, it is converted to HTML on-the-fly so that it can be viewed with standard Web browsers.
- handling over a network. Considerable difficulty remains in the delivery and display of SGML over networks. For structured formats to be displayed, three things must be present: 1) the tagged SGML document, including its multimedia components; 2) the DTD; and 3) the document layout specifications. For HTML documents, the layout specifications and DTD are an inherent part of a Web browser's programming. Only the raw HTML and multimedia components need to be sent over the network. This is not true of SGML where all three aspects must be delivered. Thus, there is a considerable network burden for retrieving a viewable SGML document because at least three significant network connections have to be made--one for the SGML document, one for the DTD, and one for the layout specifications. This problem has never been satisfactorily resolved by the SGML communities. (3)
5. When is SGML an Appropriate Technology?
The benefits of an SGML implementation do not necessarily accrue to every document or document process that undergoes conversion. Success depends upon the characteristics of the target documents and their suitability to SGML encoding. The criteria below can be used in making a decision as to whether SGML encoding is appropriate. While there is no exact formula for making a decision, the more criteria that fit the target document, and the more closely they fit, the more likely it is that the document is appropriate for SGML encoding. SGML encoding may be suitable if a document:
- Represents a large quantity of material. This is the one criterion that, on its own, can disqualify a conversion project (Alschuler, 1995). Projects suitable to SGML conversion tend to be very large. Typical SGML projects involve the coding of thousands, hundreds of thousands, or millions of pages.
- Is re-used in other documents. Documents containing components (e.g., paragraphs, sections) that have utility in more than one document may be appropriate for SGML encoding. Documents with a short life expectancy, that quickly become outdated, and whose components are never reused for any other purpose, have a low suitability to SGML encoding.
- Must be kept for an indefinite period. Information that must be kept accessible in perpetuity across succeeding generations of computer hardware and software may be a candidate for SGML use.
- Contains content that is of high value. SGML encoding may be warranted if the information can be considered to be an important corporate asset (e.g., patent or technical, medical, legal, or business information) or is critical to core business objectives.
- Is highly complex and structured. If the information structure exhibits many nested hierarchies and many optional inclusions (that is, components that may, or may not, be included depending on circumstances) then it has a higher suitability to SGML encoding. Other clues to a document's complexity include no fixed limits to field lengths and the need for full text searching in context (Alschuler, 1995). Examples of complex documents include aircraft and automotive maintenance manuals. Simpler structures that fit traditional database models (i.e., fixed length, repeatable fields, and controlled language) are less appropriate.
- Is dynamic and subject to frequent revisions. SGML encoding is appropriate in documents that are frequently updated and revised (e.g., technical manuals). "Static" documents that do not change once they are produced are not appropriate for SGML encoding.
- Contains linked information to a fine level of granularity. If there is a requirement within a document to be able to link from external documents down to individual elements (e.g., words, sentences, paragraphs, sections), then SGML may be suitable. The finer the link granularity required, and hence the greater the need for structured information, the more warranted is the expense of SGML.
- Contains components that are integrated as a condition of external factors. Documents containing optional components that are integrated depending upon external factors (e.g., economic or environmental factors) can benefit from SGML encoding. Information without such conditional inclusions is less suitable.
- Needs to be ported from application to application and from system to system. SGML is appropriate in heterogeneous technical environments where information moves among many different systems. Encoding of documents produced within single-system or single-application environments where there are no format conflicts is less warranted.
- Needs to be exchanged among several organizations or businesses. Documents that must be shared among several organizations (e.g., all automakers), and therefore require a common interchange format, are good candidates for SGML encoding.
- Is produced by multiple authors. If a document has components that are written at different times by different authors which must be integrated, then SGML may be suitable. Single author documents are less suitable.
- Is derived from multiple sources of information. If a document is made up of parts that come from many different sources in a number of differing formats, then SGML use may be appropriate (e.g., boilerplate text from an information base, plus database information, plus authored text, plus spreadsheet data). SGML reduces the effort of integrating information in multiple formats. Information that arrives in only one format is a low priority for SGML conversion.
- Is reviewed by many people before approval. Documents that undergo complex approval processes, where many people must sign off, and the result of the approval process must be stored for liability purposes, can benefit from SGML encoding. Simple approval processes may not warrant the expense of SGML encoding.
- Is produced for a large, diverse user group with complex access requirements. Documents that are targeted to a large, diverse user group, each of which has specific access requirements (e.g., different points of access, different views of the information, different media), are suitable to SGML encoding. If the target audience is small with simple access requirements, then the use of SGML may not make sense.
- Must be output in multiple formats. Information that must be produced in many different formats, such as paper, CDROM, the Web, and help files, is more suitable to SGML encoding than information that is only produced in one or two formats.
- Is officially mandated to conform to SGML format. A final criterion is whether the use of SGML has been mandated by an official body, such as a governmental body or industry group. An example is defense departments, which typically mandate the use of SGML for their manuals (e.g., CALS).
(Source: Alschuler, 1995; Arbortext, 1997a; Arbortext, 1997b)
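The criteria above can be kept as a simple checklist. The sketch below is only a bookkeeping aid, not a decision formula: it records which criteria an analyst judged to fit and counts them, and the judgment about suitability remains with the analyst. The criterion labels are abbreviated from the list above, and the function tally and the sample assessment are hypothetical.

```python
# Short labels for the sixteen criteria listed above (abbreviated for the sketch).
CRITERIA = [
    "large quantity of material",
    "re-used in other documents",
    "kept for an indefinite period",
    "high-value content",
    "highly complex and structured",
    "frequently revised",
    "fine-grained linking",
    "conditionally integrated components",
    "ported across applications and systems",
    "exchanged among organizations",
    "multiple authors",
    "multiple information sources",
    "reviewed by many people",
    "large, diverse user group",
    "multiple output formats",
    "officially mandated",
]


def tally(assessment):
    """Split the criteria into those judged to fit and those judged not to.

    assessment maps a criterion label to True or False; labels left out
    count as not fitting, and labels not in CRITERIA raise ValueError.
    """
    unknown = set(assessment) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    fits = [c for c in CRITERIA if assessment.get(c, False)]
    misses = [c for c in CRITERIA if not assessment.get(c, False)]
    return fits, misses


# Hypothetical assessment of an aircraft maintenance manual.
example = {
    "large quantity of material": True,
    "highly complex and structured": True,
    "frequently revised": True,
    "multiple authors": True,
    "multiple output formats": True,
    "officially mandated": False,
}

fits, misses = tally(example)
print(f"{len(fits)} of {len(CRITERIA)} criteria judged to fit")
# Size is the one criterion that can disqualify a project on its own.
print("size criterion met:", "large quantity of material" in fits)
```

The count is a prompt for discussion rather than a score; how strongly each criterion applies, and the organizational context, still have to be weighed by hand.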
6. Extensible Markup Language (XML)
In any discussion of SGML, it is worth mentioning XML--or Extensible Markup Language--which is its highly publicized offspring. XML is a simplified subset of SGML intended for use over the Web. It is a grammar, like SGML, rather than a fixed tag set, like HTML, for creating tags and adding structure to documents. It is intended to be more powerful than HTML in giving documents structure, but easier to implement than full-blown SGML, and thus lies on the spectrum of complexity somewhere between SGML and HTML. Specifically, it has been designed to be:
- easier to implement over the Web
- easier to define document types
- easier to create supporting software
- interoperable with both SGML and HTML.
Like SGML, XML supports a strong separation of content and presentation. To control appearance, a standard for output specifications called Extensible Style Language (XSL) is being proposed for use with XML. It combines formatting features from both DSSSL and CSS. The proposed standard for linking documents in XML is XLink. Both the style sheet and linking standards have yet to be finalized.
As a simplified subset of SGML, XML shares many of its characteristics. However, because it has been designed to be easier to implement, one assumes that the overall costs of use will be lower. This assumption may, or may not, be borne out with time. The criteria outlined in this paper should also be of use, to some extent, in helping to decide whether XML is a suitable technology for managing a specific document set.
7. Conclusion
Analyzing the appropriateness of a document process for SGML encoding is, unfortunately, more an art than an exact science. There are no objective, quantitative methods available for placing a particular document on the spectrum between "highly suitable" and "not suitable" for any single criterion. Judgments are necessarily subjective, based on an understanding of the document process in question, the scope of the criterion, and what is typical of successful SGML implementations.
There is also no objective, quantitative method for summing the results across criteria and coming up with a "yes" or "no" answer for a particular document process. Thus, each document process also has to be judged subjectively, taking the wider context into consideration. This wider context includes the organization, its goals for information management, its existing technical infrastructure, the costs of implementing an SGML document-processing system, and its suitability relative to successful SGML implementations.
SGML used appropriately can save time and money, and increase the efficiency of document processes to a significant degree. However, when considering the use of SGML--or XML--it is important to get beyond the hype surrounding the standards. They are not a panacea for all document processes or all organizations because of the high costs involved. When embarking on a project to make document processes more efficient, those inefficiencies should be approached as a systems problem, open to any appropriate technical solution of which SGML is just one candidate. Encoding standards should be selected only if a strong business case can be made for their use.
SGML References
Alschuler, L. (1995). ABCD… SGML: A user's guide to structured information. London: International Thomson Computer Press.
Cover, R. (1998). Extensible Markup Language (XML). http://www.sil.org/sgml/xml.html
Cox, J. (1997). Success in selling SGML internally. Paper available on the SGML Open Web site at URL: www.sgmlopen.org.
Marcoux, Y. and Sévigny, M. (1997). Why SGML? Why now? Journal of the American Society for Information Science. 48(7): 584-592.
Arbortext. (1997a). Getting started with SGML. White paper available on the SGML Open Web site at URL: commerce4.best.com/~sgml/getstart.htm.
Arbortext. (1997b). Who needs SGML/XML? White paper available on the Arbortext Web site at URL: www.arbortext.com/whoneeds.html.
Online SGML Resources
SGML, HTML, and XML Resources. www.impressions.com/resources_pgs/SGML_pgs/ SGML_HTML_XML.HTML
The SGML/XML Web Page. http://www.sil.org/sgml/sgml.HTML
OASIS - Organization for the Advancement of Structured Information Standards. (Formerly SGML Open) http://www.sgmlopen.org/
Notes
- See www.uic.edu:80/orgs/tei/ and www.loc.gov/rr/ead/eadhome.html for more information.
- While DSSSL (Document Style Semantics and Specification Language) is the international standard for specifying the layout of SGML documents, solutions typically remain proprietary (e.g., OmniMark®).
- Thanks to Terry Kuny of XIST, Inc., for these observations.