This is a short paper written for the seminar "XML, a New Start for the Web", held in spring 2000 at the Hypermedia Laboratory, Tampere University of Technology, Finland.

XML and SGML

Marko Leppänen (homepage currently down) (marko.leppanen@tut.fi)

Introduction

For many people, all they know about XML is that it is a simplified subset of SGML. That is true, but how and why XML differs from SGML is hardly known to anyone but XML enthusiasts. And how do the two relate to their well-known cousin HTML? This short paper tries to answer these questions in a compact form.

The long journey from SGML to XML

SGML was developed to answer needs arising from the paper-to-digital-media transition that was taking place in the 1980s. The need for a standardized mark-up language to describe the structure of a document was clear, and after experience had been gained from vendor-dependent languages such as GML, ISO published the ISO 8879 standard, which defines SGML. SGML uses Document Type Definitions (DTDs) to describe the logical structure of a document, so it is easy to find the different kinds of elements that form the text. It was soon realized, however, that SGML was too heavy and complex for everyday applications. It was difficult to learn, and its many exceptions made it cumbersome to write parsers that implemented the full richness of SGML.

The next step came when the Internet was expanding rapidly: HTML was developed to show text and graphics in an architecture-independent way. In addition, easy-to-use GUIs were made to bring ordinary users to the Web. HTML was, however, too weak to cope with the explosive growth of the Internet. It gave no standard way to extend the syntax, the absence of structural mark-up made quick searching difficult, and only limited metadata could be embedded in a document.

After a few years of patching HTML in various (often vendor-dependent) ways, the W3C decided that something had to be done. After two years of hard work it brought forth XML 1.0, its main weapon against the defects of HTML. XML has genuine Unicode support, allows the use of DTDs for marking up document text, and lets structural data be written alongside the formatting mark-up.
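As a rough illustration of how a DTD names the logical parts of a document, the sketch below embeds a small DTD in an XML document and then picks out one kind of element. The memo document type and all of its element names are invented for this example. Python's standard-library parser is used here; it parses the document but does not validate it against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: a memo with a recipient, a sender and a body.
# The DTD in the internal subset declares which elements exist and how
# they may nest; the document instance follows it.
MEMO = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, from, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT body (para+)>
  <!ELEMENT para (#PCDATA)>
]>
<memo>
  <to>Seminar group</to>
  <from>Author</from>
  <body>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </body>
</memo>
"""

# Parse the document (no validation is performed by this parser).
root = ET.fromstring(MEMO)

# Because the structure is marked up, finding every element of one
# kind is a simple query rather than a text search.
top_level = [child.tag for child in root]
paragraphs = [para.text for para in root.iter("para")]

if __name__ == "__main__":
    print(top_level)
    print(paragraphs)
```

A validating parser would additionally reject a memo whose elements appear in an order the DTD does not allow.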

XML compared to SGML

XML was designed to provide an easy-to-learn way to use SGML's structure-defining power and to combine it with HTML's popular features for easily describing text and graphics on the Internet. The easiest way to implement this was to restrict some of SGML's non-trivial features, since HTML is only a (broken) SGML DTD [HTML]. Some inconsistencies were made stricter; for example, in XML an end delimiter is always required. A good review of the differences between XML and SGML can be found in [CompSGML].

As the following figures show, the structures of XML and SGML do not differ much. This is due to the fact that XML is a true subset of SGML. The most important difference is that the output specification is not defined by SGML, but it is fixed in XML. (Sorry for the gigantic pictures; I know what it is like to own a modem.)

The Structure of XML

Figure 1. The structure of an XML document (image from COMPS).

The Structure of SGML

Figure 2. The structure of an SGML document (image from COMPS).

When comparing these two mark-up languages to HTML, we can see that HTML is fixed in its document type definition. HTML has also been published as an XML DTD, and this reformulation is called XHTML 1.0. Another advantage of XML over HTML is that documents can now be easily modularized and existing document bases can be reused. This means that a DTD need not be rewritten for every new document; ready-made DTDs exist to ease document writing. XML-based languages such as X3D and WML, the mark-up language used in WAP, are good examples of how XML can be extended by using DTDs. WML is also an excellent example of how an XML-based language suits purposes where HTML would be too demanding for the hardware.

Other differences that an XML newcomer must cope with when switching from HTML to XML are that attribute values must always be quoted (unquoted values created ambiguities between SGML and HTML [HTML]), tags are case-sensitive, empty tags must contain a slash character, and end tags must always be used.
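The rules listed above can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample fragments and the is_well_formed helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample fragments: one correct, then one per rule broken.
SAMPLES = {
    "well-formed": '<p class="note">Line one<br/>line two</p>',
    "unquoted attribute value": "<p class=note>text</p>",
    "case mismatch between tags": "<P>text</p>",
    "empty tag without slash": "<p>Line one<br>line two</p>",
    "missing end tag": "<ul><li>one<li>two</ul>",
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        verdict = "accepted" if is_well_formed(markup) else "rejected"
        print(f"{label}: {verdict}")
```

Only the first fragment is accepted. Each of the others is ordinary-looking HTML that breaks one of the rules, so an XML parser rejects it.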

The Structure of HTML

Figure 3. The structure of an HTML document (image from COMPS).

These differences mean that the main relation within this family of mark-up languages is that HTML is an SGML DTD and XML is a subset of SGML. Transformation from XML or HTML to SGML is fairly easy with a proper DTD, but the reverse operation can cause data loss, especially when converting SGML to HTML. Transformation from SGML to XML can be done with minimal data loss. [CompSGML] gives good advice on the translation process. Tools for automatic conversion exist and can be found on the Internet.
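A related and simpler conversion, turning loosely written HTML into well-formed XML, might be sketched as follows. This is a simplified illustration built on Python's standard html.parser and xml.etree.ElementTree modules, not one of the conversion tools mentioned above. The HtmlToXml class, the html_to_xml function and the short VOID_ELEMENTS list are assumptions made for this example. Omitted end tags are closed naively, whereas a real converter would consult the DTD to decide where an element ends.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumed, incomplete list of HTML elements that never have an end tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Feed HTML in, build an XML element tree (illustrative sketch only)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        # A bare attribute such as "disabled" gets its own name as value.
        attrib = {name: (name if value is None else value) for name, value in attrs}
        self.builder.start(tag, attrib)
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)  # serialized as an empty tag with a slash
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close every element left open inside the one being ended.
        while True:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)


def html_to_xml(html: str) -> str:
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever was never ended
        parser.builder.end(parser.open_tags.pop())
    root = parser.builder.close()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<P CLASS=note>Line one<BR>line two</P>"))
```

The result has lower-case tags, a quoted attribute value and a closed empty element, so an XML parser accepts it. Because nesting is guessed rather than taken from a DTD, a fragment with omitted end tags can come out well-formed but structured differently from what the author intended, which is one way data gets distorted in conversion.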

Material on the Web

The following sources were used in the making of this document.

Comparisons [COMPS]
SGML, XML, and HTML Document Components Compared, Dennis J. O'Connor, Consultant, Mulberry Technologies, Inc. Available at http://www.mulberrytech.com/papers/components.html.
Comparing SGML [CompSGML]
Comparison of SGML and XML, World Wide Web Consortium Note, 15 December 1997. Available at http://www.w3.org/TR/NOTE-sgml-xml.html.
HTML related [HTML]
Steve DeRose on HTML and/versus SGML, Steve DeRose, 13 Jan 1994. Available at http://www.oasis-open.org/cover/html-not.html.