Extensible Markup Language (XML)C.M.Sperberg-McQueenUniversity of Illinois at Chicagou35395@uicvm.uic.eduTimBrayTextualitytbray@textuality.com1997ACH/ALLC 1997editorthe secretarial staff in the Department of French Studies at
Queen's UniversityGregLessardencoderSaraA.Schmidttext encodingWWWExtensible Markup Language (XML for short) is being designed under the
auspices of the World-Wide-Web Consortium (W3C); the larger goal of this
effort is "to enable future Web user agents to receive and process generic
SGML in the way that they are now able to receive and process HTML. As in
the case of HTML, the implementation of SGML on the Web will require
attention not just to structure and content (the domain of SGML per se) but
also to link semantics and display semantics." (See for the W3C's
description of this activity.) As a subgoal, we are creating an SGML
application profile, XML, that is designed to provide many of the benefits
of SGML in a lightweight, easy-to-use, easy-to-implement dialect that omits
many of the difficult or problematic features of the full standard. This
paper is a report on the XML specification; if time allows, some information
will also be provided on the progress of the work toward a typology of links
and link behaviors. At the time this abstract is prepared, the XML
specification has been made public, but is still officially a working
draft.MotivationThe Standard Generalized Markup Language (SGML) is the most fully developed
specification of the use of descriptive markup languages for electronic
documents. The idea of descriptive markup is simple and powerful, and in
fact has proved to be a basic requirement for many advanced information
processing applications.Unfortunately, the adoption of SGML has proved surprisingly difficult,
expensive and slow, given that the underlying ideas are simple and
self-evidently good. In particular, there is very little use of SGML on the
World-Wide Web, which is the world's most popular electronic information
delivery mechanism. Some of the perceived reasons have included:The SGML standard itself is large, complex, and difficult to
understand.The standard specifies several optional and advanced markup
features, some of which are almost never implemented.Some of the features of SGML have proven counter-productive in
practical use.Practical use of SGML requires learning several other languages,
including the language used to write DTDs, various stylesheeting and
formatting languages, and the SGML/Open Entity Catalog
language.The design of SGML takes little account of the contemporary theory
of formal languages and finite automata. One practical result is
that SGML parsers are unable to make use of some advanced tools and
techniques made possible by that theory. Consequently, they are
large and complex pieces of computer software; as such they (a)
suffer from reliability problems, (b) have in practice proven
difficult to integrate into applications, and (c) change slowly in
response to advances in software and document processing
technology.Nonetheless, there remains a consensus that SGML's basic design partition
into entities, elements, and attributes is correct and useful. One result is
a common tendency, in strategic projects involving SGML, to avoid using many
advanced features and operate within the bounds of a highly restricted
subset. This approach has generally met with success. However, this
restricted subset has been re-invented by each successive group that has
attacked the problem.The SGML standard itself identifies two subsets of its features, intended to
simplify implementation: Minimal SGML (defined in ISO 8879, clause 15.1.2)
and Basic SGML (ISO 8879, 15.1.1). These have not had any practical
significance, however, both because the choice of SGML features they include
is not a happy one and because they have no free-standing definition, which
means they cannot be implemented by anyone who has not first studied and
understood the full text of ISO 8879.There has been informal discussion for years on the subject of a
further-simplified version of the standard. In recent times, there have been
a substantial number of formal proposals for such a simplification. They
include:A lexical analyzer for HTML by Dan Connolly of W3C, as presented
in "A Lexical Analyzer for HTML and Basic SGML: W3C Working Draft
15-Jun-96" (); this is a slight simplification of Basic SGML, with no entity
handling.the Minimal Generalized Markup Language (MGML) defined by Tim Bray
in "MGML -- an SGML Application for Describing Document Markup
Languages", unpublished draft paper for SGML '96 ( ). MGML is
unique among the contributions in that it proposes using instance
syntax for markup declarations.Normalised SGML (NSGML), an invention of Henry Thompson, David
McKelvie, and Steve Finch, presented in "The Normalised SGML Library
(NSL)", NSL Version 1.4.4, Documentation version Fri Aug 2 14:13:40
BST 1996 ().
NSGML includes not just a language definition but a suite of
software modules for parsing and handling documents in an efficient
pipelined fashion.Poor-Folks SGML (PSGML), defined by Michael Sperberg-McQueen in
"PSGML: Poor-Folks SGML: A Subset of SGML for Use in Distributed
Applications", Document UIC CC DB92-10, 8 October 1992 ( or ).SGML-Lite, by Bert Bos, presented in "`SGML-Lite' -- an easy to
parse subset of SGML", 4 July 1995 ( ().SGML Online, by Eliot Kimber, described in "SGML Revision,
Proposal for Minimal SGML Feature Set" (unfinished, unpublished
draft, distributed privately via email on 1996-06-03, now accessible
at ).the TEI Interchange Format defined in Association for Computers
and the Humanities (ACH), Association for Computational Linguistics
(ACL), and Association for Literary and Linguistic Computing (ALLC),
Guidelines for Electronic Text Encoding and Interchange (TEI P3),
edited by C. M. Sperberg-McQueen and Lou Burnard (Chicago, Oxford:
Text Encoding Initiative, 1994). The portions of the TEI Guidelines
relevant to the interchange format have been extracted and are
available separately at ().These simplified application profiles of SGML all take advantage of the fact
that SGML exhibits an extreme case of the `80-20 syndrome'; that is to say,
80% of the benefit is gained by applying only 20% of the machinery. The W3C
SGML Activity has formalized the definition of a useful subset in the form
of the Extensible Markup Language, or XML.Structure, Membership, and MechanismsThe current work was initiated by Jon Bosak of Sun Microsystems, who, in
co-operation with the Tim Berners-Lee and Dan Connolly of the World-Wide Web
Consortium, initiated the formation of the Consortium's SGML Editorial
Review Board and Working Group, who labor under the unwieldy acronyms W3C
SGML ERB and W3C SGML WG. The mandate for this effort may be found at ; it includes
SGML simplification and work on hyperlink semantics and display processing
(presumably via a DSSSL profile). This paper describes the SGML
simplification work.The work is co-ordinated by the Editorial Review Board. Its members are: Jon
Bosak (Sun, Chair), Tim Bray (Textuality, XML Co-Editor), James Clark,
(Independent, Technical Lead), Steve DeRose (EBT), Dave Hollander (HP),
Eliot Kimber (Passage), Tom Magliery (NCSA), Eve Maler (ArborText), Jean
Paoli (Microsoft), Peter Sharpe (SoftQuad), and Michael Sperberg-McQueen
(University of Illinois at Chicago, XML Co-Editor); Dan Connolly serves as
liaison with W3C. The main functions of the ERB are to steer the design and
discussion activities, and to resolve issues by voting. There is a
well-defined voting procedure designed to maximize the chances of reaching
consensus and to exercise majority rule rapidly when consensus is not
possible.The main work is done in the Working Group; this has over 60 members,
including those of the ERB. The Working Group provides technical input,
design proposals, and design critiques. It includes many people who have
published significant papers on SGML or played a visible role in the design,
evolution, and implementation of SGML; in particular Charles Goldfarb and
James Mason from WG8. As a result of this overlap, it is likely that XML
will avoid taking any directions fundamentally incompatible with the future
development of SGML; in fact, the debate on XML is apt to have some
influence on the next SGML revision.Design GoalsPrior to the commencement of discussion in the WG, the ERB developed a
`strawman' set of design goals to guide this discussion. While these remain
open for challenge and revision, they have been fairly stable and thus
presumably represent a reasonably large-scale consensus among those involved
in this work. The design goals are:1. XML shall be straightforwardly usable over
the Internet. This does not
mean users can feed it to, for example, the Netscape of today, but
that the design will have regard at all times to the needs of
distributed applications working on large-scale networks.2. XML shall support a wide variety of
applications. No design elements shall be adopted which
would impair the usability of XML documents in other contexts such
as print or CD-ROM, nor in applications other than network browsing,
such as validating editors, batch validators, simple filters which
understand XML document structure, normalizers, formatting engines,
translators to render XML documents into other lanuages, and
specialized browsers for specialized markup.3. XML shall be compatible with SGML.
I.e. (1) Existing SGML tools will be able to read and write XML
data. (2) XML document instances are SGML document instances as they
are, without changes. (3) For any XML document, a DTD can be
generated such that SGML will produce the same parse as would an XML
processor. (4) XML should have essentially the same expressive power
as SGML. Note: (1) and (2) describe our
goal in its ideal form. If this goal is not achievable in its
fullest form, then we may back out to a weaker form: it shall be
simple to transform XML documents into equivalent SGML documents,
and vice versa. Our intention, however, is to bite the bullet and
ensure if we can that no transformation is needed to allow SGML
tools to read and write XML document instances. (3) and (4) indicate
our intentions accurately, but it is not yet clear how best to
formalize and explain the phrase "the same parse", or the phrase
"essentially the same expressive power". These remain open
questions; see point 8 also.4. It shall be easy to write programs which
process XML documents. In particular, it shall be
straightforwardly possible to construct useful XML applications
which do not read, or need to read, the DTD of the XML document.
Note: For this purpose, easy means that the holder of a bachelor's
degree in computer science should be able to construct basic
processing (parsing, if not validating) machinery in less than a
week, and that the major difficulty in the application should be the
application-specific functions; XML should not add to the inherent
difficulties of writing such applications.5. The number of optional features in XML is to
be kept to the absolute minimum, ideally zero. As a
result of this, any XML document has a high probability of being
handled successfully by any XML processor.6. XML documents should be human-legible and
reasonably clear.7. The XML design should be prepared
quickly. A first draft of the XML design should be ready
for distribution and comment by end of 1996; a version should be
ready for production use by the end of March 1997.8. The design of XML shall be formal and
concise. XML should be simple and easy for implementors
to grasp; its reference documentation should not exceed 20 pages,
which should contain mostly formal grammar and very little normative
text, if any. Note: normative text is not
the same as descriptive or explanatory text. XML shall specify
clearly what characteristics of the input must be represented in the
parse tree of an XML document, and what characteristics need not be
captured by XML processors. This means the property sets
`significant' in an XML application will be defined both formally
and informally. Which properties are significant and which are
insignificant remains an open question.9. XML documents shall be easy to create.
It should be a straightforward task (though possibly
labor-intensive) to create valid XML documents by hand (i.e. without
a validating authoring tool). It should be a straightforward task
(though possibly labor-intensive) to create a validating XML
authoring system. 10. Terseness is of minimal importance.
Minimizing keystrokes is not deemed important in achieving any of
the above goals, but other things being equal a concise notation
should be preferred to a verbose.A Snapshot of XMLAt the time this paper is submitted, an initial public draft of the XML
specification has been distributed, but like all working drafts it is
subject to change. The broad outlines of XML, however, are clear enough to
be summarized here.XML omits a large number of SGML features often left unused in practice:
DATATAG, OMITTAG, RANK, LINK, CONCUR, SHORTREF, SUBDOC, and FORMAL are all
dropped. SHORTTAG, which defines several ways in which SGML documents may
abbreviate their tags, is entirely disallowed except that attributes need
not be specified if a default is specified for them when they are
declared.Most of these features are rarely used in any case; the most visible change
is the absolute abandonment of SGML techniques of markup minimization. In
XML documents, all tags are always present in full (except that attributes
may be omitted if they have their default values). This will make no
difference to those who use SGML or XML editors; others may choose to write
their documents using standard SGML tag omission and then run the document
through a normalizer like James Clark's spam.In order to ensure that XML processors can, under certain circumstances, skip
the document's DTD and still process the document correctly, empty elements
(like the TEI's PTR element or like the HTML BR element) are required to be
self-identifying: instead of the form <e>, they must take the
form <e/>. This simple innovation radically reduces the
complexity of parsing XML documents.Comments and processing instructions are retained; XML uses a number of
specialized processing instructions of its own as declarations. Comments are
simplified, however, to try to minimize user errors.In order to ensure the widest possible use, XML requires conforming
processors to support usage of the characters from ISO 10646 (Unicode) in
both markup and data. For the convenience of those still working without
Unicode editors (currently the majority of users), processors are encouraged
to accept other character-set encodings as well.XML also restricts, in some ways, the normal SGML syntax for declaring
elements and attributes. In particular, the AND connector is dropped,
inclusion and exclusion exceptions are dropped, and the set of data types
for attributes is simplified and rationalized (within the limits set by the
design goal of compatibility with SGML).Conditional marked sections are allowed in the DTD, but not in the document
instance. In DTDs, conditional sections allow easy customization of the DTD;
they appear unnecessary in document instances, since most practitioners
agree that variant text is better handled with specialized elements and
style-sheets. CDATA marked sections, in which markup characters need not be
escaped, are allowed only in the document itself, and only in a restricted
form.In the interests of simplicity, XML abandons SGML's notion of abstract syntax and defines only a single concrete syntax, modeled on SGML's reference concrete syntax but extended to handle
polyglot documents and large documents better. In XML, all tags will be
enclosed in < and >, all entity references between &
and &refc;, and all attribute values quoted. Unlike SGML, XML
provides no mechanism for changing the default delimiters.Future PlansAt the time this paper is submitted, it is the intention of the ERB to revise
the draft XML spec and to turn, in early 1997, to the topic of hyperlink
typology. In late 1997, the third phase of the project will see the
specification of a subset of the Document Style Semantics and Specification
Language (DSSSL) intended for use in network browsers (DSSSL-Online). The
XML specification may change as a result of work in the two later phases;
when it appears stable, steps will be taken to move it through the normal
W3C processes to make it a technical report, then a proposed recommendation,
and finally a specification of recommended practice.