Meaning and Interpretation of MarkupC.M.Sperberg-McQueenWorld Wide Web Consortium, USA ClausHuitfeldtUniversity of Bergen, Norway AllenRenearBrown University, USA 2000University of GlasgowGlasgowALLC/ACH 2000editorJeanAndersonAmalChatterjeeChristianJ.KayMargaretScottencoderSaraA.SchmidtText EncodingMarkup is inserted into textual material not at random, but to convey some
meaning. An author may supply markup as part of the act of composing a text; in
this case the markup expresses the author's intentions. The author creates
certain textual structures simply by tagging them; the markup has performative
significance. In other cases, markup is supplied as part of the transcription in
electronic form of pre-existing material. In such cases, markup reflects the
understanding of the text held by the transcriber; we say that the markup
expresses a claim about the text.In the one case, markup is constitutive of the meaning; in the other, it is
interpretive. In each case, the reader (for all practical purposes, readers
include software which processes marked up documents) may legitimately use the
markup to make inferences about the structure and properties of the text. For
this reason, we say that markup licenses certain inferences about the text.If markup has meaning, it seems fair to ask how to identify the meaning of the
markup used in a document, and how to document the meaning assigned to
particular markup constructs by specifications of markup languages (e.g. by DTDs
and their documentation).In this paper, we propose an account of how markup licenses inferences, and how
to tell, for a given marked up text, what inferences are actually licensed by
its markup. As a side effect, we will also provide an account of what is needed
in a specification of the meaning of a markup language. We begin by proposing a
simple method of expressing the meaning of SGML or XML element types and
attributes; we then identify a fundamental distinction between distributive and
sortal features of texts, which affects the interpretation of markup. We
describe a simple model of interpretation for markup, and note various ways in
which it must be refined in order to handle standard patterns of usage in
existing markup schemes; this allows us to define a simple measure of
complexity, which allows direct comparison of the complexity of different ways
of expressing the same information (i.e. licensing the same inferences) about a
given text, using markup.For simplicity, we formulate our discussion in terms of SGML or XML markup,
applied to documents or texts. Similar arguments can be made for other uses of
SGML and XML, and may be possible for some other families of markup
language.Related work has been done by Simons (in the context of translating between
marked up texts and database systems), Sperberg-McQueen and Burnard (in an
informal introduction to the TEI), Langendoen and Simons (also with respect to
the TEI), Huitfeldt and others in Bergen (in discussions of the Wittgenstein
Archive at the University of Bergen, and in critiques of SGML), Renear and
others at Brown University, and Welty and Ide (in a description of systems which
draw inferences from markup). Much of this earlier work, however, has focused on
questions of subjectivity and objectivity in text markup, or on the nature of
text, and the like. The approach taken in this paper is somewhat more formal,
while still much less formal and rigorous than that taken by Wadler in his
recent work on XSLT.Let us begin with a concrete example. Among the papers of the American historical
figure Henry Laurens is a draft Laurens prepared of a letter to be sent from the
Commons House of Assembly of South Carolina to the royal governor, Lord William
Campbell, in 1775. Some words have lines through them, and others written above
the line. The editors of Laurens's papers interpret the lines through words as
cancellations, and the words above the lines as insertions; an electronic
version of the document using TEI markup and reflecting these interpretations,
might read thus:<P><DEL>It was be</DEL> <DEL>For</DEL> When we
applied to Your Excellency for leave to adjourn it was because we foresaw
that we <DEL>were</DEL> <ADD>should continue</ADD>
wasting our own time ... </P>From the DEL elements, the reader of the document is licensed to infer that the
letters "It was be", "For", and "were" are marked as deleted; from the ADD
element, the reader may infer that the words "should continue" have been added.
Software might rely on these inferences in the course of making a concordance or
displaying a clear text; human readers will rely on them in interpreting the
historical document. Note that the markup here stops short of licensing the
inference that "should continue" was substituted for "were". The editors could
license that inference as well by appropriate markup, if they wished. Human
readers may make the inference on their own, given the linguistic context;
software cannot safely infer a substitution every time an addition is adjacent
to a deletion.A simple way to capture the meaning of markup is to define, for each markup
construct, a set of open sentences - sentences with unbound variables - which
express the inferences licensed by the use of that construct. In formal
reasoning, such open sentences may be transformed into logical predicates in the
usual way.For example, the TEI element type DEL is said by the documentation to mark "a
letter, word or passage deleted, marked as deleted, or otherwise indicated as
superfluous or spurious in the copy text by an author, scribe, annotator or
corrector" (TEI P3, p. 922). We take this to mean that when a DEL element is
encountered in a document, the reader is licensed to infer that the material so
marked has been deleted. In formal contexts, we may write "deleted(X)"; we can
specify the meaning of the DEL element and of the logical predicate "deleted(X)"
by means of an open sentence: "X has been deleted, or marked as deleted, or ..."
etc. The variable X is to be bound, in practice, to the contents of the DEL
element. If we imagine a variable named 'this', instantiated to each element of
a document in turn, and a function 'contents' which returns the contents of its
argument, then the meaning of the DEL element becomes
"deleted(contents(this)))", or equivalently "contents(this) has been deleted
..." etc.The TEI element type HI, similarly, "marks [its contents] as graphically distinct
from the surrounding text" (TEI P3, p. 1013). We can capture the meaning of HI
by the open sentence "X is graphically distinct from the surrounding text", or
"highlighted(X)", where X is, as before, to be replaced by "contents(this)".Attributes may be treated similarly. The 'rend' attribute on the <hi>
element "describes the rendition or presentation of the word or phrase
highlighted". In the example<P><HI REND="gothic">And this
Indenture further witnesseth</HI> that the said <HI
REND="italic">Walter Shandy</HI>, merchant, in consideration of the
said intended marriage ... </P> the HI elements convey the
information that the contents of those elements are distinct from their
surroundings, while the 'rend' attributes on the HI elements specify how. The
meaning of the 'rend' attribute is expressed by the open sentence "X is rendered
in style Y." An HI element with a 'rend' attribute thus means "X is graphically
distinct from its surroundings, and X is rendered in style Y".Perhaps the simplest method of interpreting markup is to assume that1. The meaning of every element type is expressed by an open sentence
whose single unbound variable is to be bound to 'contents(this)'.2. The meaning of every attribute is expressed by an open sentence
with two unbound variables, one of which is to be bound to
'contents(this)' and the other to 'value(this,attribute-name)' (i.e. to
the value of the attribute in question). In other words, each attribute
defines some relation R which holds between the contents of the element
and the value of the attribute.3. All inferences licensed by any two elements are compatible.The set of inferences applicable to any given location L is then the union of the
inferences licensed by all the elements within which L is contained. Let us call
this the 'union model' of interpretation.The union model is simple, and provides a good first approximation of the rules
of inference for marked up text. But it is not wholly adequate.First, it fails to distinguish distributed properties (such as 'italic' or
'highlighted') from sortal properties (such as paragraphs, sections, or - as
illustrated above - deletion). It is as true to say "The word 'And' is in
black-letter" as to say it of the entire phrase, and the meaning of the example
given above would not change if the HI elements were split into two or more
adjacent pieces each with the same 'rend' value. Conversely, two HI elements
with the same attribute values can be merged without changing the meaning of the
markup. Other elements mark properties which are NOT distributed equally among
the contents, and cannot be split or joined without changing the meaning of the
markup. From the markup<P>Reader, I married
him.</P>we can infer the existence of one paragraph, but we
cannot infer that "Reader" is itself a paragraph. Such properties we call
'sortal' properties, borrowing a term of art from linguistics. Elements marking
sortals are usefully countable; those marking distributed properties are
not.Second, the union model fails to allow a correct interpretation of inherited
values and overrides, as illustrated by the TEI 'lang' attribute or the xml:lang
attribute of XML. In fact, some inferences do contradict each other, and
specifications of the meaning of markup need to say which inferences are
compatible, and which are in conflict, and how to adjudicate conflicts.Third, the union model allows inferences about a location L only on the basis of
markup on open elements (those which contain L); in order to handle common
idioms of SGML and XML, a model of interpretation must handleupward propagation: the meaning of an element may depend in part on
its contents; this is unusual in colloquial SGML/XML systems, but is a
regular feature of proposals to eliminate attributes from markup
languages.context dependency: the meaning of an element may depend on its
context; trivial examples include TEI's HI and FOREIGN, which can mean
'not-Roman' and 'not-English' in one context, and 'not-italic' and
'not-German' in others.ordinal position, relative or absolute; dependence of meaning upon
ordinal position is seldom an explicit feature of markup languages, but
dependence of processing based on position is a standard feature of
style-sheet languages.milestone elements; these convey information by position in the
beginning-to-end scan of the linear form of the document, rather than by
position in the tree.linking: out-of-line or 'standoff' markup conveys information about
location L based not only on open elements, but on elements which point
at L or some ancestor of L.Other methods of associating markup with meaning are imaginable, but we believe a
survey of existing DTDs will show that all or virtually all current practice is
covered by any model of interpretation which encompasses the complications just
outlined.Essentially, these can be handled by extending the rules for binding variables in
the open sentences which specify the meaning of a given markup construct. The
simple union model allows only 'contents(this)' and
'value(this,attribute-name)'; the constructs listed above require more complex
expressions, roughly equivalent in expressiveness to the TEI extended-pointer
notation or to the patterns of the XPath language defined by W3C.Complexity of the semantics associated with an element type or attribute may be
measured by the number of unbound variables in the open slots, by the complexity
of the expressions which are to fill them, and by the amount or kind of memory
required to allow full generation of the inferences licensed by markup in a
particular text.ReferencesSteveDeRose et alWhat is Text, Really?Journal of Computing in Higher Education13-261990ClausHuitfeldtMulti-Dimensional Texts in a One-Dimensional
MediumComputers and the Humanities284-5235-2411995D.TerenceLangendoenGaryF.SimonsRationale for the TEI Recommendations for
Feature-Structure MarkupComputers and the Humanities293191-2091995[Laurens, Henry.]Commons House of Assembly to Lord William
CampbellDavidR.Chesnutt et al The Papers of Henry LaurensColumbia, S.C.University of South Carolina Press1985Vol. 10305-308AloisPichlerWhat is Transcription, Really?ACH/ALLC '93, Georgetown1993AllenRenearDavidG.DurandElliMylonasRefining our notion of what text really is: the problem
of overlapping hierarchiesResearch in Humanities ComputingOxfordOxford University Press1995Originally delivered at ALLC/ACH '92.GaryF.SimonsConceptual Modeling versus Visual Modeling: A
Technological Key to Building ConsensusComputers and the Humanities304303-3191997C.M.Sperberg-McQueenLouBurnardGuidelines for Electronic Text Encoding and Interchange
(TEI P3)Chicago, OxfordACH, ALLC, and ACL1994C.M.Sperberg-McQueenLouBurnardThe Design of the TEI Encoding SchemeComputers and the Humanities29117-391995PhilipWadlerA formal semantics of patterns in XSLTPaper presented at Markup Technologies '991999ChristopherWeltyNancyIdeUsing the Right Tools: Enhancing Retrieval from
Marked-up DocumentsComputers and the Humanities331-258-841999Originally delivered at TEI 10, Providence (1997).