Applying the TEI: Problems in the classification of proper nouns

Julia Flanders, Brown University, Julia_Flanders@brown.edu
Sydney Bauman, Brown University, Sydney_Bauman@brown.edu
Paul Caton, Brown University, Paul_Caton@brown.edu
Mavis Cournane, Computer Center, University College Cork, mavis@eolas.ucc.ie
Willard McCarty, Kings College London, Willard.McCarty@kcl.ac.uk
John Bradley, Kings College London, john.bradley@kcl.ac.uk

ACH/ALLC 1997

Editor: the secretarial staff in the Department of French Studies at Queen's University; Greg Lessard
Encoder: Sara A. Schmidt
Keywords: name, TLH, WWP, TEI

Abstract

The testing of the TEI Guidelines since their release has thus far taken a
somewhat private form. Scholarly text encoding projects have availed
themselves of the Guidelines' exceptional richness and nuance, but with the
aim of doing the greatest possible justice to the complexity of their own
data, or the particular needs of their own users, rather than with any
concern for developing consistency between projects. To a certain extent
this is justifiable; the point of a flexible standard is precisely that it
can accommodate the multiple needs of its various users. However, where
divergence is only the result of random choice among equivalent options,
rather than being motivated by real constraints, it serves no purpose and
only impedes the exchange and use of data. Now that the Guidelines have been
in use long enough to create a substantial base of encoded data, projects
whose source material and encoding strategies are similar can benefit from
comparing approaches to common problems, and assessing whether their
divergences are justified by differences in data or philosophy, or merely
represent unnecessary variation in the application of the TEI.

One area of primary source transcription which deserves examination along
these lines is the classification of proper nouns and similar words and
phrases, using the elements described in Chapter 20 of the TEI Guidelines: <name>, <rs>, and
the suite of more specific elements such as <placeName>,
<orgName>, <foreName>, <surName>,
<roleName>, etc. These elements describe a set of phenomena
whose retrieval and processing are important to the scholarly user of the
encoded text, but whose boundaries are quite fluid and often involve the
application of theoretical considerations quite unrelated to text encoding.
(For example, is "God" a personal name?)

The proposed session will present several perspectives on this problem, with
several aims: first, of allowing the participating projects (and those
represented in the audience) to compare practices and discuss the status of
their variation; second, of situating the specific problem of encoding
proper nouns within the context of scholarly analysis, so as to create a
more precise sense of the needs which the encoding is intended to address;
and third, of thinking more broadly about the pressures and constraints on
classification systems in text encoding.

Two of the three papers in this session come from encoding projects which use
the TEI, and which have used Chapter 20 in particularly detailed and
carefully considered ways. The first of these is the Brown University Women
Writers Project, in a paper co-authored by Julia Flanders, Paul Caton, and
Sydney Bauman, which will address the WWP's approach to the use of Chapter
20, and its attempt to balance scholarly needs and cost-effectiveness. The
second is the Thesaurus Linguarum Hiberniae (TLH) project, discussed by
Mavis Cournane, who will examine TLH's use of TEI to classify and specify
different kinds of proper nouns within TLH's corpus of writing in Ireland.
The last paper, by Willard McCarty and John Bradley, will consider these
issues from the perspective of a non-TEI project dealing intensively with
names and their classification, An Analytical Onomasticon
to the Metamorphoses of Ovid. This
paper will discuss the encoding of names in relation to the complex issues
of literary criticism and analysis, with an in-depth exploration of examples
from the Metamorphoses.

Nouns Proper and Improper: Using the TEI for primary sources

Julia Flanders, Sydney D. Bauman, Paul Caton

Introduction

The TEI approaches the encoding of names as a problem having
largely to do with the need to give labels to existing
phenomena: Chapter 20, "Names and Dates", begins by saying that
the elements provided therein offer the encoder "a detailed
substructure" and the ability "to distinguish explicitly between
names of persons, places or organizations" [P3, p. 583]. The
elements offered in this section and elsewhere in the TEI are
indeed sufficient to encode most if not all of the name-related
phenomena found in the texts with which the Women Writers
Project is concerned. However, this sufficiency on the SGML side
of the equation does not assist with the other side: the fact
that the encoder must in fact "distinguish explicitly" between
the names of persons, places, organizations, mythical creatures,
objects, and the like. That is, we must decide what the thing is
before we can encode it, and this is not always easy.

The Women Writers Project

The WWP has an additional challenge, however, which is that in
working with older texts we are confronted with a set of
phenomena which the text itself identifies--by typographical
emphasis of some sort--as being of linguistic or rhetorical
importance. In texts printed in the 17th and 18th centuries,
this set includes the elements discussed in Chapter 20, but also
some related textual features such as abstract nouns and
adjectives derived from proper nouns. Thus texts from this
period themselves identify a set of features in which a scholar
might well be interested, but which shade into one another and
may be difficult to identify and classify with any certainty.
For example, if one wants to distinguish names of persons from
abstract nouns, one runs into challenges in the case of allegory
or moral poetry, where virtues may be apostrophized as if they
were human, or may be identified with a human agent, or may be
in fact the name of that agent. Similarly if one wishes to
distinguish names of persons from the names of other kinds of
things (such as non-human creatures or objects) one needs not
only the encoding equipment to label these but also a clear
definition of what it means to be human. Test cases here might
be the Medusa, the Minotaur, mermaids, centaurs, Niobe after her
transformation into a stone, and the vexed question of the human
status of various deities. An additional challenge arises in the
case of adjectives which are derived from proper names, like
Caesarian or Plutonic; for these the TEI offers no clearly
appropriate element, since <rs> is technically
reserved for nouns.

Problems of Classification

Discussion of these issues often verges on the whimsical, for
instance when one is forced to articulate what a "person" is
(human from the neck up? able to speak or write? able to mate
with humans?), but also engages with more serious issues
concerning the nature of naming. If one does not wish to make
naming and reference the centerpiece of one's encoding system
(as would be appropriate for a text like the Metamorphoses, but not for an eclectic collection
like the WWP's), one needs to draw a line between the category
of names and the other things which shade into them: for
instance epithets, vocatives like "Milady", or terms like "the
Cockatrice" whose unique reference is vitiated by the presence
of an article, and the whole range of apostrophes to abstract
qualities like "Fair Virtue". Without such a line, it is hard to
know where to stop, and the result is a huge set of features
from which it is impossible to retrieve the information one
wants. The natural response to this problem is to attempt to
classify these in turn, for instance with type attributes, an
approach which other projects (CURIA for instance) have taken
with success.

The WWP has found, however, that for our texts it is very
difficult to create a sufficiently comprehensive and unambiguous
set of values to categorize these features in a way that would
allow researchers actually to do systematic work on them. The
WWP's path to this conclusion involved several attempts to
create a system which could do
justice to this complexity. We tried dividing our field of
features into names of persons (using <persName>
and its various components), names of non-persons (using
<name>), and non-name references to both persons
and non-persons (using <rs>). This last category
was especially baroque, since it included the most heterogeneous
group (abstractions, epithets, personifications of inanimate
things, symbols, apostrophes, and references to mythical or
imaginary creatures), and in fact each iteration of the
classification process proved again that a substantial challenge
lay in what to do with the residuum, the things which are only
alike in being unlike some other, more clearly delimited
category. The conclusion we found ourselves drawing was that
although the concepts we were dealing with were fairly distinct,
their application to specific textual phenomena was not by any
means straightforward. Furthermore, although the components of
the "residuum" were easily identifiable as categories which did
not fit into the other two (personal and non-personal names), we
were not confident that they represented categories which would
be useful for scholarly study, though we were confident that
trying to use them would prove to be extremely time-consuming
and hence expensive. As a result, we eventually decided to use a
simplified system which made no attempt to classify things
beyond the element level; we now distinguish between personal
names and the names of non-persons, and any other kind of
reference which the text identifies as a proper noun is encoded
using <rs> without a type attribute.

Conclusion

The conclusion which emerges from this attempt seems to be that
despite the various provisions of the TEI for encoding these
complex textual phenomena, the limiting factor really is human
use and the ability to define and enforce categorization. The
question which leads from this is one of how to regard and apply
the TEI: if it is imagined as a system for accounting to one's
own satisfaction for what one
finds in the text, then the complexity available is essential.
However, if the TEI is regarded as a method of communicating
textual information to others, as long as the text itself is
allowed to determine the encoding solution we will find this
communication extremely difficult. Put another way, if an
encoding project develops a TEI-based encoding system based on
the assumption that its own data has unique requirements, that
very assumption limits drastically the possibility of
integrating that data with that of other projects to build
larger resources, or the possibility of users being able to make
common assumptions about how data will be treated. As a
strategic matter, these possibilities are best kept open by the
counterassumption that data can be
treated similarly (even if that counterassumption is to some
degree false). At this stage in the TEI's development, projects
working on similar undertakings (similar materials, similar
methodologies) have had the opportunity to discover the
uniqueness of their own data and to revel in it, and they need
to turn their attention to finding ways to share it. The
ultimate goal of this session is therefore to discuss the degree
to which this is possible, and the costs of doing so.

The application of SGML/TEI to the reality of Irish Texts

Mavis Cournane

Abstract

This paper will look at how the TEI DTD is used to encode
names in Irish texts. Most of the questions and difficulties
encountered by the TLH encoder of names will have presented
themselves to others. Some of the problems of encoding are
generated by the TEI DTD itself, while others are due to the
inherent complexity of the texts themselves. In addressing
the problems, I hope to take encoding decisions out of the
realm of the arbitrary. The proposed solutions will
demonstrate how the TEI DTD can be manipulated and the need
for consistency in encoding.

Introduction

The TEI DTD is a descriptive DTD rather than a prescriptive
one. The descriptive nature of the TEI DTD is problematic
for the encoder because very little in TEI is mandatory, and
the encoder is given several choices of how to encode
various textual features. From the outset the would-be
encoder is faced with the problem of what to encode, how to
encode, and for whom to encode it.

The target audience determines the degree of markup necessary
in a text and also strongly influences the way in which you
mark something up. Irish texts by nature are rich in prose,
poetry, chivalry, hagiography, linguistic, genealogical, and
historical data. They appear in five languages, Old Irish,
Old Norse, Norman French, Latin, and Hiberno-English and
contain some transliterated Hebrew and Greek. Consequently,
their encoding requires a great depth and variety of markup.
The application of the theoretical world of TEI to the real
world of text presents challenges and problems. These are
particularly acute in the encoding of names. TEI puts at
your disposal several elements for naming people and things,
all of which would work, or appear reasonable. You are
provided with tags such as <name>,
<persname>, <surname>,
<forename>, <rolename>,
<addname>, <genname>,
<namelink>, and <placename>. The
encoder then has to decide which of these best suit her
needs. Close attention also has to be paid to the nesting
requirements within TEI. For example one cannot simply
decide to encode a name as a <forename>
without first nesting it within a <persname>.
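The nesting requirement just described can be sketched as a small consistency check. This is only an illustration, assuming the rule exactly as stated here and an XML-style serialization of the markup; the function name `unnested_components` and the sample fragments are hypothetical and are not part of TLH's tooling.

```python
import xml.etree.ElementTree as ET

# Component elements that, as stated in this paper, must sit inside <persname>.
# (Assumption: lowercase element names, as used in this section.)
COMPONENTS = {"forename", "surname", "rolename", "addname", "genname", "namelink"}


def unnested_components(xml_text):
    """Return the tags of name components that have no <persname> ancestor."""
    root = ET.fromstring(xml_text)
    problems = []

    def walk(element, inside_persname):
        tag = element.tag.lower()
        if tag in COMPONENTS and not inside_persname:
            problems.append(tag)
        # Children of a <persname> (at any depth) count as correctly nested.
        for child in element:
            walk(child, inside_persname or tag == "persname")

    walk(root, False)
    return problems


# Illustrative fragments only; they are not drawn from the TLH corpus.
good = "<p><persname><forename>Julia</forename> <surname>Flanders</surname></persname></p>"
bad = "<p><forename>Julia</forename> <surname>Flanders</surname></p>"

print(unnested_components(good))
print(unnested_components(bad))
```

A check of this kind would normally be left to the DTD and a validating parser; the sketch only makes the constraint visible outside the parser.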
The same holds true for <surname>,
<rolename>, <addname>,
<genname> and <namelink>.

The problems

1. What needs to be encoded?
2. How much needs to be encoded?
3. Should there be a difference between encoding the names of
sacred personages, people and objects?
4. In historical texts which span centuries, personal names
are dynamic. What denoted a forename in the 11th century
indicated a surname by the 13th century. How can markup be
consistent yet reflect the semantic accuracy of the
text?
5. Placenames present the same difficulty. Their function can
change over time. The place Armagh functioned initially as a church, then
graduated to a monastery, a monastic town, then a church, and
in a 20th century context it denotes a town and is also the
seat of the archbishops. For example:
<pn type="church">Ard Macha</pn> d' fothuccadh
<pn type="monastery">Ard Macha</pn> do losccadh
<pn type="town">Ard Macha</pn>
Further inconsistency occurs for search purposes when there
is variation in the spelling of placenames. If a place can
be spelt eight different ways and have had a dynamic
function over eight centuries, how is the untrained person
to conduct meaningful searches?
6. In non-English historical texts it can be problematic to
distinguish between <placename> and
<orgName>. Many elements have been abbreviated
in TLH for ease of reading when markup is on display. For
example, <orgName> is abbreviated to
<on>. It is used to mark up organizations in
the widest possible sense, be they historical groups,
parties, lineages etc. In many instances it is not
clear whether one should encode a name as a place or an
organization. Many placenames take their name from dynastic
names. This poses problems not only for the encoder, but
also for the user. As much encoding in this instance is
subjective, the user, seeking to search the database, will
have her own prejudices and expectations. For example:
(1) <on type="people/dynasty">Connacht</on>
(2) <pn type="kingdom">Connacht</pn>
(3) <pn type="province">Connacht</pn>

In example (1) it is the dynasty Connacht which is being
referred to and in (2) and (3) it is the place, the kingdom
or province Connacht which is in question. These ambiguities
mean that close attention has to be given to the semantics
of the text to discern how to encode.

Solutions

(1) the development of an encoding scheme based on a
scholarly rationale rather than one legitimized by Lust und Laune (whim and fancy).
(2) greater use of the attributes provided by TEI to further
categorise, regularise and normalise encoded text.

These solutions need not be mutually exclusive.

Conclusion

Names in Irish texts are dynamic and ambiguous. Their
encoding will only serve a meaningful purpose for the end
user, if regularisation and consistency guide the marking up
process. Without due consideration of these factors search
and retrieval would be a complicated, unsatisfactory
exercise.

Theft of fire: meaning in the markup of names

Willard McCarty, John Bradley

Our subject is what happens when a computational metalinguistic
tagging scheme is imposed on a poetic text in order to make a
subset of the data accessible to automatic processing. In our
case, tagging is employed to mark 'names' in the broadest sense,
i.e. all devices of language by which persons are identified.
Our text is the Metamorphoses, a highly
complex mythological compendium written by the Roman poet Ovid
during the reign of Augustus. Its 12,000 lines of Latin
hexameter contain approximately 50,000 such devices, a
systematic accounting of which is the aim of the project. The
result, generated automatically by software from the tagged
text, is An Analytical Onomasticon to the
Metamorphoses of Ovid.
This is a new kind of reference work designed to help Ovidian
specialists figure out how the poem might cohere. For humanities
computing, however, the primary interest lies in the radical
'loss in translation' when ambiguous poetic phenomena are
rendered as meta-linguistic tags, and in how this loss is turned
to advantage.[1]

[1] An early stage of the research is
described in McCarty 1996 (paper submitted 1992); for more
recent reports, see McCarty 1993 and 1994. An introduction
with illustrations of the output is available on the Web
(sites in Europe and N. America). For a broader view of tagging and text, see
Sperberg-McQueen 1991; Renear 1992; Renear, Durand, and
Mylonas 1996.

As one of us has argued elsewhere, the translation model is quite
useful in thinking about the literary and linguistic
consequences of tagging a poetic text (McCarty 1994). The
radical poverty of tagging 'languages' makes the discussion
about loss-in-translation, pervasive in the literature,
especially valuable (Barnstone 1993). The question of this loss
devolves into the more fundamental one of expression itself,
where it takes the useful form of meditations on that which
seems curiously to be in but not of language - George Steiner's
"flame of the spirit in the momentary fixity of the
letter".[2] What happens to this flame in the act of translation? What
happens to it when the target is a computational
pseudo-language, therefore radically deficient for representing
the rich ambiguities of poetry?

[2] See Steiner 1975. For the ineffable
in language see also Liu 1988, Burdick and Iser 1989.

A computing project that attempts to render what we loosely call
'meaning' into processible form faces such questions at every
working moment. The struggle is, of course, central to computing
as a whole insofar as it models complex realities or perceptions
of them with crude mechanical constructs (McCarty 1994: 278-81).
For humanities computing as such, the crudity of our methods is
a central issue; it is only avoided, not answered, by construing
the computer as a 'mere tool' or by taking refuge in progress.
Arguably we step over the most significant threshold in (or
rather into) the field when we realise that the inevitable
failure of all such modeling is anything but a death-sentence to
our common project, but rather the source of its integrity and
power. What we lose in tagging is in a sense what we gain.

In this paper we will focus closely on how the construction of a
taxonomy for naming simultaneously falsifies and illuminates the
text. Using quite specific examples from the Metamorphoses we will demonstrate in detail how the
application of such a taxonomy to a text, in the process of
encoding it, itself constitutes a kind of literary
criticism.

The central problem that the Onomasticon is intended to address,
the coherence of the Metamorphoses as a
work of literature, will require brief explication in the paper
to demonstrate how well the poem serves the interests of
humanities computing. Briefly, we will argue that because Ovid's
poem reflects the central problem of tagging literary text -
seeking "the flame of the spirit" as it transmigrates through
"the momentary fixity" of bodies - the Metamorphoses presents in a radically stubborn form
just the kind of challenge our common project needs to advance
intellectually.

For the Metamorphoses, the spoor of this
transmigrating spirit is supplied by numerous kinds of language
data through which apparently disparate narrative elements are
associated. For a manageable subset, we chose all references to
persons, i.e. names, because persons are to be found in every
story of the poem and each reference to a person can be exactly
identified with particular elements of the textual data.
(The same could not consistently be said, for example, of
metaphors, allusions, or themes.) No particular theory on how
the Metamorphoses is or could be
constructed was assumed, although a theory arguing for multiple,
simultaneous constructions is offered in the book.

Details of the mark-up scheme and of how tags are processed are
available but mostly irrelevant for present purposes. Our focus
is rather on the taxonomy and how well it works, or rather how
well it fails and in failing serves a scholarly purpose.

The basic taxonomy is simple. Onomastic devices are classified
under the headings of proper names (including patronymics,
matronymics, toponyms, et sim.), nominals (nouns and adjectives,
including phrases), pronouns, verbs, and personal attributes,
i.e. nominals referring to anything that in context is closely
enough associated to evoke the person. Text quoted in the tag is
lemmatised insofar as the syntax of the quoted segment will
permit, given a taxonomic type, and assigned to the person.
(More precisely, the onomastic device is given a 'standard
name', or editorial lemma that in most cases is simply the name
of the character but is used to identify synonymous references,
to declare disparate references as such, and otherwise to assign
an identity to something, such as a metamorphosed being.)
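The taxonomy just outlined can be pictured as a simple record structure. The following is a minimal sketch under stated assumptions: the class `Device`, the function `index_by_person`, and the sample records are hypothetical illustrations, and do not reproduce the tag syntax or the software actually used for the Onomasticon.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five headings of the taxonomy as listed in the paragraph above.
CATEGORIES = {"proper name", "nominal", "pronoun", "verb", "attribute"}


@dataclass(frozen=True)
class Device:
    category: str       # one of CATEGORIES
    lemma: str          # lemmatised text quoted in the tag
    standard_name: str  # editorial lemma identifying the person

    def __post_init__(self):
        # Reject anything outside the declared taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def index_by_person(devices):
    """Group onomastic devices under the standard name of the person evoked."""
    index = defaultdict(list)
    for device in devices:
        index[device.standard_name].append((device.category, device.lemma))
    return dict(index)


# Hypothetical sample records, not taken from the Onomasticon itself.
sample = [
    Device("proper name", "Daphne", "Daphne"),
    Device("nominal", "nympha", "Daphne"),
    Device("proper name", "Apollo", "Apollo"),
]

index = index_by_person(sample)
print(index["Daphne"])
```

Grouping on the standard name is what lets synonymous references collect under one person; the interpretative work lies in choosing that name, not in the grouping.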
Although interpretative problems occur at every step, the most
interesting ones are in assigning names to persons. This should
not be surprising for a poem in which people become things and,
less often, things people, but the particular demands of markup
turn the expected into a surprisingly complex and revealing
operation.

At root is the question of what constitutes a person. For a
majority in the Metamorphoses the
answer is as obvious as in daily life, but for a large number of
doubtful instances, we found that the most useful way to
approach the question was to look for ontological shift. Hence,
any sub-human entity (animal, vegetable, mineral) is regarded as
a person, and so tagged, if it undergoes metamorphosis up the
'chain of being' or is otherwise personified; if it reverts to
sub-human state, it ceases to be a person and so is not tagged.
Any anthropomorph that has undergone downward metamorphosis
(including a god that has temporarily changed shape) is said to
be a different though closely related person,
'X-in-the-form-of-Y'. Certain figures of speech that suggest but
do not manifest ontological shift are similarly marked by
assigning a distinct but closely related standard name: similes,
'X-compared-to-Y', and multiple persons having a corporate
identity by acting as one, 'X-and-Y-and-Z'. Dis-personification
of entities formerly persons (e.g. venus for sexual passion, bacchus for wine) is treated by dropping the entity
as a person but including the reference as his or her
attribute.

Persons, then, are treated in essence as momentary constructions,
apt to appear as such under certain conditions of language, or
to change identity or vanish altogether when those conditions
change. Thus much is obvious given the text under consideration.
For the Analytical Onomasticon to be
useful, however, the tagging must be governed rigorously by
consistent editorial policies that specify these conditions of
language. The first of these specifies a fundamental binary
state: a textual phenomenon is either identified with a person
or not; no 'weights' or degrees of identity are allowed. Beyond
that, our policies have evolved inductively, through a long
series of heuristic intermediaries, almost continuously revised
during the work. Although they are to be reviewed in the final
stage of the project, they are mostly stable now and will be
described briefly in the paper. The Analytical
Onomasticon contains a long theoretical introduction
in which they are discussed in great detail.

Apart from the fact that these policies are fundamental to the
usefulness of the Analytical
Onomasticon, they enable the user intelligently to
criticise the work, and since plans are to publish all the
component materials in electronic form, to modify the tagged
text effectively and so to regenerate the book along different
lines. (The scope of our paper unfortunately does not allow
further discussion of the publishing aspects.) More importantly
for the subject of this paper, these policies by nature aim
explicitly to specify how certain well-known but imperfectly
understood literary phenomena actually happen. Personification
has, for example, been well studied, both within and well beyond
classical studies, but nowhere does one find an attempt to spell
out its linguistic conditions. Similarly, metamorphosis is an
obvious and well-studied topic, yet one does not find an
empirical guide to its boundaries, when it can be said to occur,
when not, and while it is happening, of what the process exactly
consists. Tagging forces one to say, or more precisely, to make
the attempt.

In the paper, explication of particular examples will demonstrate
such attempts and in particular focus on the literary-critical
value of their failure: (1) the personification of Sol in the
story of Phaethon, Met. 1.751ff, and
the dis-personification of Bacchus and Venus at various points
throughout; (2) Apollo's pursuit and metamorphosis of Daphne,
1.525-567, and her ambiguous status as the laurel tree later in
the poem; (3) rhetorical strategies in the contest of arms
between Ajax and Ulysses (13.1-398).

Works cited

Barnstone, Willis. The Poetics of Translation: History, Theory, Practice. New Haven: Yale University Press, 1993.

Burdick, Sanford, and Wolfgang Iser. Languages of the Unsayable: The Play of Negativity in Literature and Literary Theory. New York: Columbia University Press, 1989.

Liu, James J.Y. Language, Paradox, Poetics. A Chinese Perspective. Ed. Richard John Lynn. Princeton: Princeton University Press, 1988.

McCarty, Willard L. Encoding Persons and Places in the Metamorphoses of Ovid. Part 1: Engineering the Text. Texte 13/14: 121-72. 1993 [published 1994].

McCarty, Willard L. Encoding Persons and Places in the Metamorphoses of Ovid. Part 2: the Metatextual Translation. Texte 15/16: 261-305. 1994 [published 1995].

McCarty, Willard L. Peering Through the Skylight: Towards an Electronic Edition of Ovid's Metamorphoses. In Susan Hockey and Nancy Ide, eds., Research in Humanities Computing 4. Oxford: Clarendon Press, 1996. 240-262.

Renear, Allen. Representing Text on the Computer: Lessons for and from Philosophy. Bulletin of the John Rylands University Library of Manchester 74: 221-48. 1992.

Renear, Allen, D.G. Durand, and E. Mylonas. Overlapping Hierarchies of Text Objects: Refining our Notion of What Text Really Is. In Susan Hockey and Nancy Ide, eds., Research in Humanities Computing 4. Oxford: Clarendon Press, 1996. 263-80.

Sperberg-McQueen, C.M. Text in the Electronic Age: Textual Study and Text Encoding with Examples from Medieval Texts. Literary and Linguistic Computing 6: 34-46. 1991.

Steiner, George. After Babel: Aspects of Language and Translation. London: Oxford University Press, 1975.