Marking up in TATOE and exporting to SGML - Rule development for identifying NITF categories

Lothar Rostek
GMD - Integrated Publication and Information Systems Institute
rostek@darmstadt.gmd.de

ACH/ALLC 1997

Editor: the secretarial staff in the Department of French Studies at Queen's University; Greg Lessard
Encoder: Sara A. Schmidt

Keywords: semantic mark up, proper noun extraction, SGML

Introduction

We have analyzed a corpus of German news messages with the general aim of
extracting semantic information. More specifically, we have focused on the
automatic categorization of proper nouns relating to persons, organizations,
locations as well as numeral and temporal expressions. Proper noun
identification and classification has been examined for different languages,
such as English, Chinese or Japanese (Wakao et al. 1996, Chen and Lee 1996,
Kitani and Mitamura 1994, among many others). For German, two idiosyncratic
features of the language complicate this task. First, German makes no
surface distinction in spelling between proper nouns and common nouns:
common nouns are also spelled with the first letter capitalized. Second,
compounds in German can be long, without any word boundaries or hyphenation
between the contained nouns, and are therefore relatively difficult to
identify.

This work belongs to a project which aims at a real-world application, and
for this reason the categories of the SGML-based standard News Industry Text
Format (NITF) have been applied. NITF was developed by the International
Press Telecommunications Council (IPTC) for the exchange of news messages. An
interesting feature of the NITF standard is that, besides structural mark up,
it also allows semantic encoding. Our aim in this project has been twofold:
first, to develop an algorithm for the automatic identification of those
phrases in new incoming messages which contain semantic information, e.g.
names of persons, organizations, places, weekdays etc. Second, to mark up
the messages according to the respective NITF categories and export the
marked up messages as NITF-conformant SGML text. The degree of
correctness of the automatically marked up texts is decisive for the
applicability of this method in daily practice.

The general application context

The task reported above is part of the CLIP-ing project, a national
collaborative project supported by DeTe-Berkom (a subsidiary of the German
Telekom AG). One of the partners of the consortium is the German press
agency (dpa), which provided us with the corpus of news messages. IPTC is
another partner and has had a strong interest in the possibility of
automatic semantic encoding using the NITF standard. The goal of the
CLIP-ing project is to support the linking of different agency services and
the planning and management of news production, to add further value to news
content by means of content indexing, and to conform to standards. It is
envisaged that in this way news agencies may provide their clients with news
reports which are semantically marked up according to the NITF standard.
This adds value to the information sources of news agencies and serves the
richer information requirements which their clients may have.

Within the CLIP-ing project, a specific work area relevant to content indexing
concerns the investigation of ways to analyze machine-readable news texts
for the automatic identification and semantic classification of proper
nouns, e.g. 'Hans Albrecht', 'Ontario', 'UNICEF', as well as temporal
expressions and phrases denoting person roles, e.g. 'Anfang Oktober 1996'
(beginning of October 1996), 'Bildungsminister' (minister of education).
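
To make the kind of phrase classification described here concrete, the following is a minimal sketch of a single hand-written rule that marks a role word followed by capitalized words as a PERSON phrase containing a FUNCTION phrase. It is an illustration only, not the rule formalism or rule set used in the project: the names `Markup`, `ROLE_WORDS`, `CAP_WORD`, `ROLE_PLUS_NAME` and `mark_roles_and_persons` are hypothetical, and the role lexicon is an assumption.

```python
import re
from typing import NamedTuple


class Markup(NamedTuple):
    category: str  # NITF-style category name, e.g. "PERSON"
    start: int     # offset of the first character of the phrase
    end: int       # offset just past the last character


# Hypothetical role lexicon (assumption: not taken from the project).
ROLE_WORDS = ("Bildungsminister", "Regisseurin", "Intendanten")

# A capitalized word, allowing German umlauts and sharp s.
CAP_WORD = r"[A-ZÄÖÜ][a-zäöüß]+"

# A known role word followed by one or more capitalized words.
ROLE_PLUS_NAME = re.compile(
    r"\b(" + "|".join(map(re.escape, ROLE_WORDS)) + r")((?:\s+" + CAP_WORD + r")+)"
)


def mark_roles_and_persons(text: str) -> list[Markup]:
    """Mark each 'role + name' phrase as PERSON with the role as FUNCTION."""
    marks = []
    for m in ROLE_PLUS_NAME.finditer(text):
        marks.append(Markup("PERSON", m.start(1), m.end(2)))
        marks.append(Markup("FUNCTION", m.start(1), m.end(1)))
    return marks


text = "Der Bildungsminister Hans Albrecht sprach in Ontario."
for mark in mark_roles_and_persons(text):
    print(mark.category, repr(text[mark.start:mark.end]))
```

Because German capitalizes common nouns as well, a rule this naive over-extends: in "Bildungsminister Hans Albrecht Reformen" it would treat "Reformen" as part of the name. This is the difficulty noted in the Introduction, and it is why capitalization alone is not a sufficient cue for German.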
This task bears similarities to the general text analysis systems which are
reported in the message understanding conference proceedings (Grishman and
Sundheim 1996, MUC-3, MUC-5). Our task here is not to develop a full-blown
information extraction or text analysis system, but rather to extract only
certain application-related information from the news messages and render it
in a standard format, thus embedding it in the workflow process.

Methodology

A corpus of 483 raw dpa news messages drawn from the dpa text database has
been analyzed. In order to have an evaluation basis for the automatic
extraction the whole corpus has been marked up by human coders. Given that
the main source of information for the development of the extraction rules
has been the news messages corpus itself, it has been important to have
flexible means for inspecting and viewing corpus words and the contexts they
occur in. Furthermore, abstracting from the level of single word types, it
is advantageous to be able to display selective concordances by means of
syntactic and/or semantic patterns. To enable the definition of
syntactic patterns, we have used GERTWOL, a morphological analysis tool for
German from Lingsoft, Finland. With
regard to the semantic information, our aim has been to define mechanisms
for filtering out relevant words according to part-of-speech categories and
then classifying them semantically.

For the analysis tasks described above we have used the Text Analysis Tool
with Object Encoding (TATOE) (Alexa & Rostek 1996). Two features of
TATOE are important for this kind of work. First, TATOE enables the
analyst to develop corpus-based pattern rules for parsing and marking up a
corpus of texts according to a categorization schema (Alexa & Rostek
1997). Second, TATOE enables analysis and mark up of the
corpus texts according to different categorization schemata concurrently.
Since TATOE did not support the export of mark up into SGML encoded text, an
export procedure has been defined for the particular application context.
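
As a rough illustration of what such an export has to do, the sketch below inserts SGML tags into a text from mark up stored separately as character offsets, placing a FUNCTION element inside a PERSON element when one phrase contains the other. It is not the TATOE procedure: instead of the inclusion lattice used in the project, it uses a simpler sort-and-stack approach that handles only properly nested phrases and rejects partial overlaps. The names `Markup`, `export_sgml`, `previous_weekday` and `span` are hypothetical, offsets are assumed to lie within the text, and the sentence is an abridged version of the one in the NITF example in this paper. The helper `previous_weekday` shows one way to obtain the NORM value of a CHRON element for a weekday name, assuming it refers to the most recent such weekday before the message date.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Markup(NamedTuple):
    tag: str            # element name, e.g. "PERSON"
    start: int          # offset of the first character
    end: int            # offset just past the last character
    attrs: tuple = ()   # (name, value) pairs, e.g. (("NORM", "19910827"),)


def previous_weekday(message_date: date, weekday: int) -> date:
    """Latest date strictly before message_date on the given weekday (Monday=0)."""
    days_back = (message_date.weekday() - weekday - 1) % 7 + 1
    return message_date - timedelta(days=days_back)


def export_sgml(text: str, marks: list[Markup]) -> str:
    """Insert tags for properly nested mark up; reject partial overlaps."""
    out, open_marks, pos = [], [], 0

    def close_up_to(limit: int) -> None:
        # Close every open element that ends at or before `limit`.
        nonlocal pos
        while open_marks and open_marks[-1].end <= limit:
            m = open_marks.pop()
            out.append(text[pos:m.end] + f"</{m.tag}>")
            pos = m.end

    # Outer elements first: earlier start, and for equal starts the longer span.
    for m in sorted(marks, key=lambda m: (m.start, -m.end)):
        close_up_to(m.start)
        if open_marks and m.end > open_marks[-1].end:
            raise ValueError(f"{m.tag} partially overlaps {open_marks[-1].tag}")
        attrs = "".join(f' {name}="{value}"' for name, value in m.attrs)
        out.append(text[pos:m.start] + f"<{m.tag}{attrs}>")
        pos = m.start
        open_marks.append(m)
    close_up_to(len(text))
    return "".join(out) + text[pos:]


def span(text: str, phrase: str, tag: str, attrs: tuple = ()) -> Markup:
    """Build a Markup for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Markup(tag, start, start + len(phrase), attrs)


text = ("Die Regisseurin Andrea Breth gehört der Leitung an, "
        "teilte das Theater am Dienstag mit.")
tuesday = previous_weekday(date(1991, 8, 28), weekday=1)  # message dated 28.8.91
marks = [
    span(text, "Regisseurin", "FUNCTION"),
    span(text, "Regisseurin Andrea Breth", "PERSON"),
    span(text, "Theater", "ORG"),
    span(text, "Dienstag", "CHRON", (("NORM", tuesday.strftime("%Y%m%d")),)),
]
print(export_sgml(text, marks))
```

Sorting by start offset and, for equal starts, by descending end offset guarantees that the enclosing PERSON tag is opened before the enclosed FUNCTION tag, while the stack ensures that elements are closed innermost first.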
This export procedure is presented later on in this paper.

Semantic mark up

The dpa corpus which we imported into TATOE contains 483 dpa messages with
20,407 word types and 124,691 word tokens. We have defined and used two
categorization schemata in TATOE. Each schema contains categories which are
based on the NITF categories for semantic mark up. One of the schemata has
been used to mark up all the texts intellectually, that is, by human coders,
with such NITF categories as PERSON, FUNCTION, CITY, CHRON, etc. In addition
to the standard NITF categories, we have defined more specific ones in order
to allow for more detailed information; for example, the categories PCHRON
and FCHRON for distinguishing between past and future temporal phrases, or
FUNCPERS for those phrases which express a named person together with
his/her role. This intellectual mark up has been used as a test basis for
the evaluation of the correctness of the automatic mark up.

The second schema consists of the same categories (although their names are
spelled slightly differently to enable comparisons) and has been used for
storing and displaying the automatically performed mark up. A set of pattern
rules has been defined in order to parse and mark up the texts accordingly.

The evaluation of the correctness of the automatic mark up is then a
comparison between the mark up of the two schemata, i.e. it measures the
differences between the intellectual and the automatic mark up.

Exporting performed semantic mark up into SGML

Each marked up position in TATOE is stored in an object representing the
paragraph it belongs to. This object also contains the text of the
paragraph. This storage mechanism, which separates the text from its mark up
positions, has the advantage of enabling fast selection and display of all
mark up according to the schema currently selected by the user. In addition,
multiple and overlapping mark up does not pose a problem. However, to export
the mark up into an SGML format, the mark up positions first need to be
selected, then sorted, and finally the SGML tags need to be inserted into
the original text. Furthermore, this process has to respect the dependencies
between the marked up elements. For example, in the text shown below, the
system has stored two marked up phrases, namely 'Regisseurin' and
'Regisseurin Andrea Breth'; it needs to be recognized that the first phrase
is part of the second, and that for this text an element called FUNCTION
should be inserted inside the PERSON element.

<NITF><HEAD><TOBJECT></TOBJECT>
<IPTC7901.WIREHEAD IPTC7901.PRIORITY="5" IPTC7901.TIMEDATE="281434
Aug 91" IPTC7901.SVCID="bas" IPTC7901.OPTINFO="vvvvb dpa 260 "
IPTC7901.KEYWORD="Theater" IPTC7901.MSGNUM="362"
IPTC7901.CATEGORY="ku"></HEAD>
<BODY><HEDLINE><HL1><PERSON>Andrea
Breth</PERSON> in künstlerischer Leitung der
<ORG>Berliner
Schaubühne</ORG></HL1></HEDLINE>
<DATELINE><LOCATION>Berlin</LOCATION></DATELINE>
<P>Die
<PERSON><FUNCTION>Regisseurin</FUNCTION>
Andrea Breth</PERSON> gehört mit Beginn der Spielzeit
<CHRON NORM="19920101">1992</CHRON> /
<NUM>93</NUM> der künstlerischen Leitung der
<ORG>Berliner Schaubühne am Lehniner Platz</ORG>
an, teilte das <ORG>Theater</ORG> am
<CHRON NORM="19910827">Dienstag</CHRON> mit. Sie
übernimmt die unbesetzte Stelle von <PERSON>Jürgen
Gosch</PERSON> , von dem sich das
<ORG>Theater</ORG> zum Jahresende <CHRON
NORM="19890101">1989</CHRON> vorzeitig getrennt hatte.
<PERSON>Andrea Breth</PERSON> inszeniert derzeit
an der <ORG>Schaubühne</ORG>
<PERSON>Arthur Schnitzlers</PERSON> Stück "Der
einsame Weg". Die Premiere ist für den <CHRON
NORM="19900930">30. September</CHRON> angekündigt. Die
38jährige <FUNCTION>Regisseurin</FUNCTION>
arbeitet auch noch am <ORG>Wiener
Burgtheater</ORG> unter dem
<PERSON><FUNCTION>Intendanten</FUNCTION>
Claus Peymann</PERSON>
.</P></BODY></NITF>

During the generation of the SGML expression for each paragraph in a text, we
calculate an inclusion lattice of the marked up phrases to order the
overlapping elements and to determine the insertion points. In this way,
export of mark up from TATOE to SGML is enabled.

For the temporal information marked up as CHRON elements, the system follows
the NITF guidelines and creates an SGML attribute NORM whose value is the
concrete date of the temporal phrase in a normalized form. In the text above,
<CHRON NORM="19910827">Dienstag</CHRON> means that Dienstag (German for
Tuesday) was 27 August 1991, calculated as the Tuesday before the date of
the message (28.8.91).

Conclusions

We have defined this export procedure from TATOE to SGML specifically for the
CLIP-ing application context. Clearly, a general solution for this
requirement has to be provided, whereby a general descriptive formalism
within TATOE is specified in order to determine the mapping from mark up
into some SGML tagged text. Nevertheless we feel that the defined export
procedure is an important step in that direction.

References

Alexa, Melina and Lothar Rostek (1997). Pattern concordances - TATOE calls XGrammar. Paper to be presented at ALLC-ACH97, Kingston, Canada, June 1997.

Alexa, Melina and Lothar Rostek (1996). Computer-assisted corpus-based text analysis with TATOE. Presented at ALLC-ACH96, Bergen, Norway. Abstracts, pp. 11-17.

Chen, Hsin-Hsi and Jen-Chang Lee (1996). Identification and classification of proper nouns in Chinese Texts. Proceedings of COLING-96, Vol. 1, Copenhagen, Denmark, pp. 222-229.

Grishman, Ralph and Beth Sundheim (1996). Message Understanding Conference - 6: a brief history. Proceedings of COLING-96, Vol. 1, Copenhagen, Denmark, pp. 466-471.

Kitani, T. and T. Mitamura (1994). An accurate morphological analysis and proper noun identification for Japanese text processing. Transactions of Information Processing Society of Japan, 35(3), pp. 404-413.

MUC-3 (1991). Proceedings of the Third Message Understanding Conference (MUC-3), August 1991, San Diego, CA, USA. Morgan Kaufmann Publishers.

MUC-5 (1993). Proceedings of the Fifth Message Understanding Conference (MUC-5), August 1993, San Diego, CA, USA. Morgan Kaufmann Publishers.

Wakao, Takahiro, Robert Gaizauskas and Yorick Wilks (1996). Evaluation of an Algorithm for the Recognition and Classification of Proper Nouns. Proceedings of COLING-96, Vol. 1, Copenhagen, Denmark, pp. 418-423.