Demonstration of TATOE: Text Analysis Tool with Object Encoding

Melina Alexa
Integrated Publication and Information Systems Institute, GMD-IPSI
alexa@darmstadt.gmd.de

Lothar Rostek
Integrated Publication and Information Systems Institute, GMD-IPSI
rostek@darmstadt.gmd.de

ALLC/ACH 1996, University of Bergen, Bergen, Norway, 1996
Editors: Anne Lindebjerg, Espen S. Ore, Øystein Reigem
Encoder: Sara A. Schmidt
Keywords: semi-automatic text analysis tool

TATOE, a Text Analysis Tool with Object Encoding, is a support tool for
semi-automated text analysis. It has been designed and implemented at
GMD-IPSI in order to assist various tasks related to multilingual,
data-driven and multi-layered text analysis. TATOE is implemented in the
Smalltalk-based programming environment VisualWorks 2.0 (from ParcPlace).
For the data modeling we have used the Smalltalk Frame Kit (SFK), an
object-oriented modeling tool which offers a spectrum of features to make model
descriptions operational (Fischer and Rostek, In Preparation).

Various processes are supported:

1. structuring, compiling, importing and working with one or more
   text corpora,
2. determining one or more, hierarchically or non-hierarchically
   structured, categorization schemata to be used for on-text mark-up,
3. importing or re-using an already existing categorization
   schema, and/or defining and structuring one's own categorization
   schema,
4. enabling the integration of automatic tagging/encoding tools
   for a supplementary annotation,
5. performing on-text annotation according to more than one
   categorization schema concurrently,
6. flexible viewing of both annotated and non-annotated text
   segments (this includes, on the one hand, selecting and arranging
   according to different criteria, such as frequency of occurrence or
   encoded category types, and, on the other hand, presenting by
   meaningful layout styles, such as fonts and colours),
7. calculating different statistics on the basis of the text
   corpora themselves, the encoded text segments and, if available,
   the hierarchical relations within the categorization schema, e.g.
   frequency of occurrence of word types, word tokens or categories, or
   how these are distributed in the text(s),
8. re-using the encoded information as input to further processing
   by exporting it in an appropriate format, e.g. SGML.

TATOE has been used for text type analysis of encyclopedic texts containing
artists' biographies and descriptions of archeological sites in order to
empirically extract the necessary information for the construction of text
type specifications for generating text sensitive to both text type and
envisaged user. More specifically, the text analysis task has been to
perform a register analysis (as defined in Systemic Functional Linguistics
(Halliday 1985)) by means of identifying in a corpus of texts belonging to
the same text type a number of situational features. Corpus-based register
analysis for text generation aims at identifying, determining and then
representing, where necessary, the mechanisms of correlation or constraint
between linguistic form (linguistic features) and communicative context
(situational features). The practical aim is to provide a text generation
system with information about situational features and what constraints
particular situational features impose on linguistic expression (see Alexa
1995).

Furthermore, TATOE is presently being used (and further developed) for the
analysis of a corpus of news messages (from the Deutsche Presse Agentur
(DPA)) in order to identify semantic patterns and annotate text segments
according to a domain-dependent semantic categorization. Such information
can then be used to support fact extraction, the process of filling in the
DPA "fact cards" and the development of a "parliament" thesaurus.

The demonstration will illustrate the different needs of the specific text
analysis tasks by means of presenting the main functionalities of the tool.
It will be shown how the user can define the categorization schema or
schemata to be used for analysis. A number of techniques for maintaining the
schema will be demonstrated. For the specific application task different
classification schemata have been set up and used, in order to annotate and
extract information according to different levels of linguistic description.
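
As a rough illustration of what annotating one text against several
categorization schemata could look like, the following Python sketch models a
hierarchical schema, attaches categories from two schemata to the same token
spans, and counts category frequencies with roll-up to parent categories. It is
not TATOE's implementation (TATOE is written in Smalltalk with SFK); the names
`Schema`, `Annotation` and `category_frequencies`, as well as the example
categories and sentence, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Schema:
    """A categorization schema; `parents` maps each category to its parent.

    Hypothetical model for illustration, not TATOE's data model.
    """
    name: str
    parents: dict = field(default_factory=dict)

    def add(self, category, parent=None):
        self.parents[category] = parent

    def ancestors(self, category):
        """Return the category followed by all of its ancestors, nearest first."""
        chain = []
        while category is not None:
            chain.append(category)
            category = self.parents[category]
        return chain


@dataclass(frozen=True)
class Annotation:
    """One category from one schema assigned to the token span [start, end)."""
    schema: str
    category: str
    start: int
    end: int


def category_frequencies(annotations, schema):
    """Count annotations per category, crediting each ancestor category as well."""
    counts = Counter()
    for ann in annotations:
        if ann.schema == schema.name:
            counts.update(schema.ancestors(ann.category))
    return counts


# Illustrative schemata: one hierarchical, one flat (assumed categories).
semantic = Schema("semantic")
semantic.add("Entity")
semantic.add("Person", parent="Entity")
semantic.add("Place", parent="Entity")

syntactic = Schema("syntactic")
syntactic.add("NounPhrase")

tokens = "Rembrandt was born in Leiden".split()

# The same spans carry categories from both schemata concurrently.
annotations = [
    Annotation("semantic", "Person", 0, 1),
    Annotation("syntactic", "NounPhrase", 0, 1),
    Annotation("semantic", "Place", 4, 5),
    Annotation("syntactic", "NounPhrase", 4, 5),
]

semantic_counts = category_frequencies(annotations, semantic)
syntactic_counts = category_frequencies(annotations, syntactic)
word_type_counts = Counter(token.lower() for token in tokens)
```

Keeping the schema name on each annotation is one simple way to let several
schemata coexist on the same text segment; the parent map is what allows
frequencies to be aggregated along the hierarchical relations of a schema.
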
Possibilities for either determining one's own classification schema or
importing one will be demonstrated.

The selection and on-text annotation procedures together with those for
updating, correcting and refining the annotation will be demonstrated. TATOE
assists mostly those tasks of the analysis which involve an intellectual
effort. However, it provides automatic support for extracting frequency of
occurrence information as well as frequency of already annotated words or
text segments according to a particular categorization schema. The different
possibilities for extracting and presenting frequency of occurrence
information will be presented.

References

Alexa, Melina (1995). Making principled selections: A methodology for
register analysis and description for text generation. Presented at the
22nd International Systemic-Functional Congress, Beijing, China, July 1995.

Fischer, Dietrich and Rostek, Lothar (In Preparation, 1996). SFK: A
Smalltalk Frame Kit. Technical report, GMD/Institut fuer Integrierte
Publikations- und Informationssysteme.

Halliday, Michael A.K. (1985). An Introduction to Functional Grammar.
London: Edward Arnold.