Demonstration of TATOE: Text Analysis Tool with Object Encoding

Melina Alexa

Integrated Publication and Information Systems Institute, GMD-IPSI, alexa@darmstadt.gmd.de

Lothar Rostek

Integrated Publication and Information Systems Institute, GMD-IPSI, rostek@darmstadt.gmd.de

1996

University of Bergen

Bergen, Norway

ALLC/ACH 1996

Editors: Anne Lindebjerg, Espen Ore, Øystein Reigem

Encoder: Sara Schmidt

semi-automatic text analysis tool

TATOE, a Text Analysis Tool with Object Encoding, is a support tool for semi-automated text analysis. It has been designed and implemented at GMD-IPSI to assist various tasks in multilingual, data-driven and multi-layered text analysis. TATOE is implemented in the Smalltalk-based programming environment VisualWorks 2.0 (from ParcPlace). For the data modeling we have used the Smalltalk Frame Kit (SFK), an object-oriented modeling tool which offers a spectrum of features to make model descriptions operational (Fischer and Rostek, In Preparation).

Various processes are supported:

1. structuring, compiling, importing and working with one or more text corpora,

2. determining one or more, hierarchically or non-hierarchically structured, categorization schemata to be used for on-text mark up,

3. importing or re-using an already existing categorization schema, and (or) defining and structuring one's own categorization schema,

4. enabling the integration of automatic tagging/encoding tools for a supplementary annotation,

5. performing on-text annotation according to more than one categorization schema concurrently,

6. flexible viewing of both annotated and non-annotated text segments (this includes on the one hand selecting and arranging according to different criteria - frequency of occurrences, encoded category types, etc. - and on the other hand presenting by meaningful layout styles - fonts, colours, etc.),

7. calculating different statistics on the basis of the text corpora themselves, the encoded text segments and - if available - the hierarchical relations within the categorization schema, e.g. frequency of occurrence of word types, word tokens or categories, or how these are distributed in the text(s),

8. re-using the encoded information as input to further processing by exporting it in an appropriate format, e.g. SGML.
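The processes listed above can be illustrated with a minimal sketch of a hierarchical categorization schema, annotations from more than one schema on the same text, and an SGML-style export. TATOE itself is implemented in Smalltalk with SFK; the Python below is not its data model, and every name in it (`Schema`, `Annotation`, `export_sgml`, the category labels) is a hypothetical choice made for illustration only.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Schema:
    """Hypothetical categorization schema.

    `parents` maps each category to its parent category, or to None for a
    root, which is enough to express a simple hierarchy.
    """
    name: str
    parents: dict = field(default_factory=dict)

    def ancestors(self, category):
        """Return the chain from `category` up to its root category."""
        chain = []
        while category is not None:
            chain.append(category)
            category = self.parents[category]
        return chain


@dataclass(frozen=True)
class Annotation:
    """A category from one schema attached to the token span [start, end)."""
    schema: str
    category: str
    start: int
    end: int


def export_sgml(tokens, annotations, schema_name):
    """Render the annotations of one schema as inline SGML-style tags.

    Assumption: spans belonging to the same schema do not overlap.
    """
    selected = [a for a in annotations if a.schema == schema_name]
    open_at = {a.start: a.category for a in selected}
    close_at = {a.end - 1: a.category for a in selected}
    out = []
    for i, token in enumerate(tokens):
        text = escape(token)
        if i in open_at:
            text = f"<{open_at[i]}>{text}"
        if i in close_at:
            text = f"{text}</{close_at[i]}>"
        out.append(text)
    return " ".join(out)


# Two schemata annotate the same tokens concurrently.
semantic = Schema("semantic", {"entity": None, "person": "entity", "place": "entity"})
tokens = ["Picasso", "was", "born", "in", "Malaga"]
annotations = [
    Annotation("semantic", "person", 0, 1),
    Annotation("semantic", "place", 4, 5),
    Annotation("register", "theme", 0, 1),
]
print(export_sgml(tokens, annotations, "semantic"))
```

Keeping the schema name on each annotation lets several schemata share one text without interfering; exporting one schema at a time sidesteps the problem of overlapping tags across schemata.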

TATOE has been used for text type analysis of encyclopedic texts containing artists' biographies and descriptions of archeological sites. The goal was to extract empirically the information needed to construct text type specifications for generating text that is sensitive to both text type and envisaged user. More specifically, the text analysis task has been to perform a register analysis (as defined in Systemic Functional Linguistics (Halliday 1985)) by identifying a number of situational features in a corpus of texts belonging to the same text type. Corpus-based register analysis for text generation aims at identifying, determining and then representing, where necessary, the mechanisms of correlation or constraint between linguistic form (linguistic features) and communicative context (situational features). The practical aim is to provide a text generation system with information about situational features and the constraints that particular situational features impose on linguistic expression (see Alexa 1995).

Furthermore, TATOE is presently being used (and further developed) for the analysis of a corpus of news messages (from the Deutsche Presse Agentur (DPA)) in order to identify semantic patterns and annotate text segments according to a domain-dependent semantic categorization. Such information can then be used to support fact extraction, the process of filling in the DPA "fact cards", and the development of a "parliament" thesaurus.

The demonstration will illustrate the different needs of the specific text analysis tasks by presenting the main functionalities of the tool. It will be shown how the user can define the categorization schema or schemata to be used for analysis. A number of techniques for maintaining the schema will be demonstrated. For the specific application task, different classification schemata have been set up and used in order to annotate and extract information according to different levels of linguistic description. Possibilities for either determining one's own classification schema or importing one will be demonstrated.

The selection and on-text annotation procedures together with those for updating, correcting and refining the annotation will be demonstrated. TATOE assists mostly those tasks of the analysis which involve an intellectual effort. However, it provides automatic support for extracting frequency of occurrence information as well as frequency of already annotated words or text segments according to a particular categorization schema. The different possibilities for extracting and presenting frequency of occurrence information will be presented.
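As a rough illustration of the frequency information mentioned above, the following sketch counts word types and tokens and tallies annotated categories per schema, optionally rolling counts up a category hierarchy. It is written in Python for illustration and does not reflect how TATOE computes its statistics; the function names, the case-folding of word types, and the sample categories are all assumptions.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count occurrences of each word type (assumption: case-folded tokens)."""
    return Counter(token.lower() for token in tokens)


def category_frequencies(annotations, parents=None):
    """Count annotated segments per (schema, category) pair.

    `annotations` is an iterable of (schema, category) pairs. If `parents`
    (a hypothetical category -> parent mapping) is given, each annotation
    is also counted for every ancestor of its category.
    """
    counts = Counter()
    for schema, category in annotations:
        while category is not None:
            counts[(schema, category)] += 1
            category = parents.get(category) if parents else None
    return counts


tokens = "The artist painted the portrait".split()
words = word_frequencies(tokens)
print(len(words), "word types,", sum(words.values()), "word tokens")

annotated = [("semantic", "person"), ("semantic", "place"), ("semantic", "person")]
hierarchy = {"person": "entity", "place": "entity"}
print(category_frequencies(annotated, hierarchy))
```

The number of word types is the number of distinct keys, while the number of word tokens is the sum of all counts; rolling counts up the hierarchy is one possible way to use the hierarchical relations within a schema.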

References

Alexa, Melina (1995). Making principled selections: A methodology for register analysis and description for text generation. Presented at the 22nd International Systemic-Functional Congress, Beijing, China, July 1995.

Fischer, Dietrich and Rostek, Lothar (In Preparation, 1996). SFK: A Smalltalk Frame Kit. Technical report, GMD/Institut fuer Integrierte Publikations- und Informationssysteme.

Halliday, Michael (1985). An Introduction to Functional Grammar. London: Edward Arnold.