Construction of Russian Corpus-Driven Dictionary Based on Monitor Corpora

Serge Yablonsky

St. Petersburg University of Transport, Russicon Company, Russia

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000

Editors: Jean Anderson, Amal Chatterjee, Christian Kay, Margaret Scott

Encoder: Sara Schmidt

1. Introduction

Monitor corpora are of interest to lexicographers and language learners, who can trawl a stream of new texts for new words or for changing meanings of old words (Collins COBUILD, 1995; McEnery T., Wilson A., 1996). Their main advantages are that they are not static and that they provide a large and broad sample of language.

The application of language processing technologies to the construction of shareable, multifunctional language corpora has produced promising results (Varile G. B., Zampolli A., 1996).

Progress in Russian language processing makes it possible to apply its results to the creation of Russian monitor corpora that are closely linked, by means of linguistic software, to a set of electronic dictionaries. Our approach depends in particular on the language processor Russicon and on extensive use of the Russicon electronic dictionaries (Yablonsky S.A., 1998).

2. Composition of the corpora

The main part of the corpus was described in (Yablonsky S.A., 1999a, b). The current corpus draws on a wide range of 19th- and 20th-century Russian writing: literature, criticism, philosophy, religion, newspapers, memoirs, law, business, computing, historical documents, stenographic records, translations, folklore, Internet literature, "underground" literature, etc.

The texts are taken from printed sources, CD resources and the Internet. ASCII and Unicode are the basic text encoding standards. In addition, SGML, HTML and XML markup is produced by specially designed C conversion programs, and SGML configuration of the texts is done with the SoftQuad SGML Publishing Suite. The text collection will continue to grow as resources are created and encoded. The open-ended (constantly growing) Russian monitor corpus supports dictionary building because it enables lexicographers to keep track of new words entering the language, of existing words changing their meanings, and of shifts in the balance of their use across genres.
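The way a monitor corpus surfaces candidate new words can be illustrated with a minimal Python sketch. It is not the Russicon implementation: the function names, the regular-expression tokenizer and the `min_count` threshold are all assumptions made for illustration, and the comparison is done on raw word forms, whereas a real system would presumably lemmatize first.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters (Cyrillic or Latin), lowercased.
# This is an illustrative stand-in, not the Russicon tokenizer.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word forms."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def new_word_candidates(known_vocabulary, new_texts, min_count=2):
    """Return word forms from new_texts that are absent from known_vocabulary.

    known_vocabulary: set of lowercase word forms already in the dictionary.
    new_texts: iterable of newly added texts.
    min_count: hypothetical threshold that filters out one-off noise.
    """
    counts = Counter()
    for text in new_texts:
        counts.update(tokenize(text))
    return {
        word: count
        for word, count in counts.items()
        if word not in known_vocabulary and count >= min_count
    }


if __name__ == "__main__":
    known = {"новый", "текст"}
    incoming = ["Новый блог и новый сайт", "Блог про текст"]
    print(new_word_candidates(known, incoming))
```

Candidates returned this way would still need review by a lexicographer; the sketch only narrows the stream of new texts down to forms worth inspecting.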

3. Corpus-driven dictionary

The chief distinction of the corpus is its close connection with a set of Russian electronic dictionaries and language processing tools, in particular the language processor Russicon.

Every word of the corpus is simultaneously an entry word of the corpus-driven dictionary, and vice versa. For any input form of a Russian word, the dictionary outputs:

one or several lemmas (lexical homonyms);

one or several sets (in the case of morphological homonyms) of the following grammatical characteristics: part of speech, case, gender, number, tense, person, degree of comparison, voice, aspect, mood, form, type, transitivity, reflexivity, animacy;

the synonym row(s);

the antonyms;

the precise definitions;

the explanatory comments;

all or several examples of usage in the corpora.
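The output listed above can be pictured as a record keyed by word form. The following Python sketch shows one possible shape for such a record; the class names, field names and the sample entry are assumptions for illustration only and do not reproduce the actual Russicon dictionary format.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """One lemma together with one set of grammatical characteristics."""
    lemma: str
    grammar: dict  # e.g. {"part_of_speech": "noun", "case": "genitive"}


@dataclass
class DictionaryEntry:
    """Hypothetical record returned for one input word form."""
    word_form: str
    analyses: list = field(default_factory=list)      # lexical/morphological homonyms
    synonym_rows: list = field(default_factory=list)  # each row is a list of words
    antonyms: list = field(default_factory=list)
    definitions: list = field(default_factory=list)
    comments: list = field(default_factory=list)
    examples: list = field(default_factory=list)      # usage examples from the corpus


def lookup(dictionary, word_form):
    """Return the entry for a word form, or None if it is not in the dictionary."""
    return dictionary.get(word_form.lower())


# Illustrative data: the form "стали" is both a form of the noun "сталь" (steel)
# and the past plural of the verb "стать" (to become). Only one of the possible
# noun analyses is shown.
SAMPLE_DICTIONARY = {
    "стали": DictionaryEntry(
        word_form="стали",
        analyses=[
            Analysis("сталь", {"part_of_speech": "noun", "case": "genitive",
                               "number": "singular", "gender": "feminine"}),
            Analysis("стать", {"part_of_speech": "verb", "tense": "past",
                               "number": "plural"}),
        ],
    )
}

if __name__ == "__main__":
    entry = lookup(SAMPLE_DICTIONARY, "Стали")
    print([analysis.lemma for analysis in entry.analyses])
```

Keeping the analyses as a list makes homonymy explicit: a single word form may carry several lemmas, and each lemma its own set of grammatical characteristics.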

At the same time, users can search for patterns of word combination, check word frequencies, and see examples of all the uses of particular words.
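These three query types (word frequencies, examples in context, and word-combination patterns) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pilot system's code: the function names, the simple tokenizer, and the use of `None` as a wildcard in a pattern are all hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, lowercased (illustrative only).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    return [token.lower() for token in TOKEN_RE.findall(text)]


def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, word, width=2):
    """Return (left context, keyword, right context) for every occurrence of word."""
    word = word.lower()
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines


def find_combinations(texts, pattern):
    """Find token sequences matching pattern; None matches any single word."""
    hits = []
    size = len(pattern)
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            if all(p is None or p == t for p, t in zip(pattern, window)):
                hits.append(tuple(window))
    return hits


if __name__ == "__main__":
    corpus = ["Новый текст и новый словарь", "Старый словарь"]
    print(word_frequencies(corpus)["словарь"])
    print(concordance(corpus, "словарь"))
    print(find_combinations(corpus, [None, "словарь"]))
```

A production system would more likely run such queries against an index or database rather than rescanning the texts, but the sketch shows what each query returns.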

The pilot system is implemented on the IBM PC using Visual Basic 6.0 and MS SQL Server 7.0, and works in personal (standalone) and local network modes.

References

Belyaev, Surcis, Yablonsky (1993). Russian Language Processor RUSSICON: Design and Applications. Proceedings of the East-West Artificial Intelligence Conference (EWAIC-93), Moscow, 175-180.

Collins COBUILD on CD ROM (1995). London: HarperCollins.

McEnery T., Wilson A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Varile G. B., Zampolli A. (1996). Survey of the State of Art in Human Language Technology. Cambridge: Cambridge University Press.

Yablonsky S.A. (1998). Russicon Slavonic Language Resources and Software. In: Rubio, Gallardo, Castro, Tejada (eds.), Proceedings of the First International Conference on Language Resources & Evaluation, Granada, Spain, 1141-1147.

Yablonsky S.A. (1999a). Russian Written Language Corpora Development. Proceedings of the International Seminar Dialog99, May 30-June 8, Tarussa, Russia.

Yablonsky S.A. (1999b). Russian 20th Century Literature Digital Library for Language Teaching. Proceedings of the International Conference of the ACH/ALLC Digital Libraries for Humanities Scholarship and Teaching, June 9-13, 1999, University of Virginia, Charlottesville, Virginia, USA.