Construction of Russian Corpus-Driven Dictionary Based and Monitor Corpora

Serge A. Yablonsky
St. Petersburg University of Transport, Russicon Company, Russia

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt

1. Introduction

Monitor corpora are of interest to lexicographers and language learners, who can trawl a stream of new texts looking for the occurrence of new words or for changing meanings of old words (Collins COBUILD, 1995; McEnery T., Wilson A., 1996). Their main advantages are that they are not static and that they provide a large and broad sample of the language. The application of language processing technologies to the construction of shareable, multifunctional language corpora has led to promising results (Varile G. B., Zampolli A., 1996). Progress in Russian language processing makes it possible to apply its results to the creation of Russian monitor corpora that are closely linked, by means of linguistic software, to a set of electronic dictionaries. Our approach relies in particular on the language processor Russicon and on extensive use of the Russicon electronic dictionaries (Yablonsky S.A., 1998).

2. Composition of the corpora

The main part of the corpus was described in (Yablonsky S.A., 1999 a,b). The current corpus is based on a wide representation of Russian 19th- and 20th-century literature, criticism, philosophy, religion, newspapers, memoirs, law, business, computing, historical documents, stenographic records, translations, folklore, Internet literature, "underground" literature, etc. The texts are taken from printed resources, CD resources and the Internet. ASCII and Unicode plain text are the basic text formats. In addition, SGML, HTML and XML markup is produced by purpose-built conversion programs written in C. SGML configuration of the texts is done with the SoftQuad SGML Publishing Suite. The text collection will continue to grow as new resources are created and encoded.
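The conversion from plain text to marked-up text can be illustrated with a minimal sketch. It is not the authors' conversion program (those are described as written in C); Python is used here only for brevity, and the function name `text_to_xml`, the element names `text` and `p`, and the attributes `title`, `genre` and `n` are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def text_to_xml(raw_text, title, genre):
    """Wrap plain text in minimal XML, one <p> per blank-line-separated paragraph.

    Element and attribute names are hypothetical, chosen for illustration only.
    """
    # Split on blank lines and collapse internal whitespace within each paragraph.
    blocks = re.split(r"\n\s*\n", raw_text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    paragraphs = [p for p in paragraphs if p]

    lines = [f"<text title={quoteattr(title)} genre={quoteattr(genre)}>"]
    for number, paragraph in enumerate(paragraphs, start=1):
        # escape() replaces &, < and > so the output stays well-formed.
        lines.append(f'  <p n="{number}">{escape(paragraph)}</p>')
    lines.append("</text>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Первый абзац,\nвторая строка.\n\nA & B < C"
    print(text_to_xml(sample, "Sample", "folklore"))
```

Keeping the genre as an attribute on the text element is one simple way to support later comparison of word use across genres; a real corpus header would carry considerably more metadata.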
The open-ended (constantly growing) Russian monitor corpus supports dictionary building because it lets lexicographers keep track of new words entering the language, existing words changing their meanings, and shifts in the balance of their use across genres.

3. Corpus-driven dictionary

The chief distinguishing feature of the corpus is its close connection with the set of Russian electronic dictionaries and language processing tools, in particular the language processor Russicon. Every word of the corpus is at the same time an entry word of the corpus-driven dictionary, and vice versa.

For any input form of a Russian word, the dictionary outputs:

one or several lemmas (lexical homonyms);
one or several sets (in the case of morphological homonyms) of the following grammatical characteristics: part of speech, case, gender, number, tense, person, degree of comparison, voice, aspect, mood, form, type, transitivity, reflexivity, animacy;
the synonym set(s);
the antonyms;
the precise definitions;
the explanatory comments;
all or several examples of usage in the corpora.

At the same time, users can search for patterns of word combination, check word frequencies, and see examples of all uses of particular words.

The pilot system is implemented on the IBM PC using Visual Basic 6.0 and MS SQL Server 7.0, and runs in both single-user and local-network modes.

References

B. M. Belyaev, A. S. Surcis, S. A. Yablonsky. Russian Language Processor RUSSICON: Design and Applications. Proceedings of the East-West Artificial Intelligence Conference (EWAIC-93), Moscow, 1993, 175-180.

Collins COBUILD on CD ROM. London: HarperCollins, 1995.

T. McEnery, A. Wilson. Corpus Linguistics. Edinburgh: Edinburgh University Press, 1996.

G. B. Varile, A. Zampolli. Survey of the State of the Art in Human Language Technology. Cambridge: Cambridge University Press, 1996.

S. A. Yablonsky. Russicon Slavonic Language Resources and Software. In: A. Rubio, N. Gallardo, R. Castro, A.
Tejada (eds.), Proceedings of the First International Conference on Language Resources & Evaluation, Granada, Spain, 1998, 1141-1147.

S. A. Yablonsky. Russian Written Language Corpora Development. Proceedings of the International Seminar Dialog99, May 30-June 8, Tarussa, Russia, 1999a.

S. A. Yablonsky. Russian 20th Century Literature Digital Library for Language Teaching. Proceedings of the International Conference of the ACH/ALLC "Digital Libraries for Humanities Scholarship and Teaching", June 9-13, 1999, University of Virginia, Charlottesville, Virginia, USA, 1999b.
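
As an illustration of the dictionary output listed in Section 3, the following minimal sketch shows how a word form might be linked to its lemmas, grammatical characteristics, corpus examples and frequency. It is not the Russicon interface, and the pilot system itself is built with Visual Basic and SQL Server; the class names `Analysis` and `Entry`, the function `build_entry`, and the toy morphology and corpus are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Analysis:
    lemma: str
    grammar: dict  # e.g. {"part_of_speech": "noun", "case": "genitive"}


@dataclass
class Entry:
    word_form: str
    analyses: list                      # one Analysis per homonymous reading
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    definition: str = ""
    examples: list = field(default_factory=list)
    frequency: int = 0


# Toy stand-in for a morphological dictionary (hypothetical data).
# "стали" is ambiguous: a form of the noun "сталь" or of the verb "стать".
TOY_MORPHOLOGY = {
    "стали": [
        Analysis("сталь", {"part_of_speech": "noun", "case": "genitive",
                           "number": "singular", "gender": "feminine"}),
        Analysis("стать", {"part_of_speech": "verb", "tense": "past",
                           "number": "plural"}),
    ],
}

TOY_CORPUS = [
    "Они стали друзьями.",
    "Мост построен из стали.",
    "Сталь прочнее железа.",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def build_entry(word_form, corpus_sentences, morphology):
    """Assemble a dictionary entry for a word form from corpus evidence."""
    form = word_form.lower()
    examples = [s for s in corpus_sentences if form in tokenize(s)]
    frequency = sum(tokenize(s).count(form) for s in corpus_sentences)
    return Entry(form, morphology.get(form, []),
                 examples=examples, frequency=frequency)


if __name__ == "__main__":
    entry = build_entry("стали", TOY_CORPUS, TOY_MORPHOLOGY)
    print([a.lemma for a in entry.analyses], entry.frequency)
    for sentence in entry.examples:
        print(sentence)
```

Because the entry is rebuilt from the corpus on each lookup, its examples and frequency change as new texts are added, which is the sense in which such a dictionary is corpus-driven.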