Construction of Russian Corpus-Driven Dictionary Based and Monitor Corpora

Serge A. Yablonsky
St. Petersburg University of Transport, Russicon Company, Russia

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt

1. Introduction

Monitor corpora are of interest to lexicographers and language learners who
can trawl a stream of new texts looking for the occurrence of new words, or
for changing meanings of old words (Collins COBUILD, 1995; McEnery T.,
Wilson A., 1996). Their main advantages are that they are not static and
provide a large and broad sample of language.

The application of language processing technologies to the construction of
shareable and multifunctional language corpora has led to promising results
(Varile G. B., Zampolli A., 1996).

Progress in Russian language processing makes it possible to apply its
results to creating Russian monitor corpora that are closely linked, with
the help of linguistic software, to a set of electronic dictionaries. Our
approach relies in particular on the language processor Russicon and on
extensive use of the Russicon electronic dictionaries (Yablonsky S.A., 1998).
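
As a rough illustration of how a monitor corpus can be tied to a dictionary,
the following Python sketch flags word forms in newly arrived texts that are
absent from a known vocabulary. It is not Russicon's interface: the set
known_forms, the functions tokenize and find_new_words, and the min_count
threshold are hypothetical names chosen for this example, and a real system
would presumably compare lemmas produced by a morphological analyser rather
than raw word forms.

```python
import re
from collections import Counter

# Matches runs of lowercase Cyrillic letters, allowing internal hyphens.
# The text is lowercased before matching, so no case flag is needed.
WORD_RE = re.compile(r"[а-яё]+(?:-[а-яё]+)*")


def tokenize(text):
    """Split a text into lowercase Russian word forms."""
    return WORD_RE.findall(text.lower())


def find_new_words(texts, known_forms, min_count=1):
    """Count word forms that are absent from known_forms.

    known_forms is a plain set standing in for a dictionary lookup
    (an assumption of this sketch, not the actual dictionary interface).
    Only forms seen at least min_count times are returned.
    """
    counts = Counter()
    for text in texts:
        for token in tokenize(text):
            if token not in known_forms:
                counts[token] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical miniature vocabulary and stream of new texts.
    known_forms = {"новый", "корпус", "словарь", "и"}
    new_texts = ["Новый корпус и блог", "Блог и словарь"]
    print(find_new_words(new_texts, known_forms))
```

Raising min_count filters out one-off misprints, at the cost of delaying the
detection of genuinely new but still rare words.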

2. Composition of the corpora

The main part of the corpus was described in (Yablonsky S.A., 1999 a,b).
The current corpus draws on a wide range of Russian 19th and 20th century
literature, criticism, philosophy, religion, newspapers, memoirs, law,
business, computing, historical documents, stenographic records,
translations, folklore, Internet literature, "underground" literature, etc.

The texts are taken from printed resources, CD resources and the Internet.
ASCII and Unicode are the basic text encoding standards. In addition, SGML,
HTML and XML markup is added by specially designed C conversion programs. SGML
configuration of the texts is done with the SoftQuad SGML Publishing Suite. The text
collection will continue to grow as resources are created and encoded. The
open-ended (constantly growing) Russian monitor corpus helps in dictionary
building as it enables lexicographers to keep on top of new words entering
the language, or existing words changing their meanings, or the balance of
their use according to genre, etc.

3. Corpus-driven dictionary

The chief distinction of the corpus is its strong connection with the set
of Russian electronic dictionaries and language processing tools, in
particular the language processor Russicon.

Every word of the corpus is simultaneously an entry word of the
corpus-driven dictionary, and vice versa. For any input form of a Russian
word, the dictionary outputs:

- one or several lemmas (lexical homonyms);
- one or several sets (in the case of morphological homonyms) of the
  following grammatical characteristics: part of speech, case, gender,
  number, tense, person, degree of comparison, voice, aspect, mood,
  form, type, transitivity, reflexivity, animacy;
- the synonym row(s);
- the antonyms;
- the precise definitions;
- the explanatory comments;
- all or several examples of usage in the corpora.

At the same time, users can search for patterns of word combination, check
word frequencies, and see examples of all the uses of particular words.

The pilot system is implemented on an IBM PC using Visual Basic 6.0 and MS SQL
Server 7.0 and works in personal and local network modes.

References

Belyaev B.M., Surcis A.S., Yablonsky S.A. (1993). Russian Language Processor
RUSSICON: Design and Applications. Proceedings of the East-West Artificial
Intelligence Conference (EWAIC-93), Moscow, 175-180.

Collins COBUILD on CD ROM (1995). London: HarperCollins.

McEnery T., Wilson A. (1996). Corpus Linguistics. Edinburgh: Edinburgh
University Press.

Varile G.B., Zampolli A. (1996). Survey of the State of Art in Human Language
Technology. Cambridge: Cambridge University Press.

Yablonsky S.A. (1998). Russicon Slavonic Language Resources and Software.
In: Rubio A., Gallardo N., Castro R., Tejada A. (eds.), Proceedings of the
First International Conference on Language Resources & Evaluation, Granada,
Spain, 1141-1147.

Yablonsky S.A. (1999a). Russian Written Language Corpora Development.
Proceedings of the International Seminar Dialog99, May 30-June 8, Tarussa,
Russia.

Yablonsky S.A. (1999b). Russian 20th Century Literature Digital Library for
Language Teaching. Proceedings of the International Conference of the
ACH/ALLC Digital Libraries for Humanities Scholarship and Teaching,
June 9-13, 1999, University of Virginia, Charlottesville, Virginia, USA.