A System for Dynamic Text Corpus Management (with an Example Corpus of the Russian Mass Media of the 1990s)

Grigori Sidorov, National Polytechnic Institute (IPN), Mexico
Anatoly Baranov, Russian Academy of Sciences, Russia
Mikhail Mikhailov, Russian Academy of Sciences, Russia

ALLC/ACH 2000, University of Glasgow, Glasgow

We present a system for text corpus processing built around the idea of a
"dynamic text corpus". With its help a user can search for examples of usage
(words, phrases, and even morphemes), build word lists and concordances, and compile
his own subcorpus. The software was used to compile a text corpus of modern
Russian mass media. It is a collection of texts from Russian newspapers and
magazines of the 1990s with a total size of about 15 Mb. Each text of the corpus
is classified by a set of parameters, including source, date, author(s), genre, and topic(s). These parameters are later used to generate subcorpora that conform to users' needs.

Introduction

Corpus linguistics is the part of computational linguistics that deals with the
problems of compilation, representation, and analysis of large text collections.
One of the most complex problems in modern corpus linguistics is defining the principles of text corpus compilation. Ideally, a text corpus should meet the criterion of representativeness while remaining much smaller than the whole field it is meant to represent. On the other hand, the representativeness
of the text corpus is directly connected with the research objectives. For
example, research on text macrostructure requires quite different parameters than sociolinguistic research or the description of the contexts of usage of a particular morpheme or word. The difficulty of reconciling statistical
representativeness and user demands means that many existing corpora lack explicit and clear criteria for text selection. For example, there are no clear-cut selection criteria for the well-known Birmingham corpus of English texts; the situation is the same with the German text corpora. We suggest a strategy for text corpus
compilation that allows a user to create his own subset of texts from a corpus
for his own task (as a new subcorpus). We call the initial text corpus, which serves as the source for further selection and manipulation, together with the corresponding software, a dynamic text corpus. To compile our corpus we used texts from the Russian mass media of the 1990s.

General strategy of initial text corpus compilation

Taking into account the requirement of representativeness, we paid special
attention to choosing the most prominent mass media editions of different political orientations, which were particularly important for society during the period covered by the research (the 1990s), and to representing them in proportion to their popularity and significance. As the criterion of popularity we used the results of the most recent elections, in which, roughly speaking, 25 percent voted for the communists, 10 for the ultra left, 25 for the right, and 40 for the center. The second important factor of corpus
compilation was the quantity of texts: there should be enough texts to reflect the relevant features of the field. The upper limit was set only by pragmatic considerations, namely disk space and the speed of the service software. In our case, during the project carried out in 1996-1998, we collected around 15 megabytes of text.

As stated above, different users
have different tasks and expect different things from the text corpus. It is
also necessary to take into account that some users may not be linguists. Such users may be interested in how certain events were reflected in the mass media during a certain period, and they would probably prefer to read whole texts rather than concordances. To accommodate these different requirements, the corpus should be compiled from whole texts rather than from extracts. The idea of using extracts (so-called sampling) was
popular at the early stage of corpus linguistics, e.g., the famous "Brown corpus", which consists of text extracts of roughly 2,000 words each. It is also necessary to
take into account that linguists from different areas have different requirements for text corpora. For example, for morphological or syntactic research a one-million-word corpus would be sufficient; sometimes it is even more convenient to use a relatively small corpus, because concordances of function words may occupy thousands of pages and most of the examples will be trivial. However, even for grammar research it seems reasonable to include texts of different structure and genre in the corpus. At the same time, the corpus should be large enough to ensure the presence of rare words; only then is it interesting for a lexicologist or a lexicographer. Thus, the task of the compilers of a text corpus is to take into account all the different and sometimes contradictory user requirements.

We suggest allowing the user to
construct his own subset of texts (his own corpus) from the dynamic text corpus.
To make this possible, each document is assigned a search pattern (its set of feature values), which allows the software to filter the initial corpus and construct a corpus that fits the needs of the user.

Encoding of corpus units

After analysis of the text data, the following parameters were chosen as corpus-forming:

1. Source (the printed mass media editions),
2. Author (about 1000 authors),
3. Title of the article (1369 articles),
4. Political orientation (left, ultra left, right, center),
5. Genre (memoir, interview, critique, discussion, essay, reportage, review, article, feuilleton),
6. Theme (internal policy, external policy, literature, arts, etc.; 39 themes in total),
7. Date (the exact date of publication; in our case, articles published during the 1990s).
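The original system stores these parameters in a database (see the software description below); purely as an illustration, the same information can be thought of as one record per text. The following Python sketch is not part of the described software, and its field names are hypothetical:

    from dataclasses import dataclass, field
    from datetime import date

    # Illustrative record for one corpus text; the field names simply mirror
    # the corpus-forming parameters listed above.
    @dataclass
    class CorpusText:
        source: str          # printed edition, e.g. "Zavtra"
        author: str          # author(s) of the article
        title: str           # title of the article
        orientation: str     # "left", "ultra left", "right", or "center"
        genre: str           # e.g. "interview", "essay", "reportage"
        themes: list[str] = field(default_factory=list)  # one or more of the 39 themes
        pub_date: date = date(1990, 1, 1)                # exact date of publication
        text: str = ""       # the whole text of the article, not an extract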
The following printed editions (magazines and newspapers) were used: Vek, Druzhba Narodov, Zavtra, Znamia, Izvestiya, Itogi, Kommunist, Literaturnaya gazeta, Molodaya gvardiya, Moskovskiy komsomolec, Moskovskie novosti, Nash sovremennik, Nezavisimaya gazeta, Novyi mir, Ogonyok, Rossiiskaya gazeta, Russki vestnik, Segodnya, Sobesednik, Sovetskaya Rossiya, Trud, Ekspert, Elementy, Evraziiskoe obozrenie. Every text in the corpus is
characterized by its set of values for these features; at the current stage the classification was done manually. The most strongly represented sources are Vek (8%), Zavtra (14%), Itogi (11%), Literaturnaya gazeta (6%), Moskovskie novosti (8%), and Novy mir (8%).

Software description

A text corpus is incomplete and hard to work with without software that
provides a user-friendly interface and allows different kinds of processing. A general problem of corpus software is selecting the texts to work with. If the user wants to deal with only certain parts of the corpus, he usually has to do it manually by choosing file names; this is typical of corpus software and it is not convenient. The other possibility, keeping all text files merged into one, does not allow any further selection within the corpus. In our system, however, it is possible to select texts automatically using their feature sets. All the
user has to do is to describe his requirements for his own corpus. We should
mention that the collection of texts with their descriptions is only raw material, while in the traditional technology it is the final result. In the technology suggested in this article, the 'big corpus' is a source for the compilation of subcorpora that answer the user's needs with greater accuracy.

The initial text
corpus is stored as a database where each text is a record and each parameter is a field. The texts of the articles themselves are stored in a MEMO field. The manually marked-up articles are imported into the database by a special utility.
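The abstract does not name the database engine; as a rough sketch under the assumption of a relational store, the following Python/SQLite fragment mimics the described organization: one record per text, one field per parameter, the full article in a TEXT column standing in for the MEMO field, and a selection routine that builds a query from user-chosen parameter values. All table, column, and function names are illustrative, not the system's actual interface:

    import sqlite3

    # Hypothetical schema: one record per text, one column per corpus-forming
    # parameter, and the full article text in a TEXT column (the MEMO analogue).
    conn = sqlite3.connect("corpus.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS texts (
            id          INTEGER PRIMARY KEY,
            source      TEXT,
            author      TEXT,
            title       TEXT,
            orientation TEXT,
            genre       TEXT,
            theme       TEXT,
            pub_date    TEXT,   -- ISO date string, e.g. '1996-05-12'
            body        TEXT    -- full text of the article
        )
    """)

    def select_subcorpus(conn, **criteria):
        """Return the texts matching the chosen parameter values; a simple
        stand-in for the dialogue-driven, QBE-style selection described below."""
        clauses = [f"{column} = ?" for column in criteria]  # column names, not user data
        sql = "SELECT title, body FROM texts"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return conn.execute(sql, tuple(criteria.values())).fetchall()

    # Example: a subcorpus of all interviews published in Zavtra.
    subcorpus = select_subcorpus(conn, source="Zavtra", genre="interview")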
On the basis of this information a user can create his own corpus by indicating
a set of parameters. He does it by going through a sequence of dialogue
routines, answering questions or choosing from the lists. The resulting corpus
is a text file containing the texts matching the selected parameters. The system
provides the following main functions:

1. Standard browsing of the texts and their parameters.
2. Selection and ordering of texts according to the chosen parameters or their logical combinations. The system has a standard set of QBE queries which are translated automatically into SQL; experienced users can write SQL queries directly.
3. Generating a text corpus that is a subset of the initial corpus on the basis of a stochastic choice and a given percentage for each parameter (see the sketch after this list).
4. Generating a user's text corpus.
5. Browsing the user's text corpora and text processing: building concordances or word lists.
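Item 3 above is described only in general terms; the following Python function is a sketch, under our own assumptions, of how such a proportional stochastic subset could be drawn: within each value of one parameter (source, theme, or genre), a given percentage of texts is kept at random. The function name and the data layout (a list of dictionaries) are illustrative:

    import random
    from collections import defaultdict

    def stochastic_subset(texts, parameter, percentage, seed=None):
        """Randomly keep roughly `percentage` percent of the texts within each
        value of `parameter`, preserving the proportions of the initial corpus.
        `texts` is a list of dictionaries, one per article."""
        rng = random.Random(seed)
        groups = defaultdict(list)
        for text in texts:
            groups[text[parameter]].append(text)
        subset = []
        for value, group in groups.items():
            keep = max(1, round(len(group) * percentage / 100))
            subset.extend(rng.sample(group, keep))
        return subset

    # Example: a 25% subset of the corpus, proportional with respect to genre.
    # quarter_corpus = stochastic_subset(all_texts, parameter="genre", percentage=25)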
The program provides four standard variants of the initial corpus: the whole corpus and three proportional subsets, each containing 25% of the initial corpus, for the parameters source, theme, and genre respectively.

Conclusions

We developed a system that implements dynamic text corpus management (the
software is included in the notion of the dynamic corpus) for Russian mass media
texts. The system is applicable to any corpus. All texts of the corpus are
classified according to the parameters described above. The system ensures easy
corpus processing for the user. The corpus is representative with respect to the chosen parameters: all values and their combinations are represented in the corpus, except the impossible ones (e.g., the magazine Novy mir, a literary magazine, has no articles on finance, and the magazine Expert, a financial magazine, has no articles on literature).