An American National Corpus: a Large Balanced Text Corpus for American English

Catherine Macleod, New York University, USA
Nancy Ide, Vassar College, USA

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt
Keywords: Computational / Corpus Linguistics

Introduction:
The importance of corpora as resources has become more and more accepted over
the years. Many types of corpora have been used for different purposes, but
if one is searching for examples of "general" application and not
restricting oneself to a particular sub-language, the development of a
balanced corpus is of primary importance. Of equal importance is the
adoption of a uniform annotation standard. The main areas of application of
a text corpus are lexicography (including computational lexicography) and
natural language processing, including, specifically, adaptation to different
domains and genres. For these purposes the corpus must be large (at least 100
million words), contemporary, heterogeneous, uniformly annotated and, for
use in the United States, must contain American English. The size will
ensure the adequate representation of infrequent words. The selection of
contemporary texts is important for both lexicography and NLP, particularly
in view of the significant changes in common text genres over the last few
years brought about by electronic communication. Heterogeneity ensures that
the range of language usage needed for the creation of "general language
resources" is represented, and that one can explore a wide spectrum of
language genres for NLP. Uniform annotation is paramount in any corpus and
the collection of American texts ensures that the grammatical and lexical
differences found in British English will not interfere with the classifying
of American English.

Background:
The first American text corpus that strove for this balance was the Brown
Corpus, developed by Kucera and Francis at Brown University in the 1960s. It
was the model for many corpora that followed and is still being used today.
However, it is a small corpus (one million words) and somewhat dated (the
texts are at least 30 years old). It is true that a written language changes
rather slowly over time with regard to grammar but there are changes in the
structure and there are quite frequent additions of new lexical items.

Recently, the British National Corpus (BNC) was released. It is a carefully
balanced and very large corpus (one hundred million words). It also has the
advantage of covering the period from where the Brown Corpus left off until
1993. There are, nonetheless, two distinct disadvantages for natural language
researchers and dictionary producers in the United States: (1) the corpus is,
as yet, unavailable for use outside of Europe and (2) the corpus contains
texts of British, not American, English.
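
The size argument can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes that a word's occurrences follow a Poisson distribution (a simplification, since real occurrences cluster within texts), and the frequency of 0.5 occurrences per million words and the threshold of five citations are hypothetical values chosen for illustration, not figures drawn from the Brown Corpus or the BNC. The function names are likewise invented for this example.

```python
import math

# Corpus sizes as stated in the text above.
BROWN_WORDS = 1_000_000
BNC_WORDS = 100_000_000


def expected_count(per_million: float, corpus_words: int) -> float:
    """Expected occurrences of a word with the given relative frequency."""
    return per_million * corpus_words / 1_000_000


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean); the Poisson model is an assumption."""
    below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


# Hypothetical rare word: 0.5 occurrences per million words.
RARE_WORD_PER_MILLION = 0.5
# Hypothetical number of citations a lexicographer might want to see.
MIN_CITATIONS = 5

for name, size in (("Brown-sized corpus", BROWN_WORDS), ("BNC-sized corpus", BNC_WORDS)):
    mean = expected_count(RARE_WORD_PER_MILLION, size)
    chance = prob_at_least(MIN_CITATIONS, mean)
    print(f"{name}: expected occurrences = {mean}, "
          f"P(at least {MIN_CITATIONS}) = {chance:.4f}")
```

Under these assumptions, a word that rare is expected to appear less than once in a corpus of one million words and about fifty times in a corpus of one hundred million words, which is the sense in which size ensures adequate representation of infrequent words.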

Differences between American and British English:
The grammar of American English (A.E.) varies from British English (B.E.)
quite significantly. For example, British English often makes use of a
to-infinitive complement where American English does not. In the following
examples from the BNC, "assay", "engage", "omit" and "endure" appear with a
to-infinitive complement; there were no examples found in our corpus of this
construction although the verbs themselves did appear.

Examples:

B.E. "Jerome crept to the foot of the steps, and there halted, baulked,
rather, like a startled horse, drew hard breath and ASSAYED TO MOUNT, and
then suddenly threw up his arms to cover his face, fell on his knees with a
lamentable, choking cry, and bowed himself against the stone of the steps."

B.E. "A magnate would ENGAGE TO SERVE with a specified number of men for a
particular time in return for wages which were agreed in advance and paid by
the Exchequer."

B.E. "'What did you OMIT TO TELL your priest?'"
A.E. "'What did you OMIT TELLING your priest?'"

B.E. "But Carteret's wife, who frequented health spas, could not ENDURE TO
LIVE with him or he with her: there were no children."
A.E. "But Carteret's wife, who frequented health spas, could not ENDURE
LIVING with him or he with her: there were no children."

For the first two verbs, one can argue that there is not an equivalent verbal
meaning in A.E. but, for the last two, the meaning can be paraphrased in
A.E. by the gerund.

Adverbial usage is also different. The B.E. use of sentence-initial
"immediately" as a conjunction is not allowed in A.E. For example, B.E.
"Immediately I get home, I will attend to that." is incorrect in A.E., where
we would say "As soon as I get home, I will attend to that."

Other syntactic differences include the formation of questions with the main
verb "have". In B.E., one can say, "Have you a pen?" where A.E. speakers must
use "do" ("Do you have a pen?"). Support verbs for nominalizations also
differ. Note the B.E. "take a decision" vs. the A.E. "make a decision".

With these considerable differences and the fact that lexical items may be
over- or under-represented in a British corpus, or not present at all, it is
clear that a corpus of American English is needed.

The proposed American National Corpus:
As seen above, the corpora we have been working with are inadequate and the
BNC, although meeting our standards of size and balance, does not deal with
our language. In 1998, at the first LREC conference, a proposal was made to
create an American National Corpus (ANC) much on the lines of the BNC
(Fillmore et al., 1998 [1]).

The corpus should be, as far as possible, contemporary (1990s). It should be
both static (like the BNC) and dynamic (like COBUILD). We will add regular
increments but retain the capability to return to the initial corpus as well
as the static stages between increments.

The corpus will be both balanced and heterogeneous. The collection of more
than 100 million words will make this possible. 100 million words of the ANC
should be comparable in balance to the BNC to enable cross-linguistic
studies between British and American English. There is no set definition for
what it means for a corpus to be balanced. The BNC made a principled effort
to balance its corpus (see the BNC User's Reference Guide [2] for a
breakdown of the corpus). The ANC will use this as a model. However, since it
is also desirable to provide significant components from a wide range of
styles, the remaining text will be varied rather than balanced (i.e. we will
not try for differing percentages of texts according to their representative
importance in the language but will try for smaller samples of a greater
variety of texts).

The corpus will be annotated at two levels, which serve two different user
groups. The Base Level will be annotated fully automatically for document,
paragraph, sentence, and token boundaries, with part-of-speech (POS) marking
on tokens. Level 1 will require substantial manual work and will add text
structure (titles, headers, footnotes, tables, captions, lists, etc.)
following the CES standard (Ide et al. [3]).

Progress towards the creation of the ANC:
The ANC has progressed since its genesis at LREC 98. In May of 1999 the first
ANC meeting preceded the Dictionary Society of North America (DSNA) meeting
at the University of California at Berkeley. It was attended by a number of
representatives of publishing houses. The idea of an American National
Corpus was well received and plans for a second meeting were agreed upon.

The second meeting took place at New York University. Invitees to this
meeting included not only those present at the May meeting but also
publishers from Japan and representatives from various software companies
from the U.S.
and Europe. More substantial issues were discussed including the structure
of the consortium, questions of balance in the corpus, funding, time
schedules and licensing agreements. Some questions were decided, others such
as balance and licensing were referred to committees for further discussion.

The shape of the consortium and future plans:
The licensing and base level annotation are to be done through LDC (UPenn).
UPenn will obtain licenses from text providers and provide licenses to
users. With regard to data rights, there will be multiple classes. The
expectation is that there will be some subset of the data which can be made
available under a form of general public license, and hence can be freely
redistributed under this license.

The membership agreement provides for paid memberships from commercial
organizations. These members will receive the data as soon as it is
processed and have exclusive rights to this data for a period of three
years. They are expected to make monetary as well as data contributions. The
data will be freely available to non-profit educational and research
organizations (aside from a nominal fee for licensing and distribution).

Our plan is for the base level to be paid for with consortium fees. We have a
3-year time-frame starting Jan. 2000, with 10% of the corpus deliverable by
summer 2000. Level 1 annotation, which will require external funding, will
proceed as that funding allows and may therefore lag as much as a year
behind the base level corpus. Our goal is a fully annotated Level 1 corpus
compliant with the CES standard.

References:
[1] C. Fillmore, N. Ide, D. Jurafsky, C. Macleod. An American National Corpus: A Proposal. The Proceedings of LREC, Granada, Spain, May 28-30, 1998, 965-969.
[2] L. Burnard. British National Corpus: User's Reference Guide for the British National Corpus. Oxford University Computing Service, May 1995, 13-19.
[3] N. Ide, L. Romary, P. Bonhomme. CES/XML: An XML-based Standard for Linguistic Corpora. Submitted to the Second International Language Resources and Evaluation Conference (submitted).
[4] W. N. Francis, H. Kucera. Manual of Information to Accompany 'A Standard Sample of Present-Day Edited American English, for Use with Digital Computers'. Providence, RI: Department of Linguistics, Brown University, 1964 (revised 1979).