Tailoring a formal grammar for efficiency without
compromising its linguistic motivation

Nelleke Oostdijk
University of Nijmegen
oostdijk@let.kun.nl

ACH/ALLC 1997

Editor: the secretarial staff in the Department of French Studies at
Queen's University; Greg Lessard
Encoder: Sara A. Schmidt

Keywords: syntactic corpus analysis, formal grammar, ambiguity

Introduction

Corpus linguistics can be characterised as the formalized approach to
descriptive linguistics. Its main objective is the study of actual language
use and variation therein on the basis of text corpora. A corpus is not just
any amount of textual data; rather, a corpus is a balanced collection of
language data and is constituted by samples of connected discourse, usually
in a single dialect. In its raw form, the corpus serves as a test-bed for
the linguistic hypotheses laid down in a formal grammar. Once annotated (in
accordance with the grammar), the corpus constitutes a database that may be
consulted in order to obtain information about linguistic structures, their
frequency of occurrence and distribution, as well as to gain insights into
co-occurrence restrictions (cf. e.g. Oostdijk and de Haan, 1994). In this
approach, the formal grammar plays a central role: it contains the
formalized description of the language under investigation, which is
validated in the process of annotating the corpus, and it is also used to
automatically derive the parser that is employed to annotate the corpus.

The formal grammar is originally conceived on the basis of the linguist's
intuitions and any information found in sources such as grammatical
handbooks (for English, for example, Quirk et al. 1985) and linguistic
monographs. Through a process of iterative testing on the corpus and
augmenting the description contained in it, the formal grammar is developed
until it reaches a satisfactory level of descriptive and observational
adequacy. From the point of view of the (descriptive) linguist, the added
value of the corpus linguistic approach then lies in the fact that the
description that results is explicit, exhaustive, objective and validated,
in contrast with other, traditionally informal accounts.

Current practice: an evaluation

While the annotation of corpora can serve a dual purpose, so far the creation
of databases has been given priority over the advancement of descriptive
theory. The reason for this is simply that this continues to be the most
urgent task, since corpora that have been annotated with detailed linguistic
information are still rare. Therefore, at present grammars are being
constructed for the purpose of analysing corpora that, once analysed, can be
used by the general linguistic community. It goes without saying that the
linguistic descriptions contained in these grammars should adhere to the
standards set by the discipline. In effect this means that the descriptions
must conform to a large extent to what is familiar and traditional.

The corpus linguistic approach to the linguistic annotation of corpora
described above is a rather ambitious one, as the parser should
produce for each corpus sentence (at least) the one contextually appropriate
analysis. Since the knowledge incorporated into the formal grammar is for
various reasons insufficient, overgeneration is unavoidable. This has a
negative impact on the efficiency of the analysis process. With the present
orientation towards the production of annotated corpora that can serve as
databases for further linguistic research, it is required that for each
corpus sentence the database should contain ONLY the one analysis that is
contextually appropriate. In a situation in which the parser is permitted to
overgenerate, human intervention then becomes necessary (van Halteren and
Oostdijk, 1993). This not only slows down the analysis process even further,
but also means that consistency in the analyses is no longer guaranteed (as
it would be if the analysis process were to run fully autonomously).

So far the idea has always been upheld that overgeneration is unavoidable.
The overall effects overgeneration has on the analysis process and the
quality of the output are perhaps more far-reaching than is desirable.
While it is undoubtedly true, as Aarts et al. (1996) point out, that
"consistency is mainly endangered if the human analyst takes the initiative
in the analysis process" and that therefore it is better "to have the
linguist only react to prompts given by automatic processes, by asking him
to choose from a number of possibilities presented by the machine", the
negative effects of the interaction with the human analyst in terms of loss
of consistency and efficiency remain. With regard to our short-term goal,
there is still a great demand for corpora that have been annotated with
detailed linguistic information, while the rate at which such corpora are
being produced is far too low. Our long-term goal, the advancement of
descriptive linguistic theory through iterative testing and augmenting the
description contained in the formal grammar, cannot be achieved if we do not
succeed in shortening the time span that is needed to complete a single
iteration.

Tailoring the grammar

In theory, there are two possible solutions to the problem of overgeneration.
The first solution is to resort to underspecification, a strategy which is
widely adopted both in tagging and in parsing. The portmanteau tags in
various tagsets are typical examples of underspecification. In the Penn
Treebank approach (Marcus et al., 1993), for instance, the portmanteau
part-of-speech tag IN is assigned to both prepositions and subordinating
conjunctions. Underspecification is also found at all levels of English
Constraint Grammar (ENGCG, Karlsson et al. 1995). The major drawback of
underspecification is of course the loss of information. An alternative
solution can be found in incorporating into the grammar the knowledge that
is now brought into play in the analysis process in the interaction with the
human analyst. The nature of this information is diverse and includes
knowledge about semantics, pragmatics, discourse and syntax. At this stage
it would be unrealistic to propose incorporating all these types of
knowledge. However, closer examination of the overgeneration found in the
corpus we have (syntactically) analysed yields the following picture: (1)
not all knowledge that we have as far as syntax is concerned has as yet
found its way into the grammar, and (2) the knowledge we have incorporated
in our grammar so far has not been used to the full. The two points are
obviously related. They both have to do with the fact that while
constructing the formal grammar, it was not at all clear what knowledge and
what detail were required. The construction of the formal grammar so far has
amounted to formalizing the knowledge that we were aware of and that was
deemed linguistically relevant.

The present paper reports the results of an investigation into the nature of
the overgeneration as found in the analysis results obtained in the process
of annotating a corpus of Modern British English by means of a rule-based
parser. As these results show, there is sufficient reason to believe that it
should indeed be possible to tailor the grammar for efficiency without
compromising its linguistic motivation. Moreover, the nature of (some of)
the adaptations is such that they must be considered relevant not only with
respect to the specific parser used in the experiment, but that they can
also be of importance in a broader context.

References

Aarts, J., H. van Halteren and N. Oostdijk (1996). The TOSCA analysis
system. In: C.H.A. Koster and E. Oltmans (eds.), Proceedings of the First
AGFL Workshop. Nijmegen: CSI. 181-191.

van Halteren, H. and N. Oostdijk (1993). Towards a linguistic database: the
TOSCA analysis system. In: J. Aarts, P. de Haan and N. Oostdijk (eds.),
English Language Corpora: Design, analysis and exploitation. Amsterdam -
Atlanta: Rodopi. 145-161.

Karlsson, F., A. Voutilainen, J. Heikkilä and A. Anttila (1995). Constraint
Grammar. A Language-Independent System for Parsing Unrestricted Text.
Berlin - New York: Mouton de Gruyter.

Marcus, M., B. Santorini and M.A. Marcinkiewicz (1993). Building a large
annotated corpus of English: The Penn Treebank. Computational Linguistics
19(2): 313-330.

Oostdijk, N. and P. de Haan (1994). Clause patterns in Modern British
English. A corpus-based (quantitative) study. ICAME Journal 18: 41-80.

Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985). A Comprehensive
Grammar of the English Language. London: Longman.