Combining corpus and experimental data: methodological considerations

Inge de Mönnink
University of Nijmegen
I.deMonnink@let.kun.nl

ACH/ALLC 1997

Editor: the secretarial staff in the Department of French Studies at Queen's University; Greg Lessard
Encoder: Sara A. Schmidt

Keywords: corpus-based research, elicitation, methodology

Over the last three decades the study of language use on the basis of corpus data
has been re-established in linguistics, and corpora are now also widely used in
fields such as speech research, sociolinguistics and lexical studies. So far,
little attention has been given to the
methodological issues that arise with respect to the use of corpus data in any
of these (research) contexts. Corpora are exploited for both quantitative
(absolute and relative frequency of occurrence) and qualitative (distribution)
information. While certain issues are being addressed, such as how large
corpora and samples should (ideally) be, other, related and certainly no less
important issues are not being considered at all. One of these issues is the
appropriateness of the use of corpora for the study of relatively infrequent
phenomena. In this context questions must be raised such as whether and if so
how corpus data can or must be supplemented with experimental data.

A good example of what appears to be an infrequent phenomenon is the variation in
the constituent structure of the noun phrase (de Mönnink, 1996). In handbooks on
English grammar the noun phrase has been described as comprising an optional
determiner followed by zero or more premodifying elements, an obligatory head,
and zero or more postmodifying elements. However, as previous research shows,
there are several types of noun phrase that do not conform to this basic
pattern. Examples are:

1. shifted premodification: the premodifier occurs before the determiner;
2. discontinuous modification: two constituents which are intuitively felt to
   belong together are split up into two non-adjacent parts, the first preceding
   the head, the second following it;
3. floating postmodification: the postmodifier is not adjacent to the other
   constituents of the noun phrase it modifies.

Corpus-based studies generally comprise a qualitative and quantitative analysis.
The qualitative analysis aims at a detailed description of the phenomenon under
study. The quantitative analysis gives a precise picture of absolute and
relative (in)frequency of occurrence of the particular phenomenon. For the
qualitative analysis no generally accepted methodology exists. Which methods are
used depends on the phenomenon that is being studied, the findings of earlier
studies, hypotheses formulated by the researcher and (the nature of the
annotation in) the available corpus data. For the quantitative analysis, use is
made of generally accepted statistical methods. However, for some statistical
tests to be reliable a specific minimum frequency is needed (see e.g. de Haan,
1992). Thus, the quantitative analysis is particularly suitable for frequent
phenomena and less so (or perhaps not at all) for infrequent phenomena.

This finding is confirmed by studies on the representativeness of corpora. It has
been observed (Biber, 1990, 1993; de Haan, 1992) that for infrequent structures
to be fully represented in the corpus, samples have to be large. It is, however,
difficult to predict beforehand how large a sample has to be. In turn, if the
corpus is to contain various text types so that linguistic variation among texts
can be identified, the corpus as a whole has to be very large. In other words, for the
(quantitative) study of an infrequent phenomenon by means of a corpus that is
representative of the population under study, the corpus has to be sufficiently
large. How large exactly depends on the actual frequency of the phenomenon.

Table 1 gives the frequency of occurrence of (types of) NPs in a corpus of some
hundred and forty thousand words. The corpus contains four text types: fiction,
non-fiction, drama and spoken material. For a chi-square test to be reliable the
number of expected observations of a single variable cannot be lower than five.
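
As a rough illustration of this threshold, the following Python sketch estimates how large a sample must be for the expected number of occurrences to reach five. It assumes that a phenomenon occurs at a constant rate per word, which is a simplification. The function name `min_sample_size` is hypothetical, and the example counts (11, 7 and 8 occurrences in the 21,558-word fiction sample BW) are the figures as read from Table 1.

```python
import math

MIN_EXPECTED = 5  # threshold named in the text for a reliable chi-square test


def min_sample_size(occurrences: int, sample_words: int,
                    min_expected: int = MIN_EXPECTED) -> int:
    """Smallest sample size (in words) whose expected count reaches min_expected.

    Assumes a constant rate of occurrence per word, estimated from one sample.
    """
    if occurrences <= 0:
        raise ValueError("cannot estimate a rate from zero occurrences")
    # expected count = size * occurrences / sample_words >= min_expected
    return math.ceil(min_expected * sample_words / occurrences)


# Illustrative input: fiction sample BW, counts as read from Table 1 (assumption)
bw_words = 21_558
bw_counts = {
    "floating postmodification": 11,
    "discontinuous modification": 7,
    "shifted premodification": 8,
}

for np_type, count in bw_counts.items():
    print(f"{np_type}: about {min_sample_size(count, bw_words):,} words")
```

Under these assumptions the estimate for floating postmodification comes to just under 10,000 words. An estimate based on a single sample is of course only indicative, since rates differ considerably between samples and text types.
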
For the three types of NP described above we see that only 20,000-word fiction
samples have a fair chance of containing enough occurrences of all three types.
For floating noun phrase postmodification 10,000-word samples are generally big
enough, for discontinuous modification 20,000-word samples, and for shifted
premodification 30,000-word samples, or even bigger samples for non-fiction and
drama.

Table 1: Number of occurrences of (types of) NPs in different genres

Genre        Sample  Number of words  Number of NPs  Floating postm.  Discont. modifier  Shifted prem.
Fiction      BW      21,558           6318           11               7                  8
             MR      20,266           6382           8                2                  1
             CC      20,011           6136           34               6                  5
Non-fiction  CB      19,368           5788           26               11                 2
             CM      10,581           3282           5                1                  -
Drama        SI      14,022           4629           11               2                  1
             NC      5,642            1866           1                1                  -
Spoken       Sp1     14,919           4548           17               2                  2
             Sp2     15,938           5182           14               3                  2

So far there appears to be a simple solution to the use of corpora for the study of
infrequent phenomena: simply increase the corpus size. For a simple quantitative
analysis this would indeed be adequate. However, a quantitative analysis lacks
the descriptive richness of the nature of structures which a qualitative
analysis can provide. A qualitative analysis, on the other hand, can only give
subjective judgments about currency or rarity. For a corpus-based study of a
relatively infrequent phenomenon that wants to take into account both the
qualitative and the quantitative aspects, a major problem is constituted by the
fact that while large corpora are required, the detailed annotation of such
corpora is not feasible. Given the present state of the art, the detailed
annotation of corpora still requires a vast amount of manual work. Hand-annotation
is so time-consuming and so prone to inconsistencies that the corpus necessarily
has to remain small. This demand is directly opposed to the demand for large
corpora from quantitative approaches that take an interest primarily in the
quantitative information a corpus provides. Thus for the study of the nature and
frequency of a relatively infrequent phenomenon more is needed than the
combination of a qualitative and a quantitative analysis of corpus data.
Additional data has to be considered.

In the past, elicitation data have been used to supplement corpus data (e.g.
Quirk and Svartvik, 1966, 1979; Greenbaum, 1970, 1973, 1984). When the
experiment is designed with care, elicitation including both performance and
judgment tests can form an important source to supplement corpus data. In de
Mönnink (forthcoming) I argue that the combination of corpus and experimental
data forms a valuable contribution to the description of language use. If a
phenomenon is too infrequent to be subjected to a corpus-based study alone,
elicitation tests enable the linguist to supplement his data, not only with the
native speakers' judgements on the general acceptability of structures, but also
with additional structures, either expected because they were predicted by the
grammar, or intuitively considered probable, or unexpected yet acceptable.

Although experimental data have been combined with corpus data before, no
attention has so far been paid to the problems of combining these two in essence
very different approaches to gathering data. Each has its own methodology for
collecting, classifying, analysing and reporting the data in a systematic way.
While in de Mönnink (1996) I discussed the design of elicitation
experiments that can be used to supplement corpus data, in this paper I
discuss ways of combining the two approaches with respect to data classification
and analysis. I argue that this combination is not simply a matter of
integrating statistical outputs, but that it influences both methodologies in
such a radical way that it leads to an entirely new methodology for a
multi-method approach. I illustrate my findings with the study of non-regular
noun phrases.

References

Biber, D. (1990). Methodological Issues Regarding Corpus-based Analysis of Linguistic Variation. Literary and Linguistic Computing 5: 257-69.

Biber, D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing 8: 243-57.

Greenbaum, S. (1973). Informant Elicitation of Data on Syntactic Variation. Lingua 31: 201-212.

Greenbaum, S. (1984). Corpus Analysis and Elicitation Tests. In J. Aarts and W. Meijs (eds.), Corpus Linguistics. Recent Developments in the Use of Computer Corpora in English Language Research. Amsterdam: Rodopi, 193-201.

Greenbaum, S. and R. Quirk (1970). Elicitation Experiments in English: Linguistic Studies in Use and Attitude. London: Longman.

de Haan, P. (1992). The Optimum Corpus Sample Size? In G. Leitner (ed.), New Directions in English Language Corpora. Berlin: Mouton de Gruyter, 3-19.

de Mönnink, I. (1996). A First Approach to the Mobility of Noun Phrase Constituents. In C. Percy, F. Meyer and I. Lancashire (eds.), Synchronic Corpus Linguistics. Papers from the sixteenth International Conference on English Language Research on Computerized Corpora (ICAME 16). Amsterdam: Rodopi, 143-57.

de Mönnink, I. (forthcoming). Using Corpus and Experimental Data: a Multi-method Approach. In M. Ljung (ed.), Papers from the seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17).

Quirk, R. and J. Svartvik (1966). Investigating Linguistic Acceptability. The Hague: Mouton.

Quirk, R. and J. Svartvik (1979). A Corpus of Modern English. In H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft. Aufbau und Auswertung von Textcorpora. Koenigstein: Scriptor, 204-218.