The Feasibility of Incremental Linguistic Annotation

Hans van Halteren
University of Nijmegen
hvh@let.kun.nl

ACH/ALLC 1997

Editor: the secretarial staff in the Department of French Studies at
Queen's University; Greg Lessard
Encoder: Sara A. Schmidt

Keywords: linguistic annotation; corpus linguistics; syntactic analysis

Abstract

This paper examines the feasibility of incremental annotation, i.e. using
existing annotation on a text as the basis for further annotation rather
than starting the new annotation from scratch. It contains a theoretical
component, describing basic methodology and potential obstacles, as well as
a practical component, describing experimental use of incremental
annotation.

Introduction

In both the linguistic and the language engineering communities it is
generally accepted that corpora are important resources and that their
usefulness increases with the presence of linguistic annotation. The added
value of annotation depends not only on what type of markup is present
(morpho-syntactic, syntactic, semantic, etc.) but also on the quality of
that markup:

- How (un)ambiguous is the markup? That is, have only the contextually
  appropriate markers been selected from among the potential ones, or
  has some of the ambiguity been retained (either explicitly or in the
  form of underspecification)?
- How consistent is the annotation? For example, is it in accordance with
  an annotation manual?
- How correct is the annotation? That is, will others agree with the
  applied markup, given the stated meaning of the markers?

When we examine the demands that linguistic and language engineering research
makes on the annotation with regard to these three points, we see that
fully automatic annotation is generally not an option. Beyond morpho-syntax
(i.e. wordclass tagging), the currently available computer software does not
contain sufficient knowledge about language to pinpoint the contextually
appropriate markup for a large enough percentage of most types of text.

This means that linguistic annotation of corpora entails human involvement
and, given the size of present day corpora, an enormous amount of it. In
recognition of the fact that the amount of work that needs to be done
usually exceeds the amount of work that can be done during a project
(for lack of manpower, funding or other resources), the international
community is promoting the reuse of corpus resources. Users are encouraged
to use annotated corpora already in existence and annotators are encouraged
to perform their annotation in such a way that reuse is possible. An
important factor in reusability is obviously standardization of annotation
practices (as far as this is feasible), a fact which has led to initiatives
such as EAGLES (cf. Calzolari and McNaught, 1996).

If the principle of reusability really works, one can imagine taking a
well-annotated corpus and adding a further layer of annotation, which of
course should itself also be reusable. This can then be repeated, leading to
a cyclic process which in the end yields a corpus which is annotated for a
very large number of aspects. We call this process incremental annotation.
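The cyclic process can be pictured as stacking annotation layers on a shared token sequence, where each new layer is derived from the layers already present rather than from the raw text. The sketch below is purely illustrative: the class, layer names, tags, and the toy chunking rule are all invented for the example, not drawn from any of the corpora or schemes discussed here.

```python
# Illustrative sketch of incremental annotation: each new layer is built
# on top of the layers already present, rather than from the raw text.
# All layer names and tag values below are invented for this example.

class AnnotatedCorpus:
    def __init__(self, tokens):
        self.tokens = tokens          # the raw text, split into tokens
        self.layers = {}              # layer name -> one marker per token

    def add_layer(self, name, markers):
        if len(markers) != len(self.tokens):
            raise ValueError("each token needs exactly one marker")
        self.layers[name] = markers

corpus = AnnotatedCorpus(["it", "rains", "today"])

# Layer 1: wordclass tags (in practice produced by a tagger plus
# manual checking and correction).
corpus.add_layer("wordclass", ["PRON", "VERB", "ADV"])

# Layer 2: built incrementally, reading layer 1 instead of starting from
# scratch; here a deliberately naive rule maps each wordclass to a chunk tag.
chunks = ["NP" if tag == "PRON" else "VP" if tag == "VERB" else "ADVP"
          for tag in corpus.layers["wordclass"]]
corpus.add_layer("chunk", chunks)

print(corpus.layers["chunk"])   # ['NP', 'VP', 'ADVP']
```

The point of the sketch is only that layer 2 consults layer 1: whether such reuse actually saves work depends on the quality and compatibility issues discussed below.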

Deliberations

Incremental annotation seems to be the ideal solution for a widespread
problem: researchers can produce the data they need with much less work. In
practice, unfortunately, there are still some obstacles to overcome. When
somebody wants to add a new layer of annotation to an already annotated
corpus, the question is always to what degree the existing annotation is of
any real use. Two properties of the existing annotation are most decisive:
quality (i.e. (un)ambiguity, consistency and correctness) and compatibility
with the projected new annotation.

The importance of quality is obvious: if the existing annotation cannot be
trusted, checking and correcting it may be as much work as starting from
scratch. Furthermore, since quality is extremely complicated to measure, it
often really is a question of trust. It would be good if all definitions of
annotation standards also included a clear-cut description of a
procedure to measure the quality of an annotated corpus which uses that
standard. Until such measurements become available, anyone planning to reuse
an annotated corpus had better take some random samples from it and decide
for themselves whether the quality is sufficiently high.

A high-quality annotated corpus is no guarantee of unproblematic reuse,
though. Even unambiguous, consistent and correct annotation is only useful
if it provides the kind of information which is needed for the new layer.
Insufficient information can, of course, be supplemented (cf. Black, 1994),
but contradictory information tends to be more of a problem. For example, the
Lancaster Treebankers always mark the word "it" as a noun phrase, but this
may lead to problems if the new annotation is supposed to describe an
anticipatory "it" as a syntactic marker. Compatibility is as hard to measure
as quality, maybe even harder (cf. Atwell et al., 1994). Incompatibilities
between annotation schemes are often found at a level of detail which goes
beyond superficial documentation and are usually highly context dependent.
As a result, only outright incompatibility can be recognized easily and
quickly, whereas partial incompatibility will only be noticed after
substantial work has already been done.

The final complication in judging the usefulness of an existing annotation is
that quality and compatibility are not independent. It is here that the
difference between correctness and consistency becomes relevant. If the
existing annotation has to be adapted to be useful, it may be more important
that it is consistent than that it is correct. In order for the existing
annotation to be useful, adaptations should preferably be made automatically
and this is very difficult if there is a high level of inconsistency.

Methodology

The deliberations above may well appear to stress potential problems for
incremental annotation over potential gains. If this is so, it is because we
feel the gains are already obvious. We certainly do not want to give the
impression that incremental annotation is a hopeless cause and should not
even be attempted. However, we do want to temper the unbridled optimism that
tends to accompany references to the reusability principle. The choice to
commit oneself to incremental annotation should be made only after an
increase in efficiency and/or quality for the new annotation has been
demonstrated. The feasibility of such an increase depends to a large extent
on the way in which the incremental annotation is implemented. In general,
we can distinguish two methodologically different approaches to incremental
annotation: the planned and the opportunistic approach.

In the planned approach, all layers of annotation are designed to be
compatible (which includes being sufficiently consistent and correct). This
will usually mean that more work has to be put into layer X in order
to make it compatible with layers X+1, X+2, etc., but the extra work is amply
paid back by the decrease in work for those later layers. Obviously, the planned
approach can only be used (fully) if one starts out with a raw corpus.
Furthermore, there should be a certain amount of confidence that all layers
of annotation will eventually be applied as planned, since otherwise the
extra effort for the initial layers may be lost. Such confidence can be
boosted by making the annotation design into a standard, but for the time
being such cross-layer standards are not to be expected, given the lack of
consensus for most types of linguistic annotation.

The opportunistic approach is less structured. Its basic tenet is that any
existing annotation can be useful. Following the opportunistic approach
means looking for the most promising data available and using that as a
starting point. After the data has been located, there are two ways of using
it. One could design the new annotation layer to be compatible with the
existing annotation, in effect a post hoc planned approach. Usually,
however, one will already have one's own ideas about what the new annotation
should look like. These ideas tend to imply specific requirements for the
existing annotation, which will then have to be adapted, corrected and
extended in order to serve as the foundation for the new annotation layer.
As already indicated above, such reuse can lead to a tremendous gain over
annotation from scratch, but it can equally well lead to complete disaster.

Experimentation

In order to illustrate the difference between the two approaches, we have
performed an experiment in which parts of the Spoken English Corpus (MARSEC;
cf. Arnfield, 1996 and UCREL, 1996) are annotated with TOSCA/ICE syntactic
analysis trees. The planned approach is represented by the use of the
traditional TOSCA analysis system (cf. van Halteren and Oostdijk, 1993) for
this material. The opportunistic approach is represented by the use of an
adapted and extended version of that same analysis system which takes the
Lancaster Treebank analyses (cf. Leech and Garside, 1991) of the same
portion of MARSEC as input.

This paper describes the activities involved in the adaptation, examines the
experiences with both approaches and evaluates whether the use of the
Treebank data as the starting point for the analysis indeed leads to a gain
over the traditional method.

References

Arnfield, S. (1996). MARSEC: The Machine Readable Spoken English Corpus.

Atwell, E., Hughes, J. and Souter, C. (1994). AMALGAM: Automatic Mapping
Among Lexico-Grammatical Annotation Models. In J. Klavans (ed.), Proceedings
of the ACL Workshop on The Balancing Act: Combining Symbolic and Statistical
Approaches to Language. New Jersey: ACL.

Black, E. (1994). An experiment in customizing the Lancaster Treebank. In
N. Oostdijk and P. de Haan (eds), Corpus-based research into language.
Amsterdam/Atlanta: Rodopi.

Calzolari, N. and McNaught, J. (1996). EAGLES Editor's Introduction
(EAG-EB-FR1).

van Halteren, H. and Oostdijk, N. (1993). Towards a syntactic database: the
TOSCA analysis system. In J. Aarts, P. de Haan and N. Oostdijk (eds),
English Language Corpora: design, analysis and exploitation.
Amsterdam/Atlanta: Rodopi.

Leech, G. and Garside, R. (1991). Running a grammar factory: The production
of syntactically analysed corpora or "treebanks". In S. Johansson and
A. Stenström (eds), English Computer Corpora. Berlin/New York: Mouton de
Gruyter.

UCREL (1996). UCREL Projects: The Machine Readable Spoken English Corpus.