Machine Learning Support for Evaluation and Quality Control

Hans van Halteren, University of Nijmegen, The Netherlands

ALLC/ACH 2000, University of Glasgow, Glasgow

Computational / Corpus Linguistics

Annotated material which is to be evaluated and possibly upgraded is used as
training and test data for a machine learning system. The portion of the
material for which the output of the machine learning system disagrees with
the human annotation is then examined in detail. This portion is shown to
contain a higher percentage of annotation errors than the material as a
whole, and hence to be a suitable subset for limited quality improvement. In
addition, the types of disagreement may identify the main inconsistencies in
the annotation so that these can then be investigated systematically.

Background

In many humanities projects today, we see that large textual resources are
manually annotated with markup symbols, as these are deemed necessary for
efficient future research with those resources. The reason that the
annotation is applied manually is that there is, for the time being, no
automatic procedure which can apply the annotation with an acceptable degree
of correctness, typically because the annotation requires detailed knowledge
of language or even of the world to which the resources refer.

The choice of human annotators may be unavoidable, but it is also one which
has a severe disadvantage. Human annotators cannot sustain the concentration
needed for correct annotation over the long stretches of time required to
annotate such enormous quantities of data (cf. e.g. Marcus et al. 1993;
Baker 1997). Loss of concentration, even if only partial and
temporary, is bound to lead to a loss of correctness in the annotation.
Awareness of this problem has led to the use of quality control procedures
in large scale annotation projects. Such procedures generally consist of
spot checks by more experienced annotators or double blind annotation of a
percentage of the material. The lessons learned from such checks lead to
additional instruction of the annotators, and, if the observed errors are
systematic and/or severe enough, to correction of previously annotated
material. Even with excellent quality control measures during annotation,
though, it is likely that the end result will not be fully correct, and the
measure of correctness can, at most, be estimated from the observations made
in quality control. Obviously, it would be enormously helpful if there were
automatic procedures to support large scale evaluation and upgrade of
annotated material.

Methodology

Unfortunately, as mentioned above, automatic procedures are currently unable
to deal with natural language to a sufficient degree to correctly apply most
types of annotation. However, although automatic procedures cannot provide
correctness, they are undoubtedly well-equipped to provide consistency. Now
consistency and correctness are not the same, but both are desirable
qualities and, unlike other pairs of desirable qualities such as high
precision and recall, they are not in opposition. Complete correctness is
bound to be consistent at some level of reference and complete consistency
at a sufficiently deep level of reference is bound to be correct. More
practically, a highly correct annotation can be assumed to agree most of the
time with a highly consistent annotation, which means that disagreement
between the two will tend to indicate instances with a high likelihood of
error.

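In practical terms, the comparison step amounts to collecting the instances
where the two annotations differ. The sketch below illustrates this under the
assumption of token-aligned tag sequences; the function and data names are
illustrative and not part of the published method.

```python
# A minimal sketch: given token-aligned human and automatic annotations,
# report overall agreement and the positions that deserve manual rechecking.
# The data layout and names are illustrative assumptions.

def disagreement_report(tokens, human_tags, system_tags):
    flagged = [(i, tok, human, system)
               for i, (tok, human, system)
               in enumerate(zip(tokens, human_tags, system_tags))
               if human != system]
    agreement = 1.0 - len(flagged) / len(tokens)
    return agreement, flagged
```

Applied to the example discussed next, such a report would, among other
things, surface the benchmark's wavering over a word like "about".
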
An example is provided by Van Halteren et al. (Forthcoming). One of the
wordclass taggers constructed there is trained and tested on Wall
Street Journal material tagged with the Penn Treebank tagset. In
comparison with the benchmark, the tagger provides the same tag in 97.23% of
the cases. When the disagreements are checked manually for 1% of the corpus,
it turns out that out of 349 disagreements, 97 are in fact errors in the
benchmark. Unless this is an unfortunate coincidence, it would mean that we
can remove about 10,000 errors by checking fewer than 40,000 words, a much
less formidable task than checking the whole 1Mw corpus. In addition, the
cases where the tagger is wrong appear to be caused in 44% by
inconsistencies in the training data, e.g. the word "about" in "about 20" or
"about $20" is tagged as a preposition 648 times and as an adverb 612 times.
Such observations are slightly harder to use systematically, but can again
serve to adjust inconsistent and/or incorrect annotation.

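The projection behind these figures can be made explicit with a small
calculation. This is a sketch only: it assumes, as stated above, a corpus of
roughly one million words and that the manually checked 1% sample is
representative of the whole.

```python
# Rough projection behind the figures quoted above (illustrative assumptions).

corpus_size = 1_000_000          # Wall Street Journal material, ~1Mw
sample_fraction = 0.01           # 1% of the corpus checked by hand
disagreements_in_sample = 349    # tagger vs. benchmark disagreements found
benchmark_errors_in_sample = 97  # of these, actual errors in the benchmark

scale = 1 / sample_fraction
words_to_check = disagreements_in_sample * scale      # ~35,000 words
errors_removed = benchmark_errors_in_sample * scale   # ~9,700 errors

print(f"Check ~{words_to_check:,.0f} of {corpus_size:,} words "
      f"and remove ~{errors_removed:,.0f} benchmark errors.")
```
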
In principle, the use of such a comparison methodology is not limited to
wordclass tagging. Any annotation task which can be expressed as
classification on the basis of a (preferably small) number of information
units (e.g. for wordclass tagging the information units could be the word,
two disambiguated preceding classes and two undisambiguated following
classes) can be handled by a machine learning system. Such a
system attempts to identify regularities in the relation between the set of
information units and the assigned class, and uses these regularities to
classify previously unseen
cases (cf. e.g. Langley 1996; Carbonell 1990). Several machine learning
systems are freely available for research purposes, e.g. the memory-based
learning system TiMBL and the decision
tree system C5.0. If we have a
machine learning system and if we can translate the annotation task into a
classification task, we can train the system on the annotated material and
then compare the system's output with the human annotation. The instances
where the two disagree can then (a) be used as prime candidates for
rechecking correctness and (b) point to systematic inconsistencies to be
reconsidered.

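As an illustration of what such a translation into a classification task
might look like for wordclass tagging, the sketch below builds the
information units mentioned above (the word, two preceding tags and two
following, still ambiguous, classes), trains a learner on the human
annotation and returns the instances where learner and annotator disagree.
The scikit-learn decision tree is used purely as a stand-in for systems such
as TiMBL or C5.0, and the data layout and helper names are assumptions.

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def ambiguity_classes(tokens, tags):
    """Map each word form to the set of tags it receives anywhere in the
    data (its 'undisambiguated' class)."""
    seen = defaultdict(set)
    for word, tag in zip(tokens, tags):
        seen[word].add(tag)
    return {word: "/".join(sorted(ts)) for word, ts in seen.items()}

def featurise(tokens, tags, amb):
    """One feature dict per token: the word itself, the two preceding
    (disambiguated) tags and the two following (undisambiguated) classes."""
    pad = "<NONE>"
    feats = []
    for i, word in enumerate(tokens):
        feats.append({
            "word": word,
            "tag-2": tags[i - 2] if i >= 2 else pad,
            "tag-1": tags[i - 1] if i >= 1 else pad,
            "amb+1": amb[tokens[i + 1]] if i + 1 < len(tokens) else pad,
            "amb+2": amb[tokens[i + 2]] if i + 2 < len(tokens) else pad,
        })
    return feats

def flag_disagreements(tokens, human_tags):
    """Train on the human annotation, re-tag the same material, and return
    the instances where learner and human annotation disagree."""
    amb = ambiguity_classes(tokens, human_tags)
    vectoriser = DictVectorizer()
    X = vectoriser.fit_transform(featurise(tokens, human_tags, amb))
    learner = DecisionTreeClassifier().fit(X, human_tags)
    predicted = learner.predict(X)
    return [(i, tokens[i], human_tags[i], pred)
            for i, pred in enumerate(predicted) if pred != human_tags[i]]
```

When the learner is trained on the full material, as in this sketch, the
remaining disagreements are essentially the cases where identical information
units received different tags, i.e. the inconsistencies discussed above;
whether full training or cross-validation is preferable is one of the
questions listed below.
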
Overview of the Paper

Using various types of annotated material and machine learning systems, this
paper will attempt to answer the following questions:

- For which types of annotation is this method useful?
- How does the error rate in the 'highlighted' portion of the material
  compare to the overall error rate?
- At which levels of correctness of the annotation is the method useful?
- Are some machine learning systems better than others for the purpose at
  hand?
- Can we benefit from the fact that we have more than one system at our
  disposal and, if so, how?
- Should we use the full material in the training phase or is it better to
  use cross-validation? (One possible set-up for the latter is sketched
  below.)

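One possible set-up for the cross-validation variant mentioned in the last
question is sketched below: each portion of the material is re-annotated by a
model trained on the remaining portions, so that disagreements cannot be
hidden by the learner simply memorising the training data. The feature
dictionaries are assumed to be built as in the earlier sketch, and
scikit-learn again only stands in for the learners named above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def crossvalidated_disagreements(feature_dicts, human_tags, folds=10):
    """Flag disagreements using k-fold cross-validation: every token is
    re-tagged by a model that did not see it during training."""
    vectoriser = DictVectorizer()
    X = vectoriser.fit_transform(feature_dicts)
    flagged = []
    for train_idx, test_idx in KFold(n_splits=folds).split(feature_dicts):
        y_train = [human_tags[i] for i in train_idx]
        model = DecisionTreeClassifier().fit(X[train_idx], y_train)
        for i, pred in zip(test_idx, model.predict(X[test_idx])):
            if pred != human_tags[i]:
                flagged.append((int(i), human_tags[i], pred))
    return sorted(flagged)
```
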
References

Baker, J.P. (1997). Consistency and Accuracy in Correcting Automatically
Tagged Data. In R. Garside, G. Leech and A.P. McEnery (eds.), Corpus
Annotation. London: Addison Wesley Longman, 243-250.

Carbonell, J. (1990). Machine Learning: Paradigms and Methods. Cambridge, MA:
MIT Press.

van Halteren, H., Daelemans, W. and Zavrel, J. (Forthcoming). Improving
Accuracy in NLP through Combination of Machine Learning Systems.
Computational Linguistics 27(2).

Langley, P. (1996). Elements of Machine Learning. Los Altos, CA: Morgan
Kaufmann.

Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993). Building a Large
Annotated Corpus of English: the Penn Treebank. Computational Linguistics
19(2), 313-330.