Machine Learning Support for Evaluation and Quality Control

Hans van Halteren, University of Nijmegen, The Netherlands

ALLC/ACH 2000, University of Glasgow, Glasgow

Computational / Corpus Linguistics

Annotated material which is to be evaluated and possibly upgraded is used as
training and test data for a machine learning system. The portion of the
material for which the output of the machine learning system disagrees with
the human annotation is then examined in detail. This portion is shown to
contain a higher percentage of annotation errors than the material as a
whole, and hence to be a suitable subset for limited quality improvement. In
addition, the types of disagreement may identify the main inconsistencies in
the annotation so that these can then be investigated systematically.

Background

In many humanities projects today, we see that large textual resources are
manually annotated with markup symbols, as these are deemed necessary for
efficient future research with those resources. The reason that the
annotation is applied manually is that there is, for the time being, no
automatic procedure which can apply the annotation with an acceptable degree
of correctness, typically because the annotation requires detailed knowledge
of language or even of the world to which the resources refer.

The choice of human annotators may be unavoidable, but it is also one which
has a severe disadvantage. Human annotators cannot sustain the concentration
needed for correct annotation over the long stretches of time required to
annotate such enormous quantities of data (cf. e.g. Marcus et al. 1993;
Baker 1997). Loss of concentration, even if only partial and
temporary, is bound to lead to a loss of correctness in the annotation.
Awareness of this problem has led to the use of quality control procedures
in large scale annotation projects. Such procedures generally consist of
spot checks by more experienced annotators or double blind annotation of a
percentage of the material. The lessons learned from such checks lead to
additional instruction of the annotators, and, if the observed errors are
systematic and/or severe enough, to correction of previously annotated
material. Even with excellent quality control measures during annotation,
though, it is likely that the end result will not be fully correct, and the
measure of correctness can, at most, be estimated from the observations made
in quality control. Obviously, it would be enormously helpful if there were
automatic procedures to support large scale evaluation and upgrade of
annotated material.

Methodology

Unfortunately, as mentioned above, automatic procedures are currently unable
to deal with natural language to a sufficient degree to correctly apply most
types of annotation. However, although automatic procedures cannot provide
correctness, they are undoubtedly well-equipped to provide consistency. Now
consistency and correctness are not the same, but both are desirable
qualities and, unlike other pairs of desirable qualities such as high
precision and recall, they are not in opposition. Complete correctness is
bound to be consistent at some level of reference and complete consistency
at a sufficiently deep level of reference is bound to be correct. More
practically, a highly correct annotation can be assumed to agree most of the
time with a highly consistent annotation, which means that disagreement
between the two will tend to indicate instances with a high likelihood of
error.

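In practical terms, the comparison step amounts to collecting the instances
where the two annotations differ. The sketch below illustrates this under the
assumption of token-aligned tag sequences; the function and data names are
illustrative and not part of the published method.

```python
# A minimal sketch: given token-aligned human and automatic annotations,
# report overall agreement and the positions that deserve manual rechecking.
# The data layout and names are illustrative assumptions.

def disagreement_report(tokens, human_tags, system_tags):
    flagged = [(i, tok, human, system)
               for i, (tok, human, system)
               in enumerate(zip(tokens, human_tags, system_tags))
               if human != system]
    agreement = 1.0 - len(flagged) / len(tokens)
    return agreement, flagged
```

Applied to the example discussed next, such a report would, among other
things, surface the benchmark's wavering over a word like "about".
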
An example is provided by Van Halteren et al. (Forthcoming). One of the
wordclass taggers constructed there is trained and tested on Wall
Street Journal material tagged with the Penn Treebank tagset. In
comparison with the benchmark, the tagger provides the same tag in 97.23% of
the cases. When the disagreements are checked manually for 1% of the corpus,
it turns out that out of 349 disagreements, 97 are in fact errors in the
benchmark. Unless this is an unfortunate coincidence, it would mean that we
can remove about 10,000 errors by checking fewer than 40,000 words, a much
less formidable task than checking the whole 1Mw corpus. In addition, the
cases where the tagger is wrong appear to be caused in 44% by
inconsistencies in the training data, e.g. the word "about" in "about 20" or
"about $20" is tagged as a preposition 648 times and as an adverb 612 times.
Such observations are slightly harder to use systematically, but can again
serve to adjust inconsistent and/or incorrect annotation.

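The projection behind these figures can be made explicit with a small
calculation. This is a sketch only: it assumes, as stated above, a corpus of
roughly one million words and that the manually checked 1% sample is
representative of the whole.

```python
# Rough projection behind the figures quoted above (illustrative assumptions).

corpus_size = 1_000_000          # Wall Street Journal material, ~1Mw
sample_fraction = 0.01           # 1% of the corpus checked by hand
disagreements_in_sample = 349    # tagger vs. benchmark disagreements found
benchmark_errors_in_sample = 97  # of these, actual errors in the benchmark

scale = 1 / sample_fraction
words_to_check = disagreements_in_sample * scale      # ~35,000 words
errors_removed = benchmark_errors_in_sample * scale   # ~9,700 errors

print(f"Check ~{words_to_check:,.0f} of {corpus_size:,} words "
      f"and remove ~{errors_removed:,.0f} benchmark errors.")
```
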
In principle, the use of such a comparison methodology is not limited to
wordclass tagging. Any annotation task which can be expressed as
classification on the basis of a (preferably small) number of information
units (e.g. for wordclass tagging the information units could be the word,
two disambiguated preceding classes and two undisambiguated following
classes) can be handled by a machine learning system. Such a
system attempts to identify regularities in the relation between the set of
information units and the assigned class, and uses these regularities to
classify previously unseen
cases (cf. e.g. Langley 1996; Carbonell 1990). Several machine learning
systems are freely available for research purposes, e.g. the memory-based
learning system TiMBL and the decision
tree system C5.0. If we have a
machine learning system and if we can translate the annotation task into a
classification task, we can train the system on the annotated material and
then compare the system's output with the human annotation. The instances
where the two disagree can then (a) be used as prime candidates for
rechecking correctness and (b) point to systematic inconsistencies to be
reconsidered.

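As an illustration of what such a translation into a classification task
might look like for wordclass tagging, the sketch below builds the
information units mentioned above (the word, two preceding tags and two
following, still ambiguous, classes), trains a learner on the human
annotation and returns the instances where learner and annotator disagree.
The scikit-learn decision tree is used purely as a stand-in for systems such
as TiMBL or C5.0, and the data layout and helper names are assumptions.

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def ambiguity_classes(tokens, tags):
    """Map each word form to the set of tags it receives anywhere in the
    data (its 'undisambiguated' class)."""
    seen = defaultdict(set)
    for word, tag in zip(tokens, tags):
        seen[word].add(tag)
    return {word: "/".join(sorted(ts)) for word, ts in seen.items()}

def featurise(tokens, tags, amb):
    """One feature dict per token: the word itself, the two preceding
    (disambiguated) tags and the two following (undisambiguated) classes."""
    pad = "<NONE>"
    feats = []
    for i, word in enumerate(tokens):
        feats.append({
            "word": word,
            "tag-2": tags[i - 2] if i >= 2 else pad,
            "tag-1": tags[i - 1] if i >= 1 else pad,
            "amb+1": amb[tokens[i + 1]] if i + 1 < len(tokens) else pad,
            "amb+2": amb[tokens[i + 2]] if i + 2 < len(tokens) else pad,
        })
    return feats

def flag_disagreements(tokens, human_tags):
    """Train on the human annotation, re-tag the same material, and return
    the instances where learner and human annotation disagree."""
    amb = ambiguity_classes(tokens, human_tags)
    vectoriser = DictVectorizer()
    X = vectoriser.fit_transform(featurise(tokens, human_tags, amb))
    learner = DecisionTreeClassifier().fit(X, human_tags)
    predicted = learner.predict(X)
    return [(i, tokens[i], human_tags[i], pred)
            for i, pred in enumerate(predicted) if pred != human_tags[i]]
```

When the learner is trained on the full material, as in this sketch, the
remaining disagreements are essentially the cases where identical information
units received different tags, i.e. the inconsistencies discussed above;
whether full training or cross-validation is preferable is one of the
questions listed below.
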
Overview of the Paper

Using various types of annotated material and machine learning systems, this
paper will attempt to answer the following questions:

- For which types of annotation is this method useful?
- How does the error rate in the 'highlighted' portion of the material
  compare to the overall error rate?
- At which levels of correctness of the annotation is the method useful?
- Are some machine learning systems better than others for the purpose at
  hand?
- Can we benefit from the fact that we have more than one system at our
  disposal and, if so, how?
- Should we use the full material in the training phase or is it better to
  use cross-validation? (One possible set-up for the latter is sketched
  below.)

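One possible set-up for the cross-validation variant mentioned in the last
question is sketched below: each portion of the material is re-annotated by a
model trained on the remaining portions, so that disagreements cannot be
hidden by the learner simply memorising the training data. The feature
dictionaries are assumed to be built as in the earlier sketch, and
scikit-learn again only stands in for the learners named above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def crossvalidated_disagreements(feature_dicts, human_tags, folds=10):
    """Flag disagreements using k-fold cross-validation: every token is
    re-tagged by a model that did not see it during training."""
    vectoriser = DictVectorizer()
    X = vectoriser.fit_transform(feature_dicts)
    flagged = []
    for train_idx, test_idx in KFold(n_splits=folds).split(feature_dicts):
        y_train = [human_tags[i] for i in train_idx]
        model = DecisionTreeClassifier().fit(X[train_idx], y_train)
        for i, pred in zip(test_idx, model.predict(X[test_idx])):
            if pred != human_tags[i]:
                flagged.append((int(i), human_tags[i], pred))
    return sorted(flagged)
```
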
References

Baker, J.P. (1997). Consistency and Accuracy in Correcting Automatically
Tagged Data. In R. Garside, G. Leech and A.P. McEnery (eds.), Corpus
Annotation. London: Addison Wesley Longman, 243-250.

Carbonell, J. (1990). Machine Learning: Paradigms and Methods. Cambridge, MA:
MIT Press.

van Halteren, H., Daelemans, W. and Zavrel, J. (Forthcoming). Improving
Accuracy in NLP through Combination of Machine Learning Systems.
Computational Linguistics 27(2).

Langley, P. (1996). Elements of Machine Learning. Los Altos, CA: Morgan
Kaufmann.

Marcus, M., Santorini, B. and Marcinkiewicz, M. (1993). Building a Large
Annotated Corpus of English: the Penn Treebank. Computational Linguistics
19(2), 313-330.