Short substrings as document discriminators: An empirical study

Richard Forsyth
University of the West of England
rs-forsyth@csm.uwe.ac.uk

ACH/ALLC 1997

Keywords: Monte-Carlo methods, stylometry, text categorization

1. Introduction

In seeking to capture consistent and distinctive features of linguistic
style, stylometrists have used a variety of textual indicators (see: Holmes,
1994). In the majority of stylometric studies, however, the choice of which
indicators to use in a given problem is left to the discretion of the
investigator. This practice permits the exercise of human judgement, and
thus can sometimes save a time-consuming search for suitable descriptors;
but it inevitably involves subjectivity. Very often the choice of suitable
linguistic markers is crucial to the development of an effective
discriminant rule; but, being subjective, it may not be replicable on
another problem.

The situation is similar in the related fields of multivariate pattern
recognition and machine learning (Everitt & Dunn, 1991; Quinlan,
1993): most studies begin by presuming that a suitable set of attributes or
features has already been found. In text analysis this presumption is more
than usually questionable.

For these and other reasons, a number of studies have appeared recently (e.g.
Burrows, 1992; Binongo, 1994; Burrows & Craig, 1994; Kjell, 1994;
Ledger & Merriam, 1994; Bayer et al., 1996) in which the features
used as indicators are not imposed by the prior judgement of the analyst but
are -- to a large extent -- determined by the texts being analyzed. The main
aim of the present paper is to advance this trend.

2. Method of Classification Used

As studying the strengths and weaknesses of different classification
techniques was not the main focus of this investigation, a simple
nearest-centroid classifier was used for the tests reported in this paper.
This algorithm, as implemented here, uses a training set of examples with
known class membership to compute a centroid (multi-dimensional average) for
each category. Then, for a fresh or unseen case, a measure of distance from
each class centroid is computed, and the case is assigned to the
category of the centroid it most resembles. This method is described
as "simultaneous key identification" by Sneath & Sokal (1973). It
belongs to a class of similarity-based classifiers which McKenzie &
Forsyth (1995), among others, have shown to be generally robust in practice.
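
To make the procedure concrete, here is a minimal sketch of such a
nearest-centroid classifier in Python. It is an illustration only: the
function names are mine, and the use of Euclidean distance is an assumption,
since the paper does not specify which distance measure its implementation
used.

    import numpy as np

    def train_centroids(X, y):
        """Compute a centroid (multi-dimensional average) for each category
        from a training set X of feature vectors with known labels y."""
        return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

    def classify(x, centroids):
        """Assign an unseen case x to the category whose centroid it most
        resembles (here: smallest Euclidean distance, an assumed measure)."""
        return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
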
3. The Bristol Benchmark Suite

In many areas of computing, benchmarking is a routine practice. For instance,
when compilers are tested for compliance to a programming-language
specification, it is normal to apply them to a suite of benchmark cases and
record any divergence from expected behaviour.

Although benchmarking does have drawbacks as well as advantages, it does help
in setting objective standards. For example, it is arguable that in the
field of forecasting, the work of Makridakis and colleagues (e.g. Makridakis
& Wheelwright, 1989), who tested a number of forecasting methods on
a wide range of (mostly economic) time series, transformed the field --
leading to both methodological and practical advances.

Even though billion-byte public-domain archives of text exist, e.g. Project
Gutenberg and the Oxford Text Archive, stylometry currently lacks an
equivalent set of accepted test problems. Therefore we at Bristol have
compiled a textual benchmark suite to meet this need. The current version of
this suite is known as Tbench96. Despite its deficiencies, it does present a
broader variety of test problems than other workers in stylometry and allied
fields have previously used.

4. Data-Driven Feature-Finding

The main focus of this investigation is a comparative examination of a number
of methods of data-driven feature-finding. Only methods that do not require
the exercise of judgemental expertise on the part of an investigator were
considered. Thus the feature types examined below share the following
desirable properties: (1) they are easy to compute; (2) they are
interlingual, i.e. they are not limited to the English language.

Perhaps the simplest possible method of obtaining a set of textual features
is simply to treat each letter of the alphabet as a feature, i.e. to count
the frequency of each of the 26 letters in each kilobyte line. This requires
no background knowledge except that the texts concerned are encoded using
the Roman alphabet. At first glance this would seem not just simple, but
simplistic. However, some previous studies -- most notably Ledger (1989) and
Ledger & Merriam (1994) -- have reported surprisingly good results
when using letter-counts as stylistic indicators. At the very least, this
establishes a baseline level of performance: more sophisticated feature
sets need to outperform letter counting to justify their added
complexity.

To avoid restricting other marker sets to an unrealistic size, the baseline
technique included here is actually that of Ule (1982) rather than that of
Ledger & Merriam (1994). Ule counted not just the frequencies of the
26 Latin letters A to Z (ignoring the distinction between upper and lower
case) but also the ten Arabic numerals 0 to 9.
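
As a concrete illustration, here is a minimal Python sketch of this baseline
feature set (the function name is mine; it simply profiles one text block):

    from collections import Counter
    import string

    ALPHANUM = string.ascii_uppercase + string.digits  # the 36 symbols A-Z, 0-9

    def alphanum_profile(block: str) -> list[int]:
        """Frequency of each of the 36 alphanumeric characters in one text
        block, ignoring the upper/lower-case distinction as Ule did."""
        counts = Counter(block.upper())
        return [counts[ch] for ch in ALPHANUM]
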
A natural progression from counting the 36 pre-specified alphanumeric
characters in each text block is simply to count the frequencies, in each
block, of the 36 commonest characters in the training text as a whole. This
allows spaces and other punctuation symbols to play a role as
discriminators. The danger here is that such characters may be less
intrinsic to the original texts than letters. For instance, Shakespeare's
punctuation is mostly the work of subsequent editors. Nevertheless, it was
decided to investigate the effect of moving from letter-counting to
character-counting. (In the two cases in Tbench96 where punctuation is not
original, punctuation marks have been removed from the files held on
disk.)
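
A sketch of this variant, under the assumption (mine) that case is folded in
the same way as for the letter counts:

    from collections import Counter

    def top_characters(training_text: str, k: int = 36) -> list[str]:
        """The k commonest characters in the training text as a whole;
        spaces and punctuation marks are eligible alongside letters."""
        return [ch for ch, _ in Counter(training_text.upper()).most_common(k)]

    def char_profile(block: str, charset: list[str]) -> list[int]:
        """Frequency of each selected character in one text block."""
        counts = Counter(block.upper())
        return [counts[ch] for ch in charset]
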
If counts of single characters are useful as features, then counts of
digrams, or of higher-order n-grams, may be useful too. Kjell (1994)
reported good
results in assigning Federalist essays written
either by Hamilton or Madison to the correct authors using a neural-network
classifier to which letter-pair frequencies were given as input features.
This approach was extended to characters (not just letters) and generalized
to include comparison of trigrams and tetragrams as well.
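
A minimal sketch of this generalization, assuming (my reading) that the
commonest n-grams in the training text are selected with the same cut-off of
36 used for the other feature sets:

    from collections import Counter

    def top_ngrams(training_text: str, n: int, k: int = 36) -> list[str]:
        """The k commonest character n-grams (n = 2, 3 or 4 for digrams,
        trigrams and tetragrams) in the combined training text; their
        per-block frequencies then serve as features."""
        grams = Counter(training_text[i:i + n]
                        for i in range(len(training_text) - n + 1))
        return [g for g, _ in grams.most_common(k)]
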
Another type of textual feature has been used extensively by Burrows (1992)
as well as Binongo (1994), among others. Essentially it involves finding the
most frequently used words and treating the rate of usage of each such word
in a given text as a quantitative attribute. The exact number of common
words used varies by author and application. Burrows and colleagues
(Burrows, 1992; Burrows & Craig, 1994) discuss examples using
anywhere from the 50 most common to the 100 most common words. Binongo
(1994) uses the commonest 36 words (after excluding pronouns). Greenwood
(1995) uses the commonest 32 (in New Testament Greek). In the present study,
the most frequent 36 words in the combined training samples of each problem
were used.
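
A sketch of this feature set in the same style (the letters-only
tokenization is my assumption; the paper does not spell out its definition
of a word):

    from collections import Counter
    import re

    def top_words(training_text: str, k: int = 36) -> list[str]:
        """The k most frequent words in the combined training samples."""
        words = re.findall(r"[a-z]+", training_text.lower())
        return [w for w, _ in Counter(words).most_common(k)]

    def word_rates(block: str, vocabulary: list[str]) -> list[float]:
        """Rate of usage of each common word in one text block."""
        words = re.findall(r"[a-z]+", block.lower())
        counts = Counter(words)
        return [counts[w] / max(len(words), 1) for w in vocabulary]
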
In addition, two novel approaches to textual feature-finding were tested in
the present study. These are described in sections 5 and 6.

5. Progressive Pairwise Chunking

The first of these derives what are called DOUBLETS in Table 1 from text
samples, by adapting a method proposed independently by Wolff (1975) and
Dawkins (1976) in different contexts.

This algorithm scans byte-encoded text sequences, seeking the most common
pair of symbols. At the end of each scan it replaces all occurrences of that
pair by a newly allocated code. This process is repeated for the next most
common pair, and so on till the requested number of pairings has been made.
The program used here assumes that character codes from 128 onwards are free for
reassignment (as is the case with Tbench96) so byte codes from 128 upwards
are allocated sequentially. As output the program produces a list of
doublets. These need not be digrams, since previously joined doublets can be
linked in later passes -- thus building up quite long chains, if they occur
in the data, without demanding great computational resources.
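
The following Python sketch captures the idea. It differs from the program
described above in one labelled respect: joined pairs are represented as
concatenated strings rather than as newly allocated byte codes from 128
upwards.

    from collections import Counter

    def pairwise_chunk(seq: list[str], n_pairs: int) -> list[str]:
        """Progressive pairwise chunking: repeatedly find the most common
        adjacent pair of symbols and replace all its occurrences with a
        single joined symbol, so that long chains can build up."""
        doublets = []
        for _ in range(n_pairs):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            joined = a + b                    # stands in for a new byte code
            doublets.append(joined)
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    merged.append(joined)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq = merged
        return doublets

Starting from seq = list(text), later passes can join previously joined
symbols, which is how chains longer than two characters arise.
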
The program thus gets away from the artificiality of using fixed-length
substrings such as digrams, trigrams or tetragrams. It is able to find
markers that are shorter than words (e.g. an affix such as `ed ') or longer
(e.g. a collocation such as `of the').

6. Monte-Carlo Feature-Finding

The last method of feature-finding tested in this study takes the idea
implicit in the progressive chunking method to what may be thought its
logical conclusion. Monte-Carlo Feature-Finding is simply a random search
for substrings that exist in the training data. Here this process is
implemented by a program called TEFF (Text Extending Feature Finder). This
finds markers merely by searching through a given set of training texts.

Essentially TEFF picks many substrings (3600 in the experiment reported here)
from the combined training text completely at random. The length of each
substring is also a random number from 1 to 8. All distinct substrings thus
found are saved and then ranked according to a distinctiveness measure, i.e.
according to a measure of their differential rate of occurrence in the
different text categories for the problem concerned. Chi-squared is used to
measure distinctiveness. For the present experiment only the most
distinctive 36 substrings were kept.
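
The random-sampling core of this process is simple enough to sketch in a few
lines of Python (the function name and signature are mine):

    import random

    def random_substrings(training_text: str, n: int = 3600,
                          max_len: int = 8) -> set[str]:
        """Pick n substrings of the combined training text completely at
        random, each starting at a random position and having a random
        length from 1 to max_len; returning a set keeps only the distinct
        substrings found."""
        found = set()
        for _ in range(n):
            length = random.randint(1, max_len)
            start = random.randrange(len(training_text) - length + 1)
            found.add(training_text[start:start + length])
        return found
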
6.1 Elastic Strings

However, earlier trials of Monte-Carlo Feature-Finding (Forsyth, 1995)
have indicated that the basic procedure, as described above, tends to
generate substrings that are fragmented at what seem linguistically
inappropriate boundary points, even when they prove effective as
discriminators. For instance, with samples from the Federalist Papers, a precursor of TEFF often found `pon' or
`upo' instead of `upon '. So TEFF incorporates a procedure that extends
each substring, as soon as it is generated, to the maximal length
justified by the training text. This refinement does not require the
program to have any knowledge of English orthography or morphology, but
it does eliminate the most glaring examples of improper text
fragmentation. For instance, applying this procedure to a list of
substrings produced from the Federalist data,
the substring `deraci' became ` confederacies'. Other expansions
included `partmen' to ` department' and `ongres' to ` congress'.
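
The paper describes this extension step only informally; one plausible
reading is sketched below, in which a substring is stretched one character
at a time, to the left or right, for as long as some single-character
extension occurs just as often in the training text, i.e. for as long as the
longer string is still fully justified by the data.

    import re

    def occurrences(s: str, corpus: str) -> int:
        return len(re.findall(re.escape(s), corpus))

    def extend(s: str, corpus: str) -> str:
        """Stretch s while a one-character extension occurs exactly as
        often as s itself, so that e.g. `deraci' can grow into
        ` confederacies' without any knowledge of orthography."""
        target = occurrences(s, corpus)
        if target == 0:
            return s
        while True:
            spots = [m.start() for m in re.finditer(re.escape(s), corpus)]
            candidates = ({corpus[i - 1] + s for i in spots if i > 0} |
                          {s + corpus[i + len(s)]
                           for i in spots if i + len(s) < len(corpus)})
            for cand in candidates:
                if occurrences(cand, corpus) == target:
                    s = cand    # extension justified; try to grow further
                    break
            else:
                return s
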
6.2 Measuring Distinctiveness

The measure of distinctiveness used in TEFF is Chi-squared (Pearson,
1900), with expected values computed on the basis of equal rates of
occurrence in the various categories of text being examined. This has
been reported as effective in characterizing textual differences by
McMahon et al. (1978) and Hofland & Johansson (1982). However,
Chi-squared has also been criticized, e.g. by Church & Hanks
(1989) and by Kilgarriff (1996), on the grounds that overall frequency
does not distinguish words or strings that occur plentifully in only one
section of a text from those that occur more steadily throughout it.
Ideally we want markers that permeate a particular kind of text. Thus:
"an ideal index should actually be a measure of permeation;
that is, it should consider both frequency and distribution
information."Stone & Dunphy (1966), p. 33.In pursuit of this ideal, TEFF offers two measures of distinctiveness. In
the first, Chi-squared is computed using unadjusted frequency counts; in
the second, the square root of the count for each kilobyte line is
accumulated rather than the frequency count itself. This latter measure
penalizes strings that occur very often but only in a narrow segment of
the text. It thus has a better claim to be an index of permeation than
the unmodified Chi-squared score. Both variants were compared in the
present experiment.
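
Both variants can be expressed compactly. The sketch below assumes (for
simplicity, my assumption) equally sized training samples per category, so
that the equal-rates expectation is just the mean of the per-category
totals.

    import math

    def distinctiveness(block_counts: dict[str, list[int]],
                        use_sqrt: bool = False) -> float:
        """Chi-squared distinctiveness of one substring. block_counts maps
        each text category to the substring's count in each of its
        kilobyte lines. With use_sqrt=True the square root of each count
        is accumulated instead of the count itself, penalizing strings
        that occur very often but only in a narrow segment of the text."""
        observed = {cat: sum(math.sqrt(c) if use_sqrt else c for c in blocks)
                    for cat, blocks in block_counts.items()}
        expected = sum(observed.values()) / len(observed)  # equal rates
        if expected == 0:
            return 0.0
        return sum((o - expected) ** 2 / expected for o in observed.values())
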
7. Results

The main response variable measured is the percentage of correct
classifications made on unseen test data. Mean values are shown in Table
1.

Table 1 -- Mean Success Rates on Test Data.

  Marker Type                Percent Correct   Klecka's tau
  0.  ALPHANUM                    69.15           0.4369
  1.  UNIGRAMS                    73.41           0.5078
  2.  DIGRAMS                     70.11           0.4528
  3.  TRIGRAMS                    69.65           0.4426
  4.  TETRAGRAMS                  68.89           0.4301
  5.  WORDS                       68.99           0.4309
  6a. DOUBLETS (combined)         70.26           0.4527
  6b. DOUBLETS (merged)           69.38           0.4400
  7a. TEFFSUBS (raw counts)       73.77           0.5203
  7b. TEFFSUBS (sqrt counts)      75.00           0.5325

These figures are aggregated over all 13 problems in Tbench96. Percentage
success rate is given because it is familiar and thus easy to interpret.
However, as different problems vary in the number of categories to be
distinguished (2, 3 or 4), the rightmost column gives a measure of accuracy
that compensates for unequal numbers of categories (Klecka, 1980).

Using an overall significance level of 0.05, Hsu's multiple-comparison method
indicated that all but three methods were significantly different from the
best. These three were: UNIGRAMS, TEFFSUBS using raw counts, and TEFFSUBS
using the square-root transformation. Only these three candidates remain as
realistic contenders to be the best method of the 10 tested here.

8. Discussion

A variety of stylometric indicator types have been examined empirically. This
has given rise to a preliminary `pecking order' among marker types based on
results obtained on a benchmark suite of text classification problems. This
ranking suggests firstly that it is worthwhile to allow non-alphanumeric
characters as textual features as well as alphanumeric ones, and secondly
that it is worthwhile to select textual features for distinctiveness rather
than simply by frequency.

It is also worth noting that one of the simplest ways of finding textual
features (UNIGRAMS, which just uses the commonest 36 characters in the
training texts) is also one of the best. Moreover, no advantage was evident
from using longer n-grams.

The two novel types of textual marker tested performed creditably. Both merit
serious consideration by future researchers in this area. DOUBLETS obtained
by progressive pairwise chunking would appear to be at least as informative
as common words, which have previously held pride of place in stylometry.

Of course this conclusion only carries weight to the extent that the
benchmark suite used here (Tbench96) is adequate, and it can only be
regarded as a prototype. Nevertheless the very idea of using an agreed suite
of benchmark problems is in itself a contribution to stylometry; and while
Tbench96 is admittedly imperfect, it already presents a wider range of
problems than customarily used, and provides a starting point for future
developments.

Acknowledgements

I thank Dr David Holmes and Professor Colin Martindale for providing some of
the text files used in this benchmarking exercise, as well as for helpful
comments.

References

Bayer, T., Renz, I., Stein, K. & Kressel, U. (1996). Domain and Language
Independent Feature Extraction for Statistical Text Categorization. In L.J.
Evett & T.G. Rose (eds.), Language Engineering for Document Analysis and
Recognition. Nottingham: Nottingham Trent University.

Binongo, J.N.G. (1994). Joaquin's Joaquinesquerie, Joaquinesquerie's
Joaquin: A Statistical Expression of a Filipino Writer's Style. Literary &
Linguistic Computing, 9(4), 267-279.

Burrows, J.F. (1992). Not unless you Ask Nicely: the Interpretive Nexus
between Analysis and Information. Literary & Linguistic Computing, 7(2),
91-109.

Burrows, J.F. & Craig, D.H. (1994). Lyrical Drama and the "Turbid
Montebanks": Styles of Dialogue in Romantic and Renaissance Tragedy.
Computers & the Humanities, 28, 63-86.

Church, K. & Hanks, P. (1989). Word Association Norms, Mutual Information
and Lexicography. Proc. 27th Annual ACL Meeting, Vancouver, 76-83.

Dawkins, R. (1976). Hierarchical Organisation: a Candidate Principle for
Ethology. In P.P.G. Bateson & R.A. Hinde (eds.), Growing Points in Ethology.
Cambridge University Press.

Everitt, B.S. & Dunn, G. (1991). Applied Multivariate Data Analysis. London:
Edward Arnold.

Forsyth, R.S. (1995). Stylistic Structures: a Computational Approach to Text
Classification. Unpublished Doctoral Thesis, Faculty of Science, University
of Nottingham.

Greenwood, H.H. (1995). Common Word Frequencies and Authorship in Luke's
Gospel and Acts. Literary & Linguistic Computing, 10(3), 183-187.

Holmes, D.I. (1994). Authorship Attribution. Computers & the Humanities, 28,
1-20.

Kilgarriff, A. (1996). Which Words are Particularly Characteristic of a
Text? In L.J. Evett & T.G. Rose (eds.), Language Engineering for Document
Analysis and Recognition. Nottingham: Nottingham Trent University.

Kjell, B. (1994). Authorship Determination Using Letter Pair Frequency
Features with Neural Net Classifiers. Literary & Linguistic Computing, 9(2),
119-124.

Klecka, W.R. (1980). Discriminant Analysis. Newbury Park: Sage Publications.

Ledger, G.R. (1989). Re-Counting Plato. Oxford: Oxford University Press.

Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and the Two
Noble Kinsmen. Literary & Linguistic Computing, 9(3), 235-248.

Makridakis, S. & Wheelwright, S.C. (1989). Forecasting Methods for Managers,
fifth edition. New York: John Wiley & Sons.

McKenzie, D.P. & Forsyth, R.S. (1995). Classification by Similarity: An
Overview of Statistical Methods of Case-Based Reasoning. Computers in Human
Behavior, 11(2), 273-288.

McMahon, L.E., Cherry, L.L. & Morris, R. (1978). Statistical Text
Processing. Bell System Technical Journal, 57(6), 2137-2154.

Mosteller, F. & Wallace, D.L. (1984). Applied Bayesian and Classical
Inference: the Case of the Federalist Papers. New York: Springer-Verlag.
[Extended edition of: Mosteller & Wallace (1964). Inference and Disputed
Authorship: the Federalist. Addison-Wesley, Reading, Massachusetts.]

Pearson, K. (1900). On the Criterion that a Given System of Deviations from
the Probable in the Case of a Correlated System of Variables is Such that It
Can Reasonably Be Supposed to Have Arisen from Random Sampling.
Philosophical Magazine, 50, 157-176.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo,
California: Morgan Kaufmann.

Sneath, P.H.A. & Sokal, R.R. (1973). Numerical Taxonomy. San Francisco: W.H.
Freeman & Co.

Stone, P.J., Dunphy, D.C., Smith, M.S. & Ogilvie, D.M. (1966). The General
Inquirer: a Computer Approach to Content Analysis in the Behavioral
Sciences. Cambridge, Mass.: MIT Press.

Ule, L. (1982). Recent Progress in Computer Methods of Authorship
Determination. ALLC Bulletin, 10(3), 73-89.

Wolff, J.G. (1975). An Algorithm for the Segmentation of an Artificial
Language Analogue. British Journal of Psychology, 66(1), 79-90.