Short substrings as document discriminators: An empirical study

Richard Forsyth
University of the West of England
rs-forsyth@csm.uwe.ac.uk

ACH/ALLC 1997

Keywords: Monte-Carlo methods, stylometry, text categorization

1. Introduction

In seeking to capture consistent and distinctive features of linguistic
style, stylometrists have used a variety of textual indicators (see: Holmes,
1994). In the majority of stylometric studies, however, the choice of which
indicators to use in a given problem is left to the discretion of the
investigator. This practice permits the exercise of human judgement, and
thus can sometimes save a time-consuming search for suitable descriptors;
but it inevitably involves subjectivity. Very often the choice of suitable
linguistic markers is crucial to the development of an effective
discriminant rule; but, being subjective, it may not be replicable on
another problem.

The situation is similar in the related fields of multivariate pattern
recognition and machine learning (Everitt & Dunn, 1991; Quinlan,
1993): most studies begin by presuming that a suitable set of attributes or
features has already been found. In text analysis this presumption is more
than usually questionable.

For these and other reasons, a number of studies have appeared recently (e.g.
Burrows, 1992; Binongo, 1994; Burrows & Craig, 1994; Kjell, 1994;
Ledger & Merriam, 1994; Bayer et al., 1996) in which the features
used as indicators are not imposed by the prior judgement of the analyst but
are -- to a large extent -- determined by the texts being analyzed. The main
aim of the present paper is to advance this trend.

2. Method of Classification Used

As studying the strengths and weaknesses of different classification
techniques was not the main focus of this investigation, a simple
nearest-centroid classifier was used for the tests reported in this paper.
This algorithm, as implemented here, uses a training set of examples with
known class membership to compute a centroid (multi-dimensional average) for
each category. Then, for a fresh or unseen case, a measure of distance from
each class centroid is computed, and the case is assigned to the
category of the centroid it most resembles. This method is described
as "simultaneous key identification" by Sneath & Sokal (1973). It
belongs to a class of similarity-based classifiers which McKenzie &
Forsyth (1995), among others, have shown to be generally robust in practice.
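
To make the procedure concrete, here is a minimal sketch of such a
nearest-centroid classifier in Python. It is an illustration only: the
function names are mine, and the use of Euclidean distance is an assumption,
since the paper does not specify which distance measure its implementation
used.

    import numpy as np

    def train_centroids(X, y):
        """Compute a centroid (multi-dimensional average) for each category
        from a training set X of feature vectors with known labels y."""
        return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

    def classify(x, centroids):
        """Assign an unseen case x to the category whose centroid it most
        resembles (here: smallest Euclidean distance, an assumed measure)."""
        return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
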
3. The Bristol Benchmark Suite

In many areas of computing, benchmarking is a routine practice. For instance,
when compilers are tested for compliance to a programming-language
specification, it is normal to apply them to a suite of benchmark cases and
record any divergence from expected behaviour.

Although benchmarking does have drawbacks as well as advantages, it does help
in setting objective standards. For example, it is arguable that in the
field of forecasting, the work of Makridakis and colleagues (e.g. Makridakis
& Wheelwright, 1989), who tested a number of forecasting methods on
a wide range of (mostly economic) time series, transformed the field --
leading to both methodological and practical advances.

Even though billion-byte public-domain archives of text exist, e.g. Project
Gutenberg and the Oxford Text Archive, stylometry currently lacks an
equivalent set of accepted test problems. Therefore we at Bristol have
compiled a textual benchmark suite to meet this need. The current version of
this suite is known as Tbench96. Despite its deficiencies, it does present a
broader variety of test problems than other workers in stylometry and allied
fields have previously used.

4. Data-Driven Feature-Finding

The main focus of this investigation is a comparative examination of a number
of methods of data-driven feature-finding. Only methods that do not require
the exercise of judgemental expertise on the part of an investigator were
considered. Thus the feature types examined below share the following
desirable properties: (1) they are easy to compute; (2) they are
interlingual, i.e. they are not limited to the English language.

Perhaps the simplest possible method of obtaining a set of textual features
is simply to treat each letter of the alphabet as a feature, i.e. to count
the frequency of each of the 26 letters in each kilobyte line. This requires
no background knowledge except that the texts concerned are encoded using
the Roman alphabet. At first glance this would seem not just simple, but
simplistic. However, some previous studies -- most notably Ledger (1989) and
Ledger & Merriam (1994) -- have reported surprisingly good results
when using letter-counts as stylistic indicators. At the very least, this
establishes a baseline level of performance: more sophisticated feature
sets need to outperform letter counting to justify their added
complexity.

To avoid restricting other marker sets to an unrealistic size, the baseline
technique included here is actually that of Ule (1982) rather than that of
Ledger & Merriam (1994). Ule counted not just the frequencies of the
26 Latin letters A to Z (ignoring the distinction between upper and lower
case) but also the ten Arabic numerals 0 to 9.
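
As a concrete illustration, here is a minimal Python sketch of this baseline
feature set (the function name is mine; it simply profiles one text block):

    from collections import Counter
    import string

    ALPHANUM = string.ascii_uppercase + string.digits  # the 36 symbols A-Z, 0-9

    def alphanum_profile(block: str) -> list[int]:
        """Frequency of each of the 36 alphanumeric characters in one text
        block, ignoring the upper/lower-case distinction as Ule did."""
        counts = Counter(block.upper())
        return [counts[ch] for ch in ALPHANUM]
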
A natural progression from counting the 36 pre-specified alphanumeric
characters in each text block is simply to count the frequencies, in each
block, of the 36 commonest characters in the training text as a whole. This
allows spaces and other punctuation symbols to play a role as
discriminators. The danger here is that such characters may be less
intrinsic to the original texts than letters. For instance, Shakespeare's
punctuation is mostly the work of subsequent editors. Nevertheless, it was
decided to investigate the effect of moving from letter-counting to
character-counting. (In the two cases in Tbench96 where punctuation is not
original, punctuation marks have been removed from the files held on
disk.)
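
A sketch of this variant, under the assumption (mine) that case is folded in
the same way as for the letter counts:

    from collections import Counter

    def top_characters(training_text: str, k: int = 36) -> list[str]:
        """The k commonest characters in the training text as a whole;
        spaces and punctuation marks are eligible alongside letters."""
        return [ch for ch, _ in Counter(training_text.upper()).most_common(k)]

    def char_profile(block: str, charset: list[str]) -> list[int]:
        """Frequency of each selected character in one text block."""
        counts = Counter(block.upper())
        return [counts[ch] for ch in charset]
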
If counts of single characters are useful as features, then counts of
digrams, or of higher-order n-grams, may be useful too. Kjell (1994)
reported good
results in assigning Federalist essays written
either by Hamilton or Madison to the correct authors using a neural-network
classifier to which letter-pair frequencies were given as input features.
This approach was extended to characters (not just letters) and generalized
to include comparison of trigrams and tetragrams as well.
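
A minimal sketch of this generalization, assuming (my reading) that the
commonest n-grams in the training text are selected with the same cut-off of
36 used for the other feature sets:

    from collections import Counter

    def top_ngrams(training_text: str, n: int, k: int = 36) -> list[str]:
        """The k commonest character n-grams (n = 2, 3 or 4 for digrams,
        trigrams and tetragrams) in the combined training text; their
        per-block frequencies then serve as features."""
        grams = Counter(training_text[i:i + n]
                        for i in range(len(training_text) - n + 1))
        return [g for g, _ in grams.most_common(k)]
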
Another type of textual feature has been used extensively by Burrows (1992)
as well as Binongo (1994), among others. Essentially it involves finding the
most frequently used words and treating the rate of usage of each such word
in a given text as a quantitative attribute. The exact number of common
words used varies by author and application. Burrows and colleagues
(Burrows, 1992; Burrows & Craig, 1994) discuss examples using
anywhere from the 50 most common to the 100 most common words. Binongo
(1994) uses the commonest 36 words (after excluding pronouns). Greenwood
(1995) uses the commonest 32 (in New Testament Greek). In the present study,
the most frequent 36 words in the combined training samples of each problem
were used.
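
A sketch of this feature set in the same style (the letters-only
tokenization is my assumption; the paper does not spell out its definition
of a word):

    from collections import Counter
    import re

    def top_words(training_text: str, k: int = 36) -> list[str]:
        """The k most frequent words in the combined training samples."""
        words = re.findall(r"[a-z]+", training_text.lower())
        return [w for w, _ in Counter(words).most_common(k)]

    def word_rates(block: str, vocabulary: list[str]) -> list[float]:
        """Rate of usage of each common word in one text block."""
        words = re.findall(r"[a-z]+", block.lower())
        counts = Counter(words)
        return [counts[w] / max(len(words), 1) for w in vocabulary]
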
In addition, two novel approaches to textual feature-finding were tested in
the present study. These are described in sections 5 and 6.

5. Progressive Pairwise Chunking

The first of these derives what are called DOUBLETS in Table 1 from text
samples, by adapting a method proposed independently by Wolff (1975) and
Dawkins (1976) in different contexts.

This algorithm scans byte-encoded text sequences, seeking the most common
pair of symbols. At the end of each scan it replaces all occurrences of that
pair by a newly allocated code. This process is repeated for the next most
common pair, and so on till the requested number of pairings has been made.
The program used here assumes that character codes from 128 onwards are free for
reassignment (as is the case with Tbench96) so byte codes from 128 upwards
are allocated sequentially. As output the program produces a list of
doublets. These need not be digrams, since previously joined doublets can be
linked in later passes -- thus building up quite long chains, if they occur
in the data, without demanding great computational resources.
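
The following Python sketch captures the idea. It differs from the program
described above in one labelled respect: joined pairs are represented as
concatenated strings rather than as newly allocated byte codes from 128
upwards.

    from collections import Counter

    def pairwise_chunk(seq: list[str], n_pairs: int) -> list[str]:
        """Progressive pairwise chunking: repeatedly find the most common
        adjacent pair of symbols and replace all its occurrences with a
        single joined symbol, so that long chains can build up."""
        doublets = []
        for _ in range(n_pairs):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            joined = a + b                    # stands in for a new byte code
            doublets.append(joined)
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    merged.append(joined)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq = merged
        return doublets

Starting from seq = list(text), later passes can join previously joined
symbols, which is how chains longer than two characters arise.
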
The program thus gets away from the artificiality of using fixed-length
substrings such as digrams, trigrams or tetragrams. It is able to find
markers that are shorter than words (e.g. an affix such as `ed ') or longer
(e.g. a collocation such as `of the').

6. Monte-Carlo Feature-Finding

The last method of feature-finding tested in this study takes the idea
implicit in the progressive chunking method to what may be thought its
logical conclusion. Monte-Carlo Feature-Finding is simply a random search
for substrings that exist in the training data. Here this process is
implemented by a program called TEFF (Text Extending Feature Finder). This
finds markers merely by searching through a given set of training texts.

Essentially TEFF picks many substrings (3600 in the experiment reported here)
from the combined training text completely at random. The length of each
substring is also a random number from 1 to 8. All distinct substrings thus
found are saved and then ranked according to a distinctiveness measure, i.e.
according to a measure of their differential rate of occurrence in the
different text categories for the problem concerned. Chi-squared is used to
measure distinctiveness. For the present experiment only the most
distinctive 36 substrings were kept.
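
The random-sampling core of this process is simple enough to sketch in a few
lines of Python (the function name and signature are mine):

    import random

    def random_substrings(training_text: str, n: int = 3600,
                          max_len: int = 8) -> set[str]:
        """Pick n substrings of the combined training text completely at
        random, each starting at a random position and having a random
        length from 1 to max_len; returning a set keeps only the distinct
        substrings found."""
        found = set()
        for _ in range(n):
            length = random.randint(1, max_len)
            start = random.randrange(len(training_text) - length + 1)
            found.add(training_text[start:start + length])
        return found
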
6.1 Elastic Strings

However, earlier trials of Monte-Carlo Feature-Finding (Forsyth, 1995)
have indicated that the basic procedure, as described above, tends to
generate substrings that are fragmented at what seem linguistically
inappropriate boundary points, even when they prove effective as
discriminators. For instance, with samples from the Federalist Papers, a precursor of TEFF often found `pon' or
`upo' instead of `upon '. So TEFF incorporates a procedure that extends
each substring, as soon as it is generated, to the maximal length
justified by the training text. This refinement does not require the
program to have any knowledge of English orthography or morphology, but
it does eliminate the most glaring examples of improper text
fragmentation. For instance, applying this procedure to a list of
substrings produced from the Federalist data,
the substring `deraci' became ` confederacies'. Other expansions
included `partmen' to ` department' and `ongres' to ` congress'.
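
The paper describes this extension step only informally; one plausible
reading is sketched below, in which a substring is stretched one character
at a time, to the left or right, for as long as some single-character
extension occurs just as often in the training text, i.e. for as long as the
longer string is still fully justified by the data.

    import re

    def occurrences(s: str, corpus: str) -> int:
        return len(re.findall(re.escape(s), corpus))

    def extend(s: str, corpus: str) -> str:
        """Stretch s while a one-character extension occurs exactly as
        often as s itself, so that e.g. `deraci' can grow into
        ` confederacies' without any knowledge of orthography."""
        target = occurrences(s, corpus)
        if target == 0:
            return s
        while True:
            spots = [m.start() for m in re.finditer(re.escape(s), corpus)]
            candidates = ({corpus[i - 1] + s for i in spots if i > 0} |
                          {s + corpus[i + len(s)]
                           for i in spots if i + len(s) < len(corpus)})
            for cand in candidates:
                if occurrences(cand, corpus) == target:
                    s = cand    # extension justified; try to grow further
                    break
            else:
                return s
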
6.2 Measuring Distinctiveness

The measure of distinctiveness used in TEFF is Chi-squared (Pearson,
1900), with expected values computed on the basis of equal rates of
occurrence in the various categories of text being examined. This has
been reported as effective in characterizing textual differences by
McMahon et al. (1978) and Hofland & Johansson (1982). However,
Chi-squared has also been criticized, e.g. by Church & Hanks
(1989) and by Kilgarriff (1996), on the grounds that overall frequency
does not distinguish words or strings that occur plentifully in only one
section of a text from those that occur more steadily throughout it.
Ideally we want markers that permeate a particular kind of text. Thus:
"an ideal index should actually be a measure of permeation;
that is, it should consider both frequency and distribution
information."Stone & Dunphy (1966), p. 33.In pursuit of this ideal, TEFF offers two measures of distinctiveness. In
the first, Chi-squared is computed using unadjusted frequency counts; in
the second, the square root of the count for each kilobyte line is
accumulated rather than the frequency count itself. This latter measure
penalizes strings that occur very often but only in a narrow segment of
the text. It thus has a better claim to be an index of permeation than
the unmodified Chi-squared score. Both variants were compared in the
present experiment.
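
Both variants can be expressed compactly. The sketch below assumes (for
simplicity, my assumption) equally sized training samples per category, so
that the equal-rates expectation is just the mean of the per-category
totals.

    import math

    def distinctiveness(block_counts: dict[str, list[int]],
                        use_sqrt: bool = False) -> float:
        """Chi-squared distinctiveness of one substring. block_counts maps
        each text category to the substring's count in each of its
        kilobyte lines. With use_sqrt=True the square root of each count
        is accumulated instead of the count itself, penalizing strings
        that occur very often but only in a narrow segment of the text."""
        observed = {cat: sum(math.sqrt(c) if use_sqrt else c for c in blocks)
                    for cat, blocks in block_counts.items()}
        expected = sum(observed.values()) / len(observed)  # equal rates
        if expected == 0:
            return 0.0
        return sum((o - expected) ** 2 / expected for o in observed.values())
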
7. Results

The main response variable measured is the percentage of correct
classifications made on unseen test data. Mean values are shown in Table
1.

Table 1 -- Mean Success Rates on Test Data.

  Marker Type                Percent Correct   Klecka's tau
  0.  ALPHANUM                    69.15           0.4369
  1.  UNIGRAMS                    73.41           0.5078
  2.  DIGRAMS                     70.11           0.4528
  3.  TRIGRAMS                    69.65           0.4426
  4.  TETRAGRAMS                  68.89           0.4301
  5.  WORDS                       68.99           0.4309
  6a. DOUBLETS (combined)         70.26           0.4527
  6b. DOUBLETS (merged)           69.38           0.4400
  7a. TEFFSUBS (raw counts)       73.77           0.5203
  7b. TEFFSUBS (sqrt counts)      75.00           0.5325

These figures are aggregated over all 13 problems in Tbench96. Percentage
success rate is given because it is familiar and thus easy to interpret.
However, as different problems vary in the number of categories to be
distinguished (2, 3 or 4), the rightmost column gives a measure of accuracy
that compensates for unequal numbers of categories (Klecka, 1980).

Using an overall significance level of 0.05, Hsu's multiple-comparison method
indicated that all but three methods were significantly different from the
best. These three were: UNIGRAMS, TEFFSUBS using raw counts, and TEFFSUBS
using the square-root transformation. Only these three candidates remain as
realistic contenders to be the best method of the 10 tested here.

8. Discussion

A variety of stylometric indicator types have been examined empirically. This
has given rise to a preliminary `pecking order' among marker types based on
results obtained on a benchmark suite of text classification problems. This
ranking suggests firstly that it is worthwhile to allow non-alphanumeric
characters as textual features as well as alphanumeric ones, and secondly
that it is worthwhile to select textual features for distinctiveness rather
than simply by frequency.

It is also worth noting that one of the simplest ways of finding textual
features (UNIGRAMS, which just uses the commonest 36 characters in the
training texts) is also one of the best. Moreover, no advantage was evident
from using longer n-grams.

The two novel types of textual marker tested performed creditably. Both merit
serious consideration by future researchers in this area. DOUBLETS obtained
by progressive pairwise chunking would appear to be at least as informative
as common words, which have previously held pride of place in stylometry.

Of course this conclusion only carries weight to the extent that the
benchmark suite used here (Tbench96) is adequate, and it can only be
regarded as a prototype. Nevertheless the very idea of using an agreed suite
of benchmark problems is in itself a contribution to stylometry; and while
Tbench96 is admittedly imperfect, it already presents a wider range of
problems than customarily used, and provides a starting point for future
developments.

Acknowledgements

I thank Dr David Holmes and Professor Colin Martindale for providing some of
the text files used in this benchmarking exercise, as well as for helpful
comments.

References

Bayer, T., Renz, I., Stein, K. & Kressel, U. (1996). Domain and Language
Independent Feature Extraction for Statistical Text Categorization. In L.J.
Evett & T.G. Rose (eds.), Language Engineering for Document Analysis and
Recognition. Nottingham: Nottingham Trent University.

Binongo, J.N.G. (1994). Joaquin's Joaquinesquerie, Joaquinesquerie's
Joaquin: A Statistical Expression of a Filipino Writer's Style. Literary &
Linguistic Computing, 9(4), 267-279.

Burrows, J.F. (1992). Not unless you Ask Nicely: the Interpretive Nexus
between Analysis and Information. Literary & Linguistic Computing, 7(2),
91-109.

Burrows, J.F. & Craig, D.H. (1994). Lyrical Drama and the "Turbid
Montebanks": Styles of Dialogue in Romantic and Renaissance Tragedy.
Computers & the Humanities, 28, 63-86.

Church, K. & Hanks, P. (1989). Word Association Norms, Mutual Information
and Lexicography. Proc. 27th Annual ACL Meeting, Vancouver, 76-83.

Dawkins, R. (1976). Hierarchical Organisation: a Candidate Principle for
Ethology. In P.P.G. Bateson & R.A. Hinde (eds.), Growing Points in Ethology.
Cambridge University Press.

Everitt, B.S. & Dunn, G. (1991). Applied Multivariate Data Analysis. London:
Edward Arnold.

Forsyth, R.S. (1995). Stylistic Structures: a Computational Approach to Text
Classification. Unpublished Doctoral Thesis, Faculty of Science, University
of Nottingham.

Greenwood, H.H. (1995). Common Word Frequencies and Authorship in Luke's
Gospel and Acts. Literary & Linguistic Computing, 10(3), 183-187.

Holmes, D.I. (1994). Authorship Attribution. Computers & the Humanities, 28,
1-20.

Kilgarriff, A. (1996). Which Words are Particularly Characteristic of a
Text? In L.J. Evett & T.G. Rose (eds.), Language Engineering for Document
Analysis and Recognition. Nottingham: Nottingham Trent University.

Kjell, B. (1994). Authorship Determination Using Letter Pair Frequency
Features with Neural Net Classifiers. Literary & Linguistic Computing, 9(2),
119-124.

Klecka, W.R. (1980). Discriminant Analysis. Newbury Park: Sage Publications.

Ledger, G.R. (1989). Re-Counting Plato. Oxford: Oxford University Press.

Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and the Two
Noble Kinsmen. Literary & Linguistic Computing, 9(3), 235-248.

Makridakis, S. & Wheelwright, S.C. (1989). Forecasting Methods for Managers,
fifth edition. New York: John Wiley & Sons.

McKenzie, D.P. & Forsyth, R.S. (1995). Classification by Similarity: An
Overview of Statistical Methods of Case-Based Reasoning. Computers in Human
Behavior, 11(2), 273-288.

McMahon, L.E., Cherry, L.L. & Morris, R. (1978). Statistical Text
Processing. Bell System Technical Journal, 57(6), 2137-2154.

Mosteller, F. & Wallace, D.L. (1984). Applied Bayesian and Classical
Inference: the Case of the Federalist Papers. New York: Springer-Verlag.
[Extended edition of: Mosteller & Wallace (1964). Inference and Disputed
Authorship: the Federalist. Addison-Wesley, Reading, Massachusetts.]

Pearson, K. (1900). On the Criterion that a Given System of Deviations from
the Probable in the Case of a Correlated System of Variables is Such that It
Can Reasonably Be Supposed to Have Arisen from Random Sampling.
Philosophical Magazine, 50, 157-176.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo,
California: Morgan Kaufmann.

Sneath, P.H.A. & Sokal, R.R. (1973). Numerical Taxonomy. San Francisco: W.H.
Freeman & Co.

Stone, P.J., Dunphy, D.C., Smith, M.S. & Ogilvie, D.M. (1966). The General
Inquirer: a Computer Approach to Content Analysis in the Behavioral
Sciences. Cambridge, Mass.: MIT Press.

Ule, L. (1982). Recent Progress in Computer Methods of Authorship
Determination. ALLC Bulletin, 10(3), 73-89.

Wolff, J.G. (1975). An Algorithm for the Segmentation of an Artificial
Language Analogue. British Journal of Psychology, 66(1), 79-90.