Towards a text benchmark suite

Richard S. Forsyth
University of the West of England
rs-forsyth@csm.uwe.ac.uk

ACH/ALLC 1997
Editor: the secretarial staff in the Department of French Studies at Queen's University; Greg Lessard
Encoder: Sara A. Schmidt

Keywords: benchmarking, stylometry, text categorization

1. Introduction

In many areas of computing, benchmarking is a routine practice. There is
insufficient room here to go into the pros and cons of benchmarking in any
depth, except to acknowledge that sets of benchmarks do have drawbacks as
well as advantages. Nevertheless benchmarking does have a role to play in
setting objective standards. For example, in the field of forecasting, the
work of Makridakis and colleagues (e.g. Makridakis & Wheelwright,
1989), who tested a number of forecasting methods on a wide range of time
series, transformed the field -- leading to both methodological and
practical advances. Likewise, in machine learning, the general acceptance of
the Machine-Learning Database Repository (Murphy & Aha, 1991) as an
agreed standard, and its employment in extensive comparative tests (e.g.
Michie et al., 1994) has thrown new light on the strengths and weaknesses of
competing algorithms.

Although billion-byte public-domain archives of text exist, e.g. Project
Gutenberg and the Oxford Text Archive, stylometry currently lacks an
equivalent set of accepted test problems. Therefore we at Bristol have
compiled a textual benchmark suite. The current version of this suite is
known as Tbench96. Despite its deficiencies, it does present a broader
variety of test problems than other workers in stylometry and allied fields
have previously used.

1.1 Selection Criteria

The text-categorization problems in this suite were selected to fulfil a
number of requirements.

1. Provenance: the true category of each text should be well attested.
2. Variety: problems other than authorship should be included.
3. Language: not all the texts should be in English.
4. Difficulty: both hard and easy problems should be included.
5. Size: the training texts should be of `modest' size, such as might be
   expected in practical applications.

The last point may need amplification. Although some huge text samples
are available, most text-classification tasks in real life require
decisions to be made on the basis of samples in the order of thousands
or tens of thousands of words. An enormous training sample of undisputed
text is, therefore, something of a luxury.

Subject to these constraints, 13 test problems were chosen: four
authorship problems, three chronology problems, three content-based
problems, and three miscellaneous problems. As is usual in machine
learning, each category of text was divided into non-overlapping
training and test sets. See section 2.

1.2 Pre-processing

In order to impose uniformity of layout and thus reduce the effect of
factors such as line-length (not usually an authorial decision) all text
samples have been passed through a program called PRETEXT. This program
makes some minor formatting changes, e.g. case-folding and conversion of
tabs into blanks. However, the most important change made is to break
running text into segments that are then treated as cases to be
classified.

Just what constitutes a natural unit of text is by no means obvious.
Different researchers have made different decisions about the best way
of segmenting long texts. Some have used fixed-length blocks (e.g.
Elliott & Valenza, 1991); others have respected natural
subdivisions in the text (e.g. Ule, 1982). Both approaches have merits
as well as disadvantages.

Because linguistic materials have a hierarchical structure there is no
universally correct segmentation scheme. In Tbench96 each block boundary
is taken as the first new-line in the text on or after the 999th byte in
the block being formed. Such units will be referred to as kilobyte
lines.

The number of words per kilobyte line varies according to the type of
writing. A representative figure for Tbench96 as a whole is 185 words
per line. Thus this is an attempt to work with text units near the lower
limit of what has previously been considered feasible. Evidence of this
is provided by the two quotations below, made 20 years apart.

"It is clear in the present study that there is considerable loss in
discriminatory power when samples fall below 500 words". (Baillie, 1974)

"We do not think it likely that authorship characteristics would be
strongly apparent at levels below say 500 words, or approximately 2500
letters. Even using 500 word samples we should anticipate a great deal
of unevenness, and that expectation is confirmed by these results."
(Ledger & Merriam, 1994)

Although Felton (1996) has studied 100-word text blocks (in New Testament
Greek) and Simonton (1990) even analyzed word usage in the final
couplets of Shakespeare's 154 sonnets (averaging 17.6 words each), the
block size in Tbench96 is small relative to most previous stylometric
studies; therefore it poses a relatively challenging series of tests.
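
The pre-processing and segmentation rule described in this section can be sketched in Python. This is not the PRETEXT program itself, whose source is not reproduced here. The function names (normalize, kilobyte_lines, words_per_unit) are hypothetical, and three details are assumptions: case-folding is taken to mean lower-casing, the new-line is kept with the block it terminates, and a short final remainder that never reaches its 999th byte is kept as a block rather than discarded.

```python
def normalize(text: str) -> str:
    """Apply the two formatting changes named in the text.

    Assumption: case-folding means lower-casing; PRETEXT may differ.
    """
    return text.replace("\t", " ").lower()


def kilobyte_lines(data: bytes, threshold: int = 999) -> list[bytes]:
    """Split data into blocks, each ending at the first new-line found
    on or after the threshold-th byte of the block being formed."""
    blocks = []
    start = 0
    while start < len(data):
        # 0-based index of the threshold-th byte of the current block.
        search_from = start + threshold - 1
        newline_at = data.find(b"\n", search_from)
        if newline_at == -1:
            # Assumption: keep the short remainder as a final block.
            blocks.append(data[start:])
            break
        # Assumption: the new-line stays with the block it terminates.
        blocks.append(data[start:newline_at + 1])
        start = newline_at + 1
    return blocks


def words_per_unit(block: bytes) -> int:
    """Count whitespace-separated tokens in one block."""
    return len(block.split())


if __name__ == "__main__":
    sample = normalize("The quick\tbrown fox\n" * 120).encode("ascii")
    for unit in kilobyte_lines(sample):
        print(len(unit), words_per_unit(unit))
```

Working on bytes rather than characters follows the wording of the rule ("the 999th byte"). Because every boundary falls on a new-line, no block ever ends in the middle of a line of the source text, so block lengths vary with line length but never drop below 999 bytes except for the final remainder.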

2. Details of Data Sets

The 13 text-classification problems that constitute Tbench96 (Text Benchmark
Suite, 1996 edition) form an enhanced version of the test suite used by
Forsyth (1995). They constitute a potentially valuable resource for future
studies in text analysis.

Summary information is given below about the texts used in the benchmark
suite. Note: A policy adhered to throughout was never to split a single work
(article, essay, poem or song) between training and test sets.

Authorship / Prose

FEDS (2 classes): A selection of papers by two Federalist authors, Hamilton
and Madison. This difficult authorship problem -- subject of a
ground-breaking analysis by Mosteller & Wallace (1984 [1964]) -- is
possibly the best candidate for an accepted benchmark in stylometry.

An electronic text of the entire Federalist papers was obtained by anonymous
ftp from Project Gutenberg at GUTNBERG@vmd.cso.uiuc.edu. For checking
purposes the Dent Everyman edition was used (Hamilton et al., 1992 [1788]).
Division into test and training sets was as follows.

Author     Training                    Test
Hamilton   6, 7, 9, 11, 12, 17,        1, 13, 16, 21, 29, 30,
           22, 27, 32, 36, 61,         31, 34, 35, 60, 65, 75,
           67, 68, 69, 73, 76, 81      85
Madison    10, 14, 37-48               49-58, 62, 63

This division implies accepting the view expounded by Martindale &
McKenzie (1995), who state that: "Mosteller and Wallace's conclusion that
Madison wrote the disputed Federalist papers is so firmly established that
we may take it as given."

JOJO (2 classes): Writings by Joseph Smith, the founder of the Mormon
religion, and Joanna Southcott, a religious prophet contemporary with Smith
-- from files kindly donated by Dr David Holmes of UWE Bristol. Southcott's
work was supplied in four files: one from her diaries, two files of
prophetic meditations, and one file of prophetic verse. Smith's three files
were all extracts from his diaries. These texts (and others) have been
analyzed by Holmes (1992).

Authorship / Poetry

EZRA (3 classes): Poems by Ezra Pound, T.S. Eliot and William B. Yeats --
three contemporaries who influenced each other's writings. For example,
Pound is known to have given editorial assistance to Yeats and, famously,
Eliot (Kamm, 1993).

A random selection of poems by Ezra Pound written up to 1926 was taken from
Selected Poems 1908-1969 (Pound, 1977), and entered by hand. It was
supplemented by a random selection of 18 pre-1948 Cantos, obtained from the
Oxford Text Archive. Poems by T.S. Eliot were from Collected Poems 1909-1962
(Eliot, 1963). A random selection of 148 poems by W.B. Yeats was taken from
the Oxford Text Archive. For checking purposes Collected Poems (Yeats, 1961)
was used.

NAMESAKE (2 classes): Poems by Bob Dylan and Dylan Thomas. Songs by Bob Dylan
(born Robert A. Zimmerman) were obtained from Lyrics 1962-1985 (Dylan,
1994). In addition, two tracks from the album Knocked Out Loaded (Dylan,
1988) and the whole A-side of Oh Mercy (Dylan, 1989) were transcribed by
hand and included, to give fuller coverage.

Poems of Dylan Thomas were obtained from Collected Poems 1934-1952 (Thomas,
1952) with four more early works added from The Notebook Poems 1930-1934
(Maud, 1989).

Chronology

ED (2 classes): Poems by Emily Dickinson, early work being written up to 1863
and later work being written after 1863. Emily Dickinson had a great surge
of poetic composition in 1862 and a lesser peak in 1864, after which her
output tailed off gradually. The work included is all of A Choice of Emily
Dickinson's Verse selected by Ted Hughes (Hughes, 1993) as well as a random
selection of 32 other poems from the Complete Poems (edited by T.H. Johnson,
1970).

JP (3 classes): Poems by John Pudney, divided into three classes. The first
category came from Selected Poems (Pudney, 1946) and For Johnny: Poems of
World War II (Pudney, 1976); the second from Spill Out (Pudney, 1967) and
the third from Spandrels (Pudney, 1969). Every distinct poem in these four
books was used.

John Pudney (1909-1977) described his career as follows: "My poetic life has
been a football match. The war poems were the first half. Then an interval
of ten years. Then another go of poetry from 1967 to the present time"
(Pudney, 1976). Here the task is to distinguish his war poems (published
before 1948) from poems in two other volumes, published in 1967 and 1969.

WY (2 classes): Early and late poems of W.B. Yeats. Early work is taken as
that written up to 1914, the start of the First World War, and later work as
that written in or after 1916, the date of the Irish Easter Rising, which had
a profound effect on Yeats's beliefs about poetry.

For these problems the classification objective was to discriminate between
early and late works by the same poet.

Subject-Matter

MAGS (2 classes): This used articles from two academic journals, Literary and
Linguistic Computing (75 articles) and Machine Learning (69 articles). The
task was to classify texts according to which journal they came from. In
fact, each `article' consisted of the Abstract and first paragraph of a
single paper.

NEWS (4 classes): This data-set consists of news stories extracted from the
Associated Press wire service during December 1979. A total of about 250,000
words was obtained from the Oxford Text Archive, where it was deposited by
Dr G. Akers in 1980. Stories in this archive are classified into at least
six mutually exclusive categories. For Tbench96, four of these story types
were extracted: F -- Financial stories; I -- International stories; S --
Sports stories; and W -- Washington stories. The Washington category covers
US domestic politics. For training data, stories up to 15th December were
used. For test data, stories after that date were used.

TROY (2 classes): Electronic versions of the complete texts of Homer's Iliad
and Odyssey, both transliterated into the Roman alphabet in the same manner,
were kindly supplied by Professor Colin Martindale of the University of
Maine at Orono. Traditionally each work is divided into 24 sections or
`books'. For both works the training sample comprises the odd-numbered books
and the test sample consists of the even-numbered books. The classification
task is to tell which work each kilobyte line comes from. (It is possible
that this task is an authorship discrimination as well (Griffin, 1980).)

Miscellaneous

GENDERS (2 classes): Short stories written by first-year undergraduate
students at the University of Maine on the subject: boy meets girl (or vice
versa). These texts were kindly supplied by Professor Colin Martindale of
the Psychology Department of the University of Maine at Orono. These stories
arrived in an arbitrary order. Even-numbered stories were used as training
data, odd-numbered stories as test data. The objective was to distinguish
tales written by males from those written by females.

AUGUSTAN (2 classes): The Augustan Prose Sample donated by Louis T. Milic to
the Oxford Text Archive. For details of the rationale behind this corpus and
its later development, see Milic (1990). This data consists of extracts by
many English authors during the period 1678 to 1725. It is held as a
sequence of records each of which contains a single sentence. Sentence
boundaries identified by Milic were respected.

RASSELAS (2 classes): The complete text of Rasselas by Samuel Johnson,
written in 1759. This was obtained in electronic form from the Oxford Text
Archive. For checking purposes, the Clarendon Press edition was used
(Johnson, 1927 [1759]). This novel consists of 49 chapters. These were
allocated alternately to four different files.

The inclusion of random or quasi-random data may need justification. The
chief objective of doing so here was to provide an opportunity for what
statisticians call overfitting to manifest itself. The author's view is that
some `null' cases should form part of any benchmark suite: as well as
finding what patterns do exist, a good classifier should avoid finding
patterns that don't exist.

Acknowledgements

Thanks are due to Dr David Holmes and Professor Colin Martindale for
providing some of the text files used in this benchmarking suite, as well as
for helpful comments. In addition, the following institutions -- the Oxford
Text Archive, Project Gutenberg, and UWE's Bolland Library -- have also
provided resources without which this collection could not have been
compiled.

References

Baillie, W.M. (1974). Authorship Attribution in Jacobean Dramatic Texts. In J.L. Mitchell (ed.), Computers in the Humanities. Edinburgh Univ. Press.

Dylan, B. (1988). Knocked Out Loaded. Sony Music Entertainment Inc.

Dylan, B. (1989). Oh Mercy. CBS Records Inc.

Dylan, B. (1994). Lyrics 1962-1985. London: Harper Collins Publishers. [Original U.S. edition published 1985.]

Eliot, T.S. (1963). Collected Poems 1909-1962. London: Faber & Faber Limited.

Elliott, W.E.Y. & Valenza, R.J. (1991). A Touchstone for the Bard. Computers & the Humanities, 25, 199-209.

Felton, R. (1996). Personal Communication. [From: Manukau Institute of Technology, Auckland, N.Z.]

Forsyth, R.S. (1995). Stylistic Structures: a Computational Approach to Text Classification. Unpublished Doctoral Thesis, Faculty of Science, University of Nottingham.

Griffin, J. (1980). Homer. Oxford: Oxford University Press.

Hamilton, A., Madison, J. & Jay, J. (1992). The Federalist Papers. Everyman edition, ed. W.R. Brock. London: Dent. [First edition, 1788.]

Holmes, D.I. (1992). A Stylometric Analysis of Mormon Scripture and Related Texts. J. Royal Statistical Society (A), 155(1), 91-120.

Hughes, E.J. (1993). A Choice of Emily Dickinson's Verse. London: Faber & Faber Limited.

Johnson, S. (1927). The History of Rasselas, Prince of Abyssinia. Oxford: Clarendon Press. [First edition 1759.]

Johnson, T.H. (1970). Emily Dickinson: Collected Poems. London: Faber & Faber Limited.

Kamm, A. (1993). Biographical Dictionary of English Literature. Glasgow: HarperCollins.

Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary & Linguistic Computing, 9(3), 235-248.

Makridakis, S. & Wheelwright, S.C. (1989). Forecasting Methods for Managers, fifth edition. New York: John Wiley & Sons.

Martindale, C. & McKenzie, D.P. (1995). On the Utility of Content Analysis in Authorship Attribution: the Federalist. Computers & the Humanities, 29, in press.

Maud, R. (1989). Dylan Thomas: the Notebook Poems 1930-1934. London: J.M. Dent & Sons Limited.

Michie, D., Spiegelhalter, D.J. & Taylor, C.C. (1994). Machine Learning, Neural and Statistical Classification. Chichester: Ellis Horwood.

Milic, L.T. (1990). The Century of Prose Corpus. Literary & Linguistic Computing, 5(3), 203-208.

Mosteller, F. & Wallace, D.L. (1984). Applied Bayesian and Classical Inference: the Case of the Federalist Papers. New York: Springer-Verlag. [Extended edition of: Mosteller & Wallace (1964). Inference and Disputed Authorship: the Federalist. Addison-Wesley, Reading, Massachusetts.]

Murphy, P.M. & Aha, D.W. (1991). UCI Repository of Machine Learning Databases. Dept. Information & Computer Science, University of California at Irvine, CA. [Machine-readable depository: .]

Pound, E.L. (1977). Selected Poems. London: Faber & Faber Limited.

Pudney, J.S. (1946). Selected Poems. London: John Lane The Bodley Head Ltd.

Pudney, J.S. (1967). Spill Out. London: J.M. Dent & Sons Ltd.

Pudney, J.S. (1969). Spandrels. London: J.M. Dent & Sons Ltd.

Pudney, J.S. (1976). For Johnny: Poems of World War II. London: Shepheard-Walwyn.

Simonton, D.K. (1990). Lexical Choices and Aesthetic Success: a Computer Content Analysis of 154 Shakespeare Sonnets. Computers & the Humanities, 24, 251-264.

Thomas, D.M. (1952). Collected Poems 1934-1952. London: J.M. Dent & Sons Ltd.

Ule, L. (1982). Recent Progress in Computer Methods of Authorship Determination. ALLC Bulletin, 10(3), 73-89.

Yeats, W.B. (1961). The Collected Poems of W.B. Yeats. London: Macmillan & Co. Limited.