The State of Authorship Attribution Studies: (1) The
History and the Scope; (2) The Problems -- Towards Credibility and Validity.

Joe Rudman, Carnegie Mellon University (rudman@cmphys.phys.cmu.edu)
David I. Holmes, University of the West of England (david.holmes@csm.uwe.ac.uk)
Fiona J. Tweedie, University of Glasgow, United Kingdom (fiona@stats.gla.ac.uk)
R. Harald Baayen, Max Planck Institute for Psycholinguistics (baayen@mpi.nl)

ACH/ALLC 1997

Editors: the secretarial staff in the Department of French Studies at
Queen's University; Greg Lessard. Encoder: Sara A. Schmidt.

Keywords: authorship attribution, stylistics, statistics

Session Abstract

There are many serious problems with the science of authorship attribution
studies. This session proposes to look at the history of the field, identify
many of the more major problems, and offer some solutions that will go a
long way towards giving the field credibility and validity.

Willard McCarty's recent posting on "Humanist" (Vol. 10, No. 137),
"Communication and Memory," points out one of these problems: "...scholarship
in the field is significantly inhibited, I would argue, by the low degree to
which previous work in humanities computing and current work in related
fields is known and recognized."

A major indication that a field has problems is the lack of consensus
as to correct methodology or technique. Every area of authorship
attribution studies has this problem -- research, experimental set-up,
linguistic methods, statistical methods, and so on.

It seems that for every paper announcing an authorship attribution method
that "works" or a variation of one of these methods, there is a counter
paper pointing out crucial flaws:

- Donald McNeil points out that scientists disagree as to Zipf's law;
- Christian Delcourt raises objections against current practice in co-occurrence analysis;
- Portnoy and Petersen show errors in Radday and Wickmann's use of the correlation coefficient, chi-squared test, and t-test;
- Hilton and Holmes show problems in Morton's QSUM techniques;
- Smith raises many objections against Morton's early methods;
- There is Merriam vs Smith;
- There is Foster vs Elliott and Valenza.

This widespread disagreement has not only kept authorship attribution studies
out of most United States court proceedings, but it also threatens to
undermine even the legitimate studies in the court of public and
professional opinion.

The time has come to sit back, review, digest, and then present a theoretical
framework to guide future authorship attribution studies.

The first paper, by David Holmes, will give the necessary history, scope, and
present direction of authorship attribution studies with particular emphasis
on recent trends.

The second paper, by Harald Baayen and Fiona Tweedie, will focus on one
problem: the use of so-called constants in authorship attribution
questions.

The third paper, by Joseph Rudman, will point out some of the problems that
are keeping authorship attribution studies from being universally accepted
and will offer suggestions on how these problems can be overcome.


Stylometry: Its Origins, Development and Aspirations

David I. Holmes

Introduction

This paper opens the session on stylometry and
aims to review the historical development of stylometry up to
and including its current standing as a statistical tool within
the humanities.

Stylometry - the statistical analysis of literary style -
complements traditional literary scholarship since it offers a
means of capturing the often elusive character of an author's
style by quantifying some of its features. Most stylometric
studies employ items of language and most of these items are
lexically based. A sound exposition of the rationale behind such
studies has been provided by Laan (1995).

The main assumption underlying stylometric studies is that
authors have an unconscious as well as a conscious aspect to
their style. Every author's style is thought to have certain
features that are independent of the author's will, and since
these features cannot be consciously manipulated by the author,
they are considered to provide the most reliable data for a
stylometric study. The two primary applications are
attributional studies and chronological problems, yet a
difference in date or author is not the only possible
explanation for stylistic peculiarities. Variation in style can
be caused by differences of genre or content, and similarity by
literary processes such as imitation.

By measuring and counting stylistic traits, we hope to discover
the 'characteristics' of a particular author. This paper looks
at criteria which may serve as a basis of measurement within the
context of stylometry's origins and historical development.

Word-length and sentence-length

The origins of stylometry may be traced back to the work of
Mendenhall (1887) on word-lengths, and the idea of counting
features of a text was extended by Yule (1938) to include
sentence-lengths. Morton (1965) used sentence-lengths in tests
of authorship of Greek prose, but we now know that neither of
these measures is a wholly reliable indicator of authorship.

Function words

Word-usage offers a great many opportunities for discrimination.
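Such word-usage rates are simple to compute. The sketch below (hypothetical text fragments, and a toy word list rather than the marker words actually selected by Mosteller and Wallace) illustrates the kind of feature vector on which discrimination methods operate:

```python
from collections import Counter
import re

# A small illustrative set of context-free 'function' words; real studies
# choose marker words much more carefully (and use far more of them).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "while", "by"]

def function_word_rates(text):
    """Rate per 1000 tokens of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return {w: 1000 * counts[w] / n for w in FUNCTION_WORDS}

# Hypothetical fragments standing in for samples by two candidate authors.
sample_a = ("The cause of the storm was debated by the crew "
            "while the ship lay in harbour.")
sample_b = ("Upon reflection he resolved to act, and to act at once, "
            "while fortune allowed.")

rates_a = function_word_rates(sample_a)
rates_b = function_word_rates(sample_b)
```

On realistic sample sizes, vectors of such rates are what discriminators (Bayesian methods, principal components analysis, neural networks) take as input.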
Some words vary considerably in their rate of use from one work
to another by the same author; others show remarkable stability
within an author. For discrimination purposes we need
context-free or 'function' words and this paper reviews the
seminal work of Mosteller and Wallace (1964) on function word
frequencies. Morton (1978) developed techniques of studying the
position and immediate context of individual word-occurrences
but his method has come under much criticism, and Smith
(1985) has demonstrated that it cannot reliably distinguish
between the works of Elizabethan and Jacobean playwrights.

The idea of using sets (at least 50 strong) of common
high-frequency words and conducting what is essentially a
principal components analysis on the data has been developed by
Burrows (1987) and represents a landmark in the development of
stylometry. The technique is very much in vogue now as a
reliable stylometric procedure and Holmes and Forsyth (1995)
have successfully applied it to the classic 'Federalist Papers'
problem. Examples of the technique will be displayed.

Vocabulary distributions

One of the fundamental notions in stylometry is the measurement
of what is termed the 'richness' or 'diversity' of an author's
vocabulary. If we sample a text produced by a writer we might
expect the extent of his/her vocabulary to be reflected in the
frequency profile of word-usage. This paper reviews measures
which may be thought of as 'indices of diversity'.

Mathematical models for the frequency distributions of the number
of vocabulary items appearing exactly r times (r = 1, 2, 3, ...) have
aroused the interest of statisticians ever since the work of
Zipf (1932). The best fitting model appears to be that
attributed to Sichel (1975) and this paper will cover the Sichel
model in addition to looking at the behaviour of the
once-occurring words (hapax legomena) and twice-occurring words
(hapax dislegomena) as useful stylometric tools.

Content analysis

Content analysis refers to tabulating the frequency of types of
words in a text, the aim being to reach the denotative or
connotative meaning of the text. Although content analysis
should be useful in stylometry it has seldom been employed, but
this paper will review the successful application of content
analysis to the 'Federalist' problem by Martindale and McKenzie
(1995).

Neural networks

Stylometry is essentially a case of pattern recognition. Neural
networks have the ability to recognise the underlying
organisation of data which is of vital importance for any
pattern recognition problem, so their application in stylometry
is both inevitable and welcome. The results achieved by Merriam
and Matthews (1994) and by Lowe and Matthews (1995) will be
discussed.

The future

As the number of available computer-readable literary texts
continues to increase, we can expect expansion in the use of
automated pattern recognition techniques, such as neural
networks, to act as 'assistants' to help in the resolution of
outstanding authorship disputes. Automated feature finders will
be developed (Forsyth and Holmes, 1996) to let the computer take
over the task of finding the features that best discriminate
between two candidate authors for a disputed text. There will be
theoretical advances too, as in the change from lexically based
techniques to syntactic annotation proposed by Baayen, Van
Halteren and Tweedie (1996).

Stylometry, though, presents no threat to traditional
scholarship. In the context of authorship attribution,
stylometric evidence must be weighed in the balance along with
that provided by more conventional studies made by literary
scholars.

References

Baayen, H., Van Halteren, H. and Tweedie, F. J. (1996). Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. Literary and Linguistic Computing, 11, 121-131.

Burrows, J. F. (1987). Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style. Literary and Linguistic Computing, 2, 61-70.

Forsyth, R. S. and Holmes, D. I. (1996). Feature-Finding for Text Classification. Literary and Linguistic Computing, 11(4).

Holmes, D. I. and Forsyth, R. S. (1995). The 'Federalist' Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10, 111-127.

Laan, N. M. (1995). Stylometry and Method. The Case of Euripides. Literary and Linguistic Computing, 10, 271-278.

Lowe, D. and Matthews, R. (1995). Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions. Computers and the Humanities, 29, 449-461.

Martindale, C. and McKenzie, D. (1995). On the Utility of Content Analysis in Author Attribution: The 'Federalist'. Computers and the Humanities, 29, 259-270.

Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, IX, 237-249.

Merriam, T. and Matthews, R. (1994). Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe. Literary and Linguistic Computing, 9, 1-6.

Morton, A. Q. (1965). The Authorship of Greek Prose. Journal of the Royal Statistical Society (A), 128, 169-233.

Morton, A. Q. (1978). Literary Detection. New York: Scribners.

Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Reading: Addison-Wesley.

Sichel, H. S. (1975). On a Distribution Law for Word Frequencies. Journal of the American Statistical Association, 70, 542-547.

Smith, M. W. A. (1985). An Investigation of Morton's Method to Distinguish Elizabethan Playwrights. Computers and the Humanities, 19, 3-21.

Yule, G. U. (1938). On Sentence Length as a Statistical Characteristic of Style in Prose with Application to Two Cases of Disputed Authorship. Biometrika, 30, 363-390.

Zipf, G. K. (1932). Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press.


Lexical `constants' in stylometry and authorship studies

Fiona J. Tweedie and R. Harald Baayen

Introduction

Various measures of lexical richness have been employed in
stylometry and authorship attribution (see, e.g., Holmes, 1994,
for a review). These measures have been advanced as
characteristic constants whose value is not influenced by the
text size. This study investigates in detail to what extent
these measures are truly constant, how well they are suited for
discriminating authors, and to what extent the values assumed by
these measures are influenced by discourse structure (see
Baayen, 1996).

Text constants have been developed because the simplest measure
of lexical richness, the vocabulary size V(N), varies with the
number of tokens in the text, N. In order to remove this
dependency, constants have been proposed that are supposed to be
independent of N. These range from the simple type-token ratio
to more complex measures such as Orlov's Zipf size (Orlov,
1983).

Combinations of these constants have been used to investigate
problems of authorship (see for example Holmes, 1992 and Baayen
et al., 1996). The latter discriminated between two authors at lexical
and syntactic levels using analyses of function words and
lexical richness. They found that function words performed
better than the constants at both levels, and that the inclusion
of syntactic information improved the discrimination. Baayen et
al. (1996) concluded that considering the text at a more
abstract level, using the reduced variability of the syntactic
vocabulary, increases the efficacy of the techniques.
Nevertheless, the constants also tap into stylistic properties
of texts at a fairly abstract level. In order to properly
evaluate the discriminatory potential of the text constants, we
must clarify whether and how effectively they capture
similarities and differences between authors, and to what extent
they are truly constant.

Validity - Are the Constants Constant?

Figure 1 shows plots of four constants for Carroll's Alice in
Wonderland that illustrate the
main patterns in our survey of 15 measures of lexical richness.
Measurements have been taken at 20 equally-spaced points in the
text. The first panel shows Brunet's W to be an increasing
function of the text size. The second plot, that of Honore's H,
initially appears unstable, but becomes less variable above N=13,000.
The lower plots show Sichel's S and Yule's K; S is quite
variable while K descends sharply from its initial value, then
rises with N.

Figure 1

To evaluate the extent to which violation of the randomness
assumption is responsible for the observed variability in the
values of the `constants', the order of words in the texts was
completely randomised and the measurements retaken. One hundred
such randomisations were carried out. The means of the
randomisations are shown as points, the maxima and minima by +
and - respectively. Crucially, only the mean values for K
indicate that its value is theoretically truly constant for
randomised text; those for W and H increase and decrease with
text size, while S rises then decreases. It is clear from these
graphs, both of the actual and randomised texts, that far from
being stable, the constants are as variable as V, the variable
that they were intended to replace.In sum, with the exception of K and possibly the Zipf size,
constants are not constant in theory, and, without exception,
none are constant in practice. The empirical values of the
constants are co-determined by the way in which the randomness
assumption is violated in running text, namely by coherence in
lexical use at the discourse level (see Baayen, 1996).

Developmental Profiles

Thus far we have considered a single text. It is possible that
the variability we have observed is very small when compared to
other texts and that discrimination is still possible between
authors. In order to investigate this we analysed a total of
fifteen texts, detailed in Table 1.

The resulting graphs for W, H, S and K are shown in Figure 2.
Examining the first plot, it can be seen that, while the value
of W varies with N, texts by the same author vary in the same
way; the Carroll texts are coincident, as are the James texts
and two of the Conan Doyle texts. It is also clear that this is
not necessarily the case; the Baum texts are widely separated,
as is the third Conan Doyle text from the other pair. A similar
structure is found in the graph of H, with slightly different
orderings. Turning to S, however, we find that the constant is
so variable that it is impossible to separate authors, even at
larger text sizes. The plot of K again yields a pattern in which
texts are fairly well separated. The different ordering of the
texts in this graph indicates that K is measuring a different
facet of the lexical structure of these texts. The Conan Doyle
texts now group together, as do the Baum texts, but now the
Carroll texts diverge.

Figure 2

We have calculated the values for fifteen lexical richness
constants and found that the resulting profiles could be
classified into four families, exemplified in the graphs above.
The largest family of constants is that to which W belongs.
Honore's H represents a much smaller family. S comprises the
family of constants that are of no discriminatory value. K makes
up a family with D, variables that are theoretically constant
given the urn model of word distribution within text. Some texts
that are separated in the other families are coincident in this
family, others are more divergent.

It is clear from the above that several constants measure the
same facet of the vocabulary structure. Thus, only those
constants with the greatest discriminatory sensitivity within a
given family need to be considered. The developmental profiles
of the constants show sensitivity to authorship, although this
is not absolute in that texts written by the same author may
diverge. We have also developed techniques for evaluating the
statistical significance of patterns of similarity and
dissimilarity in the developmental curves. While the variance of
most constants is not known, so that comparisons on the basis of
constants for full texts remain impressionistic, we can now
evaluate in a more precise way whether or not the developmental
profile of a constant differentiates between texts.

Conclusions

Almost all textual constants in our survey are highly variable,
and assume values that change systematically as the text size is
increased. Some constants are inherently variable, others are
truly constant in theory. All constants are substantially
influenced by the non-random way in which word usage is governed
by discourse cohesion. This variability indicates that the
constants cannot be relied on to compare texts of different
lengths. Crucially, however, the developmental profiles of the
majority of constants have an interesting discriminatory
potential, in that they reveal consistent and interpretable
patterns that pick up author-specific aspects of word use.

For authorship attribution studies, we strongly recommend the use
of the developmental profiles of selected constants, rather than
the isolated values of the constants for complete texts. Our
data show, however, that authors are not `prisoners' of their
own developmental profile. The discourse structure of texts by
the same author can be quite different, and the same holds for
the kind of vocabulary an author exploits for a given text.
Compared to the use of syntax, word use is more easily
influenced by choices which are under the conscious control of
authors. Consequently, the developmental profiles of constants
are less reliable than syntax-based measures for the purpose of
authorship attribution. At the same time, the developmental
profiles capture essential differences in word use and discourse
structure. From this perspective, we would like to defend their
usefulness in the domain of quantitative stylistics.

References

Baayen, R. H. (1996). The Randomness Assumption in Word Frequency Statistics. In Perissinotto, G. (ed.), Research in Humanities Computing 5. Oxford: Oxford University Press, 17-31.

Baayen, R. H., van Halteren, H. and Tweedie, F. J. (1996). Outside the Cave of Shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-131.

Holmes, D. I. (1992). A Stylometric Analysis of Mormon Scripture and Related Texts. Journal of the Royal Statistical Society Series A, 155(1), 91-120.

Holmes, D. I. (1994). Authorship Attribution. Computers and the Humanities, 28(2), 87-106.

Orlov, J. K. (1983). Ein Modell der Häufigkeitsstruktur des Vokabulars. In Guiter, H. and Arapov, M. (eds.), Studies in Zipf's Law. Bochum: Brockmeyer, 154-233.

Table 1: Texts used in this study

Author            Title                                                 Key
Baum, L. F.       The Wonderful Wizard of Oz                            b1
                  Tip Manufactures a Pumpkinhead                        b2
Carroll, L.       Alice's Adventures in Wonderland                      a1
                  Through the Looking-glass and what Alice found there  a2
Conan Doyle, A.   The Hound of the Baskervilles                         c1
                  The Valley of Fear                                    c2
                  The Sign of Four                                      c3
James             Confidence                                            j1
                  The Europeans                                         j2
St Luke           Gospel according to St Luke (KJV)                     L1
                  Acts of the Apostles (KJV)                            L2
London, J.        The Sea Wolf                                          l1
                  The Call of the Wild                                  l2
Wells, H. G.      The War of the Worlds                                 w1
                  The Invisible Man                                     w2


The State of Authorship Attribution Studies: Problems and Solutions

Joseph Rudman

Introduction:

There are major problems in the science of "non-traditional"
authorship attribution studies (those using statistics and the
computer). This paper will show that the problems exist, will
list and explain some of the more major problems, and will offer
some suggestions on how these problems can be resolved.

Problems exist:

Non-traditional authorship attribution research has had enough
time and effort -- well over 300 studies and 30 years -- to pass
through the "shake-down" phase and enter one marked by steady,
solid, and scientific studies that force a consensus among its
practitioners.A major indication that there are problems in a field is when
there is no consensus as to correct methodology or technique.
Every area of authorship attribution studies has this problem --
e.g. research, experimental set-up, linguistic methods,
statistical methods.It seems that for every paper announcing an authorship
attribution method that "works" or a variation of one of these
methods, there is a counter paper pointing out crucial flaws,
e.g.:

- Donald McNeil points out that scientists disagree as to Zipf's law;
- Christian Delcourt raises objections against current practice in co-occurrence analysis;
- Portnoy and Petersen show errors in Radday and Wickmann's use of the correlation coefficient, chi-squared test, and t-test;
- Hilton and Holmes show problems in Morton's QSUM techniques;
- Smith raises many objections against Morton's early methods;
- There is Merriam vs Smith;
- There is Foster vs Elliott and Valenza.

This widespread disagreement has not only kept authorship
attribution studies out of most United States court proceedings,
but it also threatens to undermine even the legitimate studies
in the court of public and professional opinion.

Most authorship attribution studies have been governed by
expediency, e.g.:

- The text is not what should be used, but it was available.
- This is not how the data should have been treated, but the packaged program didn't do exactly what was needed.
- The control data isn't complete, but it would have taken too long to input the correct data.

There is a lack of experimental memory. Researchers working in
the same "area" of authorship attribution fail to cite and make
use of pertinent previous efforts. Willard McCarty's recent
posting on "Humanist" (Vol. 10, No. 137) "Communication and
Memory" points this out: "...scholarship in the field is
significantly inhibited, I would argue, by the low degree to
which previous work in humanities computing and current work in
related fields is known and recognized."

The problems with the use of statistics in many authorship
attribution studies are many and varied. Too many researchers are led
into the swampy quicksand of statistical studies by the ignis
fatuus of a "more sophisticated statistical technique".

Problems and suggested solutions:

The "umbrella" problem is that most non-traditional authorship
attribution researchers do not understand what constitutes a
valid study. They do not understand that it is a scientific
experiment and must be approached and carried out as such.

The corrections for many of the specific problems become apparent
once the problem is pointed out and there is a consensus that
there is a problem. This paper will expand upon, expound, and
give examples from published studies of the following problems.
The paper will also give the detailed solutions for these
problems. One of the "solutions" will be the dissemination of a
bibliography of over 500 entries.

Problem 1: Not really knowing the field of the questioned work
(e.g. does someone trained in physics know enough about Plato and
all that is involved with the study of the classics to do a valid
authorship attribution study of a questioned Plato work?). Not
knowing the sub-disciplines of authorship attribution studies
(e.g. linguistics, statistics, stylistics, computer science).

Problem 2: Not doing the necessary research for each step of the
study. (The steps will be shown.) Not doing a traditional
authorship attribution study.

Problem 3: Not knowing when the flaws in the experimental set-up
are fatal and, therefore, not realizing that the study should not
be done.

Problem 4: Taking shortcuts and making unverified assumptions
with the experimental set-up, the data, and the statistical tests
(e.g. poor or wrong controls, "cherry picking").

Problem 5: Ad hominem attacks and self-serving critiques.