Back to the Cave of Shadows: Stylistic Fingerprints in Authorship Attribution

R. Harald Baayen, University of Nijmegen, The Netherlands
Fiona J. Tweedie, University of Glasgow, UK
Anneke Neijt, University of Nijmegen, The Netherlands
Hans van Halteren, University of Nijmegen, The Netherlands
Loes Krebbers, Max Planck Institute for Psycholinguistics, The Netherlands

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000. Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay and Margaret Scott. Encoder: Sara A. Schmidt. Keywords: Stylistics.

Introduction

Attempts to assign authorship of texts have a long history. They have been
applied to influential texts such as the Bible, the works of Shakespeare and
the Federalist Papers. A wide variety of techniques from many disciplines
have been considered, from multivariate statistical analysis to neural
networks and machine learning. Many different facets of texts have been
analysed, from sentence and word length to the most common or the rarest
words, or linguistic features. Holmes (1998) provides a chronological review
of methods used in the pursuit of the authorial "fingerprint".

A key issue raised at the panel on non-traditional authorship attribution
studies at the ACH-ALLC conference in Virginia, 1999, by Joe Rudman is
whether authorial "fingerprints" do in fact exist. Is it truly the case that
any two authors can always be distinguished on the basis of their style, so
that stylometry can provide unique stylistic fingerprints for any author,
given sufficient data?

Despite the long history of authorship attribution, almost all stylometric
studies have been carried out on the assumption that stylometric
fingerprinting is possible. However, control texts are often inappropriately
chosen or not available at all. In addition, the imposition of editorial or
publisher's style can distort the original words of the author. To our
knowledge, no one has yet carried out a strictly controlled experiment of
authorship attribution, with texts of known authorship being analysed
between and within genres as well as between and within authors.

In this abstract we present such an experiment. The next section describes
the design of the experiment. This is followed by a description of the
analysis carried out, then by the results and our conclusions.

Experimental Design

The experiment was carried out in Dutch. Eight students of Dutch literature
at the University of Nijmegen participated in the study. All the students
were native speakers of Dutch, four were in their first year of study, and
four were in their fourth year. The students were asked to write texts of
around 1000 words.

Each student wrote in three genres: fiction, argument and description. Three
texts were written in each genre, on the following topics.

Fiction: a retelling of the fairy tale of Little Red Riding-Hood, a
detective story about a murder in the university, and a romance of chivalry.

Argument: defending a position about the television program 'Big Brother',
the unification of Europe, and smoking.

Descriptive: football, the upcoming new millennium, and a book-review of
the book read most recently by the participant.

The order of writing the texts was randomised so that practice effects were
reduced as much as possible. We thus have nine texts from each participant,
making a total of seventy-two texts in the analysis. The main question is
whether it will be possible to group texts by their authors using the
state-of-the-art methods of stylometry. A positive answer would support the
hypothesis that stylistic fingerprints exist, even for authors with a very
similar background and training. A negative answer would argue against the
hypothesis that each author has her/his unique stylistic fingerprint.

Analysis

Many methods have been proposed for the analysis of texts in the attempt to
identify authorship. In this abstract we describe three, and a fourth will
be described at the conference. The first is that proposed by Burrows in a
series of papers, see e.g. Burrows (1992), and used by many practitioners.
Here we consider the frequencies of the forty most common words in the text.
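As a minimal illustration of this first step, the per-text relative frequencies of the most common words can be computed as follows. This is a sketch only, with an invented function name and made-up miniature texts; the authors' own processing used the UNIX utility awk and the R statistics package, not Python.

```python
from collections import Counter
import re

def word_frequencies(texts, n_top=40):
    """Relative frequencies, per text, of the n_top words that are
    most common across the whole collection."""
    tokenised = [re.findall(r"\w+", t.lower()) for t in texts]
    # Find the n_top most common words over all texts together.
    overall = Counter()
    for tokens in tokenised:
        overall.update(tokens)
    top_words = [w for w, _ in overall.most_common(n_top)]
    # One row per text: frequency of each top word relative to text length.
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[w] / len(tokens) for w in top_words])
    return top_words, rows

# Two invented mini-texts for illustration only.
texts = ["the wolf saw the girl in the wood",
         "the girl saw the wolf and ran"]
words, rows = word_frequencies(texts, n_top=3)
```

Each row, one per text, is then a point in a 40-dimensional frequency space.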
Principal components analysis is used to identify the most important aspects
of the data.

The second method considered is that of letter frequency. Work by Ledger and
Merriam indicates that the frequencies of letters used in texts may be
indicators of authorship. We use the standardised frequencies of the 26
letters of the alphabet, with capital and lower-case letters being treated
together. As above, the standardised frequencies are analysed using
principal components analysis.

Thirdly, we consider methods of vocabulary richness. Tweedie and Baayen
(1998) show that Orlov's Z and Yule's K represent two separate families of
measures, measuring richness and repeat rate respectively. Plots of Z and K
can be examined for structure.

Finally, we plan to tag the texts and to annotate them for
constituent structure. Baayen et al. (1996) show that increased accuracy in
authorship attribution can be obtained by considering the syntactic, rather
than lexical vocabulary. The results from this part of the analysis will be
presented at the conference.

The texts written for this study are available from the authors upon
request and, once all annotation has been completed, will be made available
on the Web as well.

Results

Each student was asked to write around 1000 words in each text. In fact, the
average text length is 908 words. The shortest text has 628 words and the
longest 1342. The texts were processed using the UNIX utility awk and the R
statistics package.

We first consider all of the texts together. The Burrows analysis of the most
common function words shows no authorial structure. Genre appears to be the
most important factor, with fiction texts having negative scores on the
first principal component, while argumentative and descriptive texts have
positive scores on this axis. In addition, argumentative texts tend to have
higher values on the second principal component than descriptive texts. It
appears that fiction texts are more similar to other fiction texts than they
are to other texts by the same author. Analysis of letter frequencies gives
similar results, while the measures of vocabulary richness show some
indication of structure with respect to the education level of the writer.
Those in their first year of studies appear to have lower values of K, and
hence a lower repeat rate. In addition, higher values of Z are concentrated
among the first-year students, indicating a greater richness of vocabulary.
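For concreteness, Yule's K can be computed from a list of tokens with the standard formula K = 10,000 × (Σ i²·V(i) − N) / N², where N is the number of tokens and V(i) the number of word types occurring exactly i times. The sketch below uses made-up example sentences and is not the authors' awk/R code.

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K: a repeat-rate measure, roughly constant in text length;
    higher K means more vocabulary repetition."""
    n = len(tokens)
    # V(i): number of word types that occur exactly i times.
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    return 10_000 * (s2 - n) / (n * n)

# Invented examples: a repetitive sentence versus one with no repeats.
repetitive = "the cat saw the cat and the cat ran".split()
varied = "a quick brown fox jumps over one lazy sleeping dog".split()
assert yules_k(repetitive) > yules_k(varied)
```

Orlov's Z is not sketched here, as it involves numerically fitting a parameter of Orlov's generalisation of the Zipfian frequency distribution rather than a closed-form count.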
When all of these measures are incorporated into a single principal
components analysis the genre structure becomes even clearer. Fiction texts
are found to the lower left of a plot of the first and second principal
component scores, while the other genres are found in the upper right of the
graph.

Given the structure evident in the principal components analysis, it seems
sensible to split the texts by genre and consider each separately. In each
case, within fiction, argumentative, and descriptive texts, again the
education level is the only factor to be apparent.

Conclusions

It is clear from the results described above that, in this study,
differences in genre override differences in education level and authorship.
The absence of any authorial structure in the analyses shows that it is not
the case that each author necessarily has her/his own stylometric
fingerprint. Texts can differ in style while originating from the same
author (Baayen et al., 1996; Tweedie and Baayen, 1998), and texts can have
very similar stylometric properties while being from different authors. Of
course, it is possible that larger numbers of texts from our participants
might have made it possible to discern authorial structure more clearly.
Similarly, it may also be that more fine-grained methods than we have used
will prove sensitive enough to consistently cluster texts by author even for
the small number of texts in our study. We offer, therefore, our texts to
the research community as a methodological challenge. Given what we have
seen thus far, we believe our results should alert practitioners of authorship
attribution to the need for extreme care when choosing control texts and drawing
conclusions from their analyses.

References

Baayen, R. H., van Halteren, H. and Tweedie, F. J. (1996). Outside the Cave of Shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-131.

Burrows, J. F. (1992). Not Unless You Ask Nicely: The Interpretative Nexus between Analysis and Information. Literary and Linguistic Computing, 7(2), 91-109.

Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111-117.

Ledger, G. and Merriam, T. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary and Linguistic Computing, 9(3), 235-248.

Tweedie, F. J. and Baayen, R. H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323-352.