A Comparison of Methods for the Attribution of Authorship of Popular Fiction

Fiona J. Tweedie, University of Glasgow, UK
Lisa Lena Opas-Hänninen, University of Joensuu, Finland

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt

Introduction

In this poster we present the stylistic analysis of a number of popular
fiction genres. Popular fiction generally receives less academic attention
than literature, but its ability to draw the reader into the story is
noteworthy. In previous work (Opas and Tweedie, 1999a, 1999b) we have
examined measures of stance in an attempt to quantify this degree of reader
involvement. In this poster we turn to measures used to discriminate between
authors, in order to find consistent differences between genres and authors.

Textual Sources

We have taken texts from three distinct sources: romance novels, detective
novels and American short stories. Our total corpus is 590,000 words. We
have analysed romance novels published between 1990 and 1996 from the Harlequin
Presents and Regency Romance series. We have also analysed Danielle Steel's
works, which are classified as women's fiction or 'cross-overs'. The romance
texts make up 245,000 words. The detective fiction part of our corpus
consists of novels by popular contemporary female authors published in the
1990s: Cornwell, Grafton, James, Leon, Peters, and Rendell. Where an author
has created several detectives, we chose the best-known one to represent that
author. Some of the detectives are male and others female, and we expect them
to express stance differently. These texts make up 295,000 words. Short
stories were also taken from the works of Carver and Barth. These make up
almost 50,000 words.

Methods

We will compare and contrast the results from three analyses of these texts.
The analyses are based on methods used in determining authorship: the
frequency of the most common words, letter frequency and measures of
vocabulary richness. The data from each of these procedures is then used in
a principal components analysis in order to identify the most important
elements.

1) Word frequencies

The use of principal
components analysis of the most common words to determine authorship was
proposed by Burrows (1987) and has become an essential tool for
stylistic analysis. Here, the forty most frequently occurring words were
employed. Their frequencies were measured and standardised for text
length. A principal components analysis was then carried out and the
texts plotted in the principal components space. The first two principal
components accounted for 32.2% of the total variation. The first
principal component separates the romantic Steel texts, with high
negative scores, from the American short stories which have high
positive scores. Detective stories by Sue Grafton and Patricia Cornwell
also have high positive scores on this axis. The second principal
component appears to act as a rough genre separator; romantic texts tend
to have positive scores and all of the detective novels have negative
scores. Consideration of a loadings plot indicates that the Steel texts
use a high proportion of "she", "her" and "they", while the short
stories use "at", "said" and "on". The Grafton and Cornwell texts are
written in the first person and this is highlighted by their use of
"me", "my" and "I".

2) Letter frequencies

Ledger and Merriam (1994) use
letter frequencies in their analysis of Shakespearean texts with
remarkable success. Here we consider the relative frequencies of 'A' -
'Z', with capital and lower-case letters amalgamated. These 26 variables
are then subjected to principal components analysis. The results are
plotted in the first two dimensions of the principal components space,
which account for 34.5% of the total variation. In this analysis the
separation is not as good as in the word frequency analysis. The American
texts tend to have high negative scores on the first principal
component, while the texts by P. D. James have very high positive
scores.

3) Measures of Vocabulary Richness

A great
number of measures of vocabulary richness have been proposed. Tweedie
and Baayen (1998) carry out a review of these measures and find that
two, K and Z, contain the vast majority of the information from the
author's vocabulary. Yule's K measures the 'repeat-rate' used by the
author, while Orlov's Z measures vocabulary richness in the sense of the
number of different words used. We therefore plot the texts in the K-Z
plane. As might be expected, the American short stories are found to
have a low repeat rate and high vocabulary richness. The Steel texts
have a higher richness than the other romantic texts, but the detective
and romantic fiction texts are not separated by this analysis.

Conclusions

These three
romantic and detective fiction. The genres are most clearly separated when
the most common words are used as data, while the letter frequency analysis
is, not surprisingly, more affected by the particular names of heroes or
heroines. The measures of vocabulary richness distinguish clearly between
the more popular texts and the short stories. At the conference we shall
also present the analysis of markers of stance, used in Opas and Tweedie
(1999a, 1999b).

References

Burrows, J. F. (1987). Word patterns and story shapes: the statistical
analysis of narrative style. Literary & Linguistic Computing, 2(2): 61-70.

Ledger, G. and Merriam, T. (1994). Shakespeare, Fletcher and the Two Noble
Kinsmen. Literary & Linguistic Computing, 9(3): 235-248.

Opas, L. L. and Tweedie, F. J. (1999a). The Magic Carpet Ride: Reader
Involvement in Romantic Fiction. Literary & Linguistic Computing, 14(1):
89-101.

Opas, L. L. and Tweedie, F. J. (1999b). Come into my World: Styles of Stance
in Detective and Romantic Fiction. Abstracts of the ALLC/ACH conference
1999, Virginia, 247.

Tweedie, F. J. and Baayen, R. H. (1998). How Variable May a Constant Be?
Measures of Lexical Richness in Perspective. Computers and the Humanities,
32(5): 323-352.
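
Illustrative sketch

The measurements described under Methods can be illustrated with a short Python sketch. It is not the authors' implementation. The tokenisation rule and all function names (tokenize, top_words, word_profile, letter_profile, yules_k) are assumptions made for illustration only. The sketch computes length-standardised frequencies of the most common words, relative letter frequencies with upper and lower case amalgamated, and Yule's K in its usual form, K = 10^4 (sum_i i^2 V(i) - N) / N^2, where V(i) is the number of word types occurring i times and N is the number of tokens. Orlov's Z and the principal components step are omitted.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case word tokens (assumed rule: letters,
    optionally followed by an apostrophe and more letters)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_words(texts, n=40):
    """Return the n most frequent words in the pooled texts."""
    pooled = Counter()
    for text in texts:
        pooled.update(tokenize(text))
    return [word for word, _ in pooled.most_common(n)]


def word_profile(text, vocabulary):
    """Relative frequency of each vocabulary word, i.e. the count
    divided by the text length in tokens. Assumes a non-empty text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def letter_profile(text):
    """Relative frequencies of 'a' to 'z', with capital and lower-case
    letters amalgamated. Assumes the text contains at least one letter."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    return [counts[chr(ord("a") + i)] / len(letters) for i in range(26)]


def yules_k(text):
    """Yule's K: 10^4 * (sum_i i^2 * V(i) - N) / N^2."""
    tokens = tokenize(text)
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # i -> number of types seen i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


if __name__ == "__main__":
    texts = ["She said that she would go.", "I took my coat and I left."]
    vocabulary = top_words(texts, n=5)
    for text in texts:
        print(word_profile(text, vocabulary), yules_k(text))
```

Each text then yields one row of word frequencies (or 26 letter frequencies), and the rows for all texts could be passed to any standard principal components routine. A text with no repeated words has K = 0, and K grows as words are repeated more often.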