The Style-Marker Mapping Project: a Rationale and Progress Report

Joseph Rudman
Carnegie Mellon University, USA

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt

Introduction

This paper explicates the what, why, and how of a substantially completed but
ongoing project to identify and categorize all style-markers in written
English that are quantifiable (e.g. type/token ratios, word length
distributions, word length correlations, hapax legomena).

Section I treats the what - defining the project - and then addresses the why
- the rationale for the project. Section II outlines the how. Section III
gives a status report of the project.

Although the final mapping will have value in various disciplines (e.g.
stylistics, corpus linguistics, computational linguistics, and computer
science), the impetus for the project is from non-traditional authorship
attribution studies. Non-traditional attribution practitioners define style
in the seemingly narrow framework of only those stylistic traits that are
quantifiable.

The main hypothesis behind this project (and all non-traditional authorship
attribution studies) is that every author has a verifiably unique style. If
we look at style as an organism, style-markers are its genetic material -
making this project analogous to the human genome project. This analogy is
somewhat of a stretch because each style-marker is analyzed by an
independent study, whereas all of the loci of an autoradiogram are obtained
in one scientific analysis.

The identification of a quantifiable style-marker does not necessarily mean
that that particular style-marker should be included in an authorship study
(e.g. the orthography might be dictated by an editor or typesetter).

Section I

This section treats the what - defining the project - and then moves into the
why of the project.

A short overview of what is in this section follows:

The style-marker mapping project is a study to identify every style-marker in
written English that can be quantified. The project began in a preliminary
fashion in 1983 when I started recording the various style-markers that were
used in non-traditional authorship attribution studies so that I could use
them in my studies of the canon of Daniel Defoe. The project continued in
this vein (along with a few attempts on my part to come up with "new"
style-markers) until five years ago when I realized the importance of
identifying all of the quantifiable style-markers.

There is no one style-marker, or even a combination of several style-markers,
that has proven to be a definitive discriminator in all non-traditional
authorship studies. What works in one case often does not work in others.
Word length distributions and sentence length distributions are two examples
of style-markers that seemingly work in some cases but not in others. The
idea is to look at style as a combination of all of the quantifiable
style-markers and then to do the analysis as if each style-marker were a
locus in the autoradiogram (see RUDMAN for a more detailed explanation).

Another reason for using all of the quantifiable style-markers is to
eliminate any charges of statistical cherry-picking.

Section II

This section treats the how. References to all of the literature and
techniques will appear in the final report. For example, HOLMES, DELCOURT,
and ELLIOTT AND VALENZA are three of the references under non-traditional
authorship studies.

1) Search the literature, e.g.:
   Stylistics
   Grammar
   Non-traditional Authorship Attribution Studies
   Linguistics
      Computational
      Corpus
      Text
   Rhetoric
   Discourse Analysis

2) Query the practitioners in all of the above fields, e.g.:
   Professor Erwin R. Steinberg of Carnegie Mellon for stylistics
   Professor Paul G. Hopper of Carnegie Mellon for grammar
   All of the active practitioners in non-traditional authorship studies

3) Establish a clearinghouse on a web page that allows anyone to query
   the up-to-date mapping and allows anyone to suggest "new" quantifiable
   style-markers that would be added by the curator. This will lead to a
   continually updated list. Negotiations are under way to make this site
   an extension of the Carnegie Mellon University English Web Site.

4) Use various strategies to identify new style-markers. This is where
   the innovative work supplements the drudge work, e.g.:
   Neural networks
   Pattern searching programs
   Brainstorming sessions

Section III

This section gives a status report of the project and reports a timeline for
its "completion." References will be given for all of the style-markers in
the final report, e.g. MOSTELLER AND WALLACE is one of the function word
references. Only a few representative examples of each section are listed
for this abstract.

1) META-WORD
   Paragraph
   Sentence
   Phrases
   Clauses

2) WORD
   Part of speech
   Ratios
   Positions
   Function Words
   Most frequent words

3) SUB-WORD
   Syllables
   Letters
   Phonemes
   Morphemes

4) OTHER
   Punctuation
   Imagery
   Rhetorical Devices
      Zeugma
      Chiasmus

Conclusion

The questions "Can this project ever be completed?" and "Is the number of
style-markers infinite?" are addressed.

The success of this project will not solve all of the problems of
non-traditional authorship studies. This project does not address the
problems that gender, genre, time constraints, or conscious vs. unconscious
style bring to the table. Nor does it treat the problem of lemmatization.

Identifying the style-markers addresses only a small part of the overall problems
with non-traditional authorship attribution studies. The statistics that
should be used in any study for each of these style-markers and the
statistics for combining all of the markers into a "final" answer are the
subject of another ongoing project.

Bibliography

Delcourt, Christian (1994). Stylometry. ALLC-ACH 1994, École Normale
Supérieure de Lettres et Sciences Humaines, Paris, 14-15.

Elliott, Ward E. Y. and Valenza, Robert J. (1996). And Then There Were None:
Winnowing the Shakespeare Claimants. Computers and the Humanities (CHum),
30(3): 191-245.

Holmes, David I. (1985). The Analysis of Literary Style - A Review. Journal
of the Royal Statistical Society, Series A, 148, Part 4: 328-341.

Mosteller, Frederick and Wallace, David L. (1984). Applied Bayesian and
Classical Inference: The Case of The "Federalist Papers." 2nd Edition. New
York: Springer-Verlag.

Rudman, Joseph (1997). The State of Authorship Attribution Studies: Some
Problems and Solutions. Computers and the Humanities (CHum), 31(4): 351-365.
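
As an illustration of what "quantifiable" means for a few of the style-markers named in the Introduction (type/token ratio, word length distribution, hapax legomena), the following is a minimal sketch in Python. It is not part of the project's own tooling. The function names, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for this example; a real study would need to settle tokenization, spelling normalization, and lemmatization first.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption for this sketch): lowercase the text
    and keep runs of letters, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def style_markers(text):
    """Compute a few simple quantifiable style-markers for one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)            # word type -> frequency
    n_tokens = len(tokens)
    n_types = len(counts)

    # Hapax legomena: word types that occur exactly once in the text.
    hapax = sorted(word for word, c in counts.items() if c == 1)

    # Word length distribution: number of tokens of each length.
    lengths = Counter(len(word) for word in tokens)

    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_legomena": hapax,
        "word_length_distribution": dict(sorted(lengths.items())),
    }


if __name__ == "__main__":
    sample = "The cat sat on the mat and the dog sat on the log"
    for name, value in style_markers(sample).items():
        print(name, value)
```

Each value returned here would correspond to one entry (or one family of entries) in the mapping; the sketch says nothing about which markers discriminate between authors or how they should be combined statistically.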