On Determining a Valid Text for Non-Traditional
Authorship Attribution Studies: Editing, Unediting, and De-Editing

Joseph Rudman
Carnegie Mellon
jr20@andrew.cmu.edu

ACH/ALLC 2003, University of Georgia, Athens, Georgia
Editors: Eric Rochester and William A. Kretzschmar, Jr.
Encoder: Sara A. Schmidt

INTRODUCTION:

The work’s material history since its
inception, the vast and largely uncharted alterations imposed by
the history and by the mediation of generation upon generation
of printers, editors, publishers—this is a relativism we are
prone to ignore, but ignore at our peril.
(Marcus 1996)

The literary texts often are not homogenous
since they may comprise dialogues, narrative parts, etc. An
integrated approach, therefore, would require the development of
text sampling tools for selecting the parts of the text that
best illustrate an author’s style.
(Stamatatos et al. 2001)

Most non-traditional authorship attribution studies place too much emphasis
on statistics, stylistics, and the computer, and too little on
the integrity and validity of the primary data: the text itself.

It is intuitively obvious, and easily shown empirically, that if you are
conducting a study of the patterns of an author’s stylistic usage (e.g.
Daniel Defoe), the study will be systematically degraded by each
interpolation of non-Defoe text, and even by each interpolation of Defoe text
of a different genre or significantly different time period.

The crux of this paper is one important element in the empirical
methodology of a valid non-traditional authorship attribution study: the
preparation of the text for stylistic and statistical analysis through unediting,
de-editing, and editing.

The general emphasis of this presentation is on prose analysis, with some
peripheral treatment of drama and poetry.

I. BACKGROUND AND DEFINITIONS

A. Why a valid text is necessary should not even be asked. No
valid experiment can be done if the input data is flawed: garbage
in, garbage out!

Too many practitioners simply grab a text
from any available source, without any thought to its pedigree
(e.g. Khmelev and Tweedie’s “Using Markov Chains for
Identification of Writers”).

Are undertakings such as
Project Gutenberg or the Oxford Text Archive, with their easily
available machine-readable texts, a boon or a bane to
non-traditional authorship studies? This question is explored in
some detail.

B. Selecting a starting text

The validity of using texts
from the oral tradition and the scribal tradition is
discussed.

Before any manipulation and analysis of a text is
carried out, a valid starting text must be acquired that
fulfills many necessary requirements. This selection is
primarily bibliographically driven. If a practitioner is not
savvy in the bibliographical arts, a collaborator who is should
be recruited.

Examples of bad starting texts causing
problems are given (e.g. Peng and Hengartner’s “Quantitative
Analysis of Literary Styles”).

If you cannot obtain a valid
text, do not do the study.

C. Unediting: getting back to the
state of “not yet edited”
De-editing: removing selected text
Editing: changing (preparing) a text for
statistical analysis

II. EXPLICATION

The statement, “each age, each author, each study
demands a different mixture of the following particulars,” is discussed.

A. Unediting

As a rule, the closest text to the holograph
should be found and used.

    1. Editorial interpolation
        a. Filled in lacunae
        b. Marginal notation
        c. ‘Changes’ in the text
        d. Critical editions
    2. Printer interpolation

        For the Printer is a beast, and
        understands nothing I can say to him of correcting
        the press.
        Dryden (Ward p. 97)

        a. Catchwords (the first word of the next leaf
           or gathering)
        b. Signatures (combinations of letters and
           numerals used something like catchwords)
        c. Removing obvious typesetting mistakes (a
           slippery slope)
            i. ‘f’ for the long ‘s’
            ii. Double words (e.g. ‘the the’ ‘was
                was’)

B. De-editing

    1. Quotes
        a. Factual, unattributed
        b. Factual, attributed
        c. Self quotes from earlier writings
    2. Plagiarism
        a. Direct copy
        b. Paraphrasing
        c. Imitation
    3. Collaboration
        a. Sectional
        b. Phrasal
        c. Word level
        d. Ghostwriting
    4. Genre
        a. Poetry, prose, drama, letters, etc.
        b. Mixture (e.g. verse drama)
    5. Graphs and Numbers
        a. Tables
        b. Lists
        c. Arabic and Roman numerals
    6. Guide words
        a. Titles, chapter headings, the end word
           ‘Finis’
        b. Marginal annotation
    7. Foreign Languages
        a. Sentence level and greater
        b. Phrase or word level
    8. Translations
        a. Verbatim
        b. Concepts
    9. Examples of items de-edited (or not de-edited)
       incorrectly by practitioners
        a. Biblical quotes
        b. Titles in direct apposition
        c. Numbers that are spelled out
        d. Words with an initial capital

C. Editing

    1. Encoding the text
        a. Why (e.g. homographic forms)
        b. TEI
    2. Regularizing
        a. Spelling
        b. Contracted forms (simple, compound)
        c. Hyphenation
        d. Masked words (e.g. ‘D_ _ _ e’ for ‘Defoe’)
    3. Lemmatizing
        a. Pro
        b. Con

D. Special Problems in Drama and Poetry

    1. Stage directions
    2. The ‘age’ dependency of transmission and technique.
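
A few of the mechanical steps listed above (double words under Unediting, Arabic and Roman numerals under De-editing, hyphenation under Editing) can be sketched in code. The Python fragment below is only a minimal illustration under stated assumptions, not a procedure taken from any study cited here: the function names and regular expressions are invented for this example, the text is assumed to be plain Unicode with line breaks preserved, and every pattern will misfire on some texts, so results should be reviewed by hand.

```python
import re

# Assumption: doubled words are only flagged, never deleted automatically,
# because some repetitions (e.g. "that that") are legitimate usage.
DOUBLE_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

# Assumption: a hyphen at the end of a line marks a word broken by the
# typesetter. True compounds that happen to break at a line end
# (e.g. "non-" / "traditional") would be joined wrongly.
LINE_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")

# Arabic numerals, optionally with internal commas or periods (1,000 or 3.5).
ARABIC = re.compile(r"\b\d+(?:[.,]\d+)*\b")

# Upper-case Roman numerals of two or more letters. Single letters are
# skipped so that the pronoun "I" survives; words such as "MIX" would
# still be removed, which is why the removed items are returned.
ROMAN = re.compile(
    r"\b(?=[MDCLXVI]{2,}\b)"
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b"
)


def find_double_words(text):
    """Return (offset, word) pairs for immediately repeated words."""
    return [(m.start(), m.group(1)) for m in DOUBLE_WORD.finditer(text)]


def rejoin_hyphenation(text):
    """Join words split by a hyphen at the end of a line."""
    return LINE_HYPHEN.sub(r"\1\2", text)


def de_edit_numerals(text):
    """Remove numerals and return (cleaned_text, removed_items).

    The removed items are kept so that they can be disclosed and, if
    wanted, analyzed separately.
    """
    removed = []

    def _drop(match):
        removed.append(match.group(0))
        return ""

    for pattern in (ARABIC, ROMAN):
        text = pattern.sub(_drop, text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text), removed


if __name__ == "__main__":
    print(find_double_words("the the cat was was here"))
    print(rejoin_hyphenation("interpo-\nlation of text"))
    print(de_edit_numerals("See Chapter XIV, page 12 of the 1719 edition."))
```

Returning the removed numerals rather than discarding them is a deliberate choice in this sketch: it leaves a record of what was done to the text and keeps the de-edited material available for separate treatment.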
III. SOME EXAMPLES

Studies that are compromised by mistakes of
commission and/or omission in editing, unediting, or de-editing.

A. Historia Augusta
    1. Twelve individual studies
B. Shakespeare
    1. Elliott and Valenza
    2. Foster
    3. Horton
C. Defoe
    1. Hargevik
    2. Rothman

IV. CONCLUSION

1. Some items that are de-edited are valid style markers in
their own right (e.g. Latin phrases, different genre) and should
be treated as such in a parallel study.
2. No matter which text is selected, the practitioner must
disclose which text was used and everything that was done to
it.
3. The same care must be taken with every text in the
study: the anonymous text, the suspected author’s text, and all
of the control texts.
4. If valid texts cannot be located and correctly edited,
unedited, and de-edited, do not do the study.
5. A valid text does not guarantee a valid study. However, a
non-valid text guarantees a non-valid study.

REFERENCES

Altick, Richard D., and John J. Fenstermaker. The Art of Literary Research (Fourth Edition). New York: W.W. Norton & Company, 1993.

Burrows, John. “Questions of Authorship: Attribution and Beyond. A Lecture Delivered on the Occasion of the Roberto Busa Award.” ACH-ALLC01 Conference, New York University, New York, June 14, 2001.

Elliott, Ward E. Y., and Robert J. Valenza. “So Many Hardballs, So Few Over the Plate: Conclusions From Our ‘Debate’ With Donald Foster.” Computers and the Humanities 36 (2002): 450-460.

Foster, Don. Author Unknown: On the Trail of Anonymous. New York: Henry Holt and Company, 2000.

Goldgar, Bertrand A. “Imitation and Plagiarism: The Lauder Affair and Its Critical Aftermath.” Studies in Literary Imagination 34.1 (2001): 1-16.

Greetham, D.C. Textual Scholarship: An Introduction. New York: Garland, 1992.

Grefenstette, Gregory, and Pasi Tapanainen. “What is a Word, What is a Sentence? Problems of Tokenization.” Proceedings of the 3rd International Conference on Computational Lexicography. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences, 1994.

Hargevik, Stieg. The Disputed Assignment of “Memoirs of an English Officer” to Daniel Defoe (Part I and Part II). Stockholm: Almqvist and Wiksell, 1974.

Holmes, David I., et al. “A Widow and Her Soldier: Stylometry and the American Civil War.” Literary and Linguistic Computing 16.4 (2001): 403-420.

Horton, Thomas B. The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher. Thesis. University of Edinburgh, 1987.

Khmelev, Dmitri V., and Fiona J. Tweedie. “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing 16.3 (2001): 299-307.

Lindey, Alexander. Plagiarism and Originality. New York: Harper and Brothers, 1952.

Marcus, Leah S. “Afterword: Confessions of a Reformed Uneditor.” In Andrew Murphy, ed., The Renaissance Text: Theory, Editing, Textuality. Manchester: Manchester University Press, 2000. 211-216.

Marcus, Leah S. Unediting the Renaissance: Shakespeare, Marlowe, Milton. London: Routledge, 1996.

Novak, Maximillian E. “The Defoe Canon: Attribution and De-attribution.” Huntington Library Quarterly 59.1 (1997): 83-104.

Peng, Roger D., and Nicolas W. Hengartner. “Quantitative Analysis of Literary Styles.” The American Statistician 56.3 (2002): 175-185.

Project Gutenberg. URL:

Rogers, Pat. The Text of Great Britain: Theme and Design in Defoe's ‘Tour’. Cranbury, NJ, 1998.

Rothman, Irving N. “Defoe De-Attributions Scrutinized Under Hargevik Criteria: Applying Stylometrics to the Canon.” Papers of the Bibliographical Society of America 94.3 (2000): 375-398.

Rudman, Joseph. “The State of Authorship Attribution Studies: Some Problems and Solutions.” Computers and the Humanities 31 (1998): 351-365.

Rudman, Joseph. “Non-Traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats.” Literary and Linguistic Computing 13.3 (1998): 151-157.

Slater, Eliot. The Problem of “The Reign of King Edward III”: A Statistical Approach. Cambridge: Cambridge University Press, 1988.

Stamatatos, E., et al. “Computer-Based Authorship Attribution Without Lexical Measures.” Computers and the Humanities 35 (2001): 193-214.

Text Encoding Initiative.

Thorpe, James. Watching the Ps & Qs: Editorial Treatment of Accidentals. Lawrence, Kansas: University of Kansas Printing Service, 1971.

Ward, Charles E. The Letters of John Dryden: With Letters Addressed to Him. Durham, NC: Duke University Press, 1942.

Williams, David S. Stylometric Authorship Studies in Flavius Josephus and Related Literature. Lewiston, New York: The Edwin Mellen Press, 1992.