A Workbook Application for Digital Text Analysis

Worthy N. Martin, University of Virginia, USA
Olga Gurevich, University of Virginia, USA
Thomas B. Horton, Florida Atlantic University, USA
Robert Bingler, University of Virginia, USA

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt
Text Encoding

Introduction

The workbook facility for scholars in the humanities aims to help them
organize the results of their research in a convenient and easily accessible
manner. The recent proliferation of marked-up electronic corpora has
produced a need for tools that would allow structured search and extraction
from texts. However, not all potentially interesting features of a text can
be described in terms of the mark-up hierarchy: some features involve
overlapping elements of markup, others are too fine-grained to be marked up.
Thus we need a markup-independent method of searching, and tools that
combine structural and markup-independent search. The text region-based
approach to processing digital text
resources is the proposed method of searching and extracting both marked-up
and non-marked-up information from a collection of texts. The workbook
facility is a prototype application of this method that allows extraction
and linking of portions of XML-formatted texts. The selection of the regions
can be based either on their structural characteristics or on other
features.

Concepts

A text region (a.k.a. span) is a continuous portion of a document identified
by its start and end offsets. It can be a complete XML element or just a
string of characters. A text region can be created through a variety of
operations described below. In our use of the concept, most XML elements are
assigned a unique identifier within the document. The offsets for text
regions are therefore relative to the nearest preceding ID within the
document.

A text occurrence object (TOO) consists of one or more text
regions as well as notes and a user-defined name. The text regions can come
from one or more documents and do not have to be contiguous.

SGREP
(structured grep) is a command-line search tool. It allows structured
searches on XML and SGML formatted texts and collections of texts, as well
as simple searches. The search results are returned as a set of text regions
that can be organized into a text occurrence object. SGREP allows nested
searches (for example an XML element labeled "verse" containing the word
"Hamlet" within a DIV1 element) as well as unions and intersections of
search expressions.

Workbook Facility

The workbook facility aims to help humanities scholars to organize the
results of their research in a convenient and easily accessible manner. It
provides a way to bookmark and annotate documents without changing the
original texts, and to store and link annotated extractions from texts. The
workbook consists of a set of TOO's, each of which contains one or more text
regions that can originate from different documents. A TOO can thus link
portions of texts from different places in a document or from different
documents. The workbook facility has a built-in XML parser that creates a
DOM structure. We are using IBM's Java-based parser, and the rest of the
workbook is also written in Java. The following operations are available to
the user:

- Select a collection of XML documents to work with (the base collection).
- For each document, view the raw text or the DOM (Document Object Model)
  structure resulting from parsing the XML document. The DOM structure is
  displayed in a tree control.
- Create text occurrence objects in several different ways.
  - Select a complete XML element from the DOM, in which case the TOO will
    consist of a single text region.
  - Select any continuous portion of raw text from any document in the
    collection to create a TOO.
  - Run an SGREP search on one or more documents in the collection. If the
    search is successful, SGREP will return one or more text regions which
    will be put together into a TOO.
- Name and annotate all TOO's in the same manner, regardless of how a
  particular TOO was created. The list of TOO's constitutes the workbook and
  is displayed separately.
- By clicking on any text region within a TOO, view the spot within the
  document from which the text region originated. Thus, it is possible to
  view the larger context of a particular extraction.
- Produce a word distribution list for any particular TOO, and compare word
  distributions between several TOO's. The word distribution list can be
  sorted by relative frequency of words or by alphabetical order, and two
  lists can be viewed side by side.
- Order and re-order the TOO's within the workbook.
- Organize the TOO's into folders with arbitrary depth of nesting, similar to
  how files are organized on a disk.
- Save the workbook (i.e. the individual TOO's along with folders of TOO's)
  and re-open it later. The ability to return to the original documents
  remains after the workbook has been re-opened.
- Save particular TOO's and sets of TOO's as separate XML documents and
  include them in the base collection.
- Run SGREP searches on documents created from parts of the workbook in the
  same way as on the original ones.

Impact

The proposed workbook facility will be useful for several research goals.
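As a concrete reference for the operations listed above, the following minimal Python sketch models a text region, a TOO and a word distribution list. It is an illustration under stated assumptions, not the workbook's actual Java code: the class and function names, the half-open `end` offset, and the `anchor_positions` lookup (mapping a document and element ID to an absolute character position in the raw text) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
import re


@dataclass(frozen=True)
class TextRegion:
    """A continuous span, addressed relative to the nearest preceding element ID."""
    document: str   # key of the source document in the base collection
    anchor_id: str  # ID of the nearest preceding XML element
    start: int      # offset of the first character, counted from the anchor
    end: int        # offset one past the last character (assumed convention)


@dataclass
class TextOccurrenceObject:
    """A named, annotated group of regions; regions may come from several documents."""
    name: str
    notes: str = ""
    regions: list = field(default_factory=list)


def region_text(region, raw_texts, anchor_positions):
    """Recover a region's characters from the unmodified source text."""
    # anchor_positions is a hypothetical index: (document, element ID) -> position
    base = anchor_positions[(region.document, region.anchor_id)]
    return raw_texts[region.document][base + region.start:base + region.end]


def word_distribution(too, raw_texts, anchor_positions):
    """Map each lower-cased word in a TOO to its relative frequency."""
    counts = Counter()
    for region in too.regions:
        text = region_text(region, raw_texts, anchor_positions)
        counts.update(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def sorted_distribution(dist, by="frequency"):
    """Sort by descending relative frequency (ties alphabetical), or alphabetically."""
    if by == "alphabetical":
        return sorted(dist.items())
    return sorted(dist.items(), key=lambda item: (-item[1], item[0]))
```

Because a region stores only a document key, an anchor ID and two offsets, the source texts are never modified, and the original context of any extraction can be recovered by looking the region up again. Two lists returned by `sorted_distribution` can be printed next to each other to compare TOO's.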
Extracting, ordering and naming textual fragments is a convenient way for an
instructor to prepare for a lecture about a particular text. Scholars who
study different versions of the same text (i.e. versions in different
languages or different editions) can use the workbook to link parallel
passages and annotate the resulting TOO. Since the creation of text regions
can be markup-independent, this can be done even if the parallel passages in
two documents are not contained within a single XML element. Moreover,
extracting regions that share particular features can be automated with the
help of SGREP. The word distribution feature of the program is intended to
demonstrate that operations found in software like TACT and similar tools
can be easily integrated with our workbook approach. Once a workbook is
created, it still contains links to the original documents and the history
of how the extractions were made. That is, the process is completely
retraceable, and the user can view the context from which any text region
came.

References

DOM (Document Object Model) standard.

Horton, Thomas B. (1999). A region-based approach for Processing Digital Text
Resources. Digital Resources for the Humanities, King's College, London,
Sept. 12-15, 1999, pp. 47-49.

Jaakkola, Jani, and Kilpeläinen, Pekka. SGREP (structured grep), at the
University of Helsinki, Finland.

XML standard.

XML4J, the XML parser for Java produced by IBM.
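The nested search described under Concepts (a "verse" element containing the word "Hamlet" within a DIV1 element) can be illustrated with a small region-algebra sketch in Python. This is not SGREP's implementation or its query syntax: the function names are hypothetical, regions are assumed to be absolute `(start, end)` character offsets, and the regular expression used to locate elements handles only non-nested elements of a given name.

```python
import re


def element_regions(text, tag):
    """Regions of non-nested <tag>...</tag> elements, as (start, end) pairs."""
    pattern = re.compile(r"<{0}\b[^>]*>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return [match.span() for match in pattern.finditer(text)]


def phrase_regions(text, phrase):
    """Regions of every literal occurrence of a phrase."""
    return [match.span() for match in re.finditer(re.escape(phrase), text)]


def containing(outer, inner):
    """Keep the outer regions that enclose at least one inner region."""
    return [o for o in outer if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]


def within(inner, outer):
    """Keep the inner regions that lie inside at least one outer region."""
    return [i for i in inner if any(o[0] <= i[0] and i[1] <= o[1] for o in outer)]


def union(a, b):
    """All regions from either set, without duplicates, in document order."""
    return sorted(set(a) | set(b))


text = ("<DIV1><verse>Enter Hamlet</verse><verse>Exit Ghost</verse></DIV1>"
        "<DIV2><verse>Hamlet again</verse></DIV2>")
verses = element_regions(text, "verse")
hits = within(containing(verses, phrase_regions(text, "Hamlet")),
              element_regions(text, "DIV1"))
print([text[start:end] for start, end in hits])  # ['<verse>Enter Hamlet</verse>']
```

Each operator takes and returns a list of regions, so search results compose freely and the final list could be stored directly as the regions of a TOO.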