
Solutions for the Delivery of Thematically-Tagged Text

Terry Butler, University of Alberta, Canada

Greg Coulombe, University of Alberta, Canada

Sue Fisher, University of Alberta, Canada

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000

Editors: Jean Anderson, Amal Chatterjee, Christian Kay, Margaret Scott

Encoder: Sara Schmidt

Text Encoding

Introduction

The Orlando Project has developed prototype delivery software which gives end users access to our literary history textbase. Richly-tagged SGML data is automatically converted to XML, and presented to users through a custom application (which runs locally on their machine, and communicates with a back-end XML server). The design of the user interface has been developed through a formal user needs analysis, conducted with a local Pilot Users Group. In the process, we have learned a great deal about how to exploit the richness of a heavily-tagged textbase, and how to present this information selectively to end users (meeting their information requirements without overburdening them with complexity).

The Goals of our Project

The Orlando Project is applying state-of-the-art software technology to traditional fields of study in the humanities. We are writing a literary history of women's writing in Britain, as both a conventional published text and as an SGML tagged textbase. At present (November 1999) we have documents on 850 British women writers and documents on 590 other writers. For each author we have a pair of interdependent documents - a biography and a writing life history. This material is supplemented by 13,600 events, which are discrete dated items providing the further essential and enriching political, social and cultural background to the work. Events vary in their depth of coverage, but are in every case in one way or another related to the literary history which we are writing. Here are three examples of events:

1863: Selective chronology: British Women writers: Florence Nightingale privately printed an anonymous pamphlet, Note on the supposed protection afforded against venereal disease by recognizing and putting it under police regulation. [keyword: law and legislation] [keyword: body/health - venereal disease]

August 1863: Selective chronology: British Women writers: Florence Nightingale corresponded with Harriet Martineau, outlining the case against the Contagious Diseases Acts. (Vicinus 441)

by 1871: Comprehensive chronology: Social climate: The Royal Commission on the Contagious Diseases Acts rejected a suggestion that soldiers and sailors be required to submit to the same regular examinations required of the prostitutes they frequented. The commission believed "there is no comparison to be made between prostitutes and the men who consort with them. With the one sex the offence is committed as a matter of gain; with the other it is an irregular indulgence of a natural impulse." This illustrates the double standard that held women to be sexually unresponsive and men to be prey to strong desire; paradoxically, this belief coexisted with the notion that women were emotional and irrational, while men were more enlightened and controlled.

Delivery Plans

The Orlando Project received SSHRC funding in 1994. Our grant proposal at that time argued that SGML was the only feasible means to capture and encode the complex thematic approach to literary history which the project required. As to the ultimate means of delivering this information, we anticipated that the technology landscape would be utterly changed five years on. We believed that there would be ways to deliver SGML to end users by the very end of the 20th century (we were also aware, in 1994, that there were already perfectly acceptable ways of converting and delivering SGML information). We have found that XML is the means to that end which we hoped would appear. XML is a rapidly developing World Wide Web Consortium (W3C) standard, which will permit the direct delivery of tagged information to end users. An XML audit of our textbase, carried out in 1998, showed us that (for delivery purposes) our textbase could be transformed from SGML to XML without any loss of its intellectual value. We are able today to deliver our richly-tagged information to a client program, running either in an XML browser such as Internet Explorer 5 or in a custom application which supports XML through third-party software such as IBM's XML toolkit.

User Needs Assessment

Having received a great deal of positive encouragement from the scholarly community that the information we are developing is of considerable interest to them, we began a formal process of user needs assessment. A richly-tagged textbase such as ours can be exploited by end users in a wide variety of ways:

subject-specific searches can create customized chronologies and research texts for reading

imposing chronological limits can highlight issues and create connections which standard "period" labels obscure

consistent tagging allows two or more documents to be compared "side-by-side", to reveal new insights about authors and their context
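
As an illustration of the first two uses listed above, the following minimal Python sketch builds a customized chronology by selecting events on a keyword and on chronological limits, then sorting them by date. It is not the project's software: the element names (EVENT, DATE, TEXT, KEYWORD), the YEAR and MONTH attributes, and the function name build_chronology are all assumptions made for the example, and the sample events are abbreviated from the three events quoted earlier (only the first of which carries keywords in the source).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: EVENT, DATE, TEXT, KEYWORD and the YEAR/MONTH attributes
# are illustrative names only, not the Orlando Project's actual tag set.
SAMPLE_EVENTS = """
<EVENTS>
  <EVENT>
    <DATE YEAR="1871">by 1871</DATE>
    <TEXT>The Royal Commission on the Contagious Diseases Acts reported.</TEXT>
  </EVENT>
  <EVENT>
    <DATE YEAR="1863" MONTH="8">August 1863</DATE>
    <TEXT>Florence Nightingale corresponded with Harriet Martineau.</TEXT>
  </EVENT>
  <EVENT>
    <DATE YEAR="1863">1863</DATE>
    <TEXT>Florence Nightingale privately printed an anonymous pamphlet.</TEXT>
    <KEYWORD>law and legislation</KEYWORD>
    <KEYWORD>body/health - venereal disease</KEYWORD>
  </EVENT>
</EVENTS>
"""


def build_chronology(xml_text, keyword=None, start=None, end=None):
    """Return (date label, text) pairs for matching events, in date order.

    keyword: keep only events tagged with this keyword (None keeps all).
    start, end: inclusive year limits (None leaves that side open).
    """
    root = ET.fromstring(xml_text)
    selected = []
    for event in root.iter("EVENT"):
        date = event.find("DATE")
        year = int(date.get("YEAR"))
        # Events dated only to a year sort before dated months of that year.
        month = int(date.get("MONTH", "0"))
        keywords = [k.text for k in event.findall("KEYWORD")]
        if keyword is not None and keyword not in keywords:
            continue
        if start is not None and year < start:
            continue
        if end is not None and year > end:
            continue
        selected.append(((year, month), date.text, event.find("TEXT").text))
    selected.sort(key=lambda item: item[0])
    return [(label, text) for _, label, text in selected]


if __name__ == "__main__":
    for label, text in build_chronology(SAMPLE_EVENTS, start=1863, end=1863):
        print(f"{label}: {text}")
```

Because the selection works on tags rather than on display text, the same function serves both a subject search (the keyword argument) and a chronological limit that cuts across conventional period labels (the start and end arguments).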

The most important issue for us was to "bridge" between the complex tag set which we have created and the terminology and information expectations which will characterise our end users. The strengths of our tags are their rigour and the highly detailed descriptions of their meaning. Their deficiency (from the point of view of the end user) is that this knowledge is locked up in a single tag name which may be opaque (such as our Cultural Formation tag) or dangerously obvious (such as our Name tag, which has a precise meaning and occupies a specific niche in a constellation of about a dozen "personal name" tags). In order to drive the development of the software from the users' point of view (rather than our own), we struck a Pilot Users Group. This group (about a dozen people) were drawn from representative communities who we expect will be interested in accessing our information, including:

professors, graduate students, and undergraduate students

scholars in fields such as English literature and History

librarians and information scientists

The program for this group was devised in order to elicit their expectations and desires for our software, without raising the question of what the software would look like or how it would work. We began with meetings where the group were given only written and oral accounts of our Project's goals and content; we elicited the group's own descriptions and terminology for our areas of interest. In the fall of 1999, building upon our team's sense of what kinds of access we could provide to end users, the Pilot Users Group was asked to comment on an on-screen mock-up of our delivery software. These sessions were conducted as formal focus groups [Greenbaum; Jordan]; the sessions were recorded, and team notetakers wrote down the comments and suggestions from the users group. Because the software on-screen was truly "throw away", we were able to genuinely encourage the users to critique it and explore their preferences and expectations. We also surveyed the computer equipment and level of experience of the user group; we will expand this survey, to make sure we create delivery software which our target users can run, and which they will be able to learn to use effectively.

Software Architecture

Our prototype delivery software is being written in a client/server fashion. The client end is a Java program which uses XML-aware code to request XML documents from the server, process them (by sorting, selecting, and sub-setting), and then display them using XSL (the XML stylesheet language). Although it is technically possible to execute this part of the process inside an XML-capable browser, the nature of our textbase and the kinds of interaction which we wish to provide are rather unlike the Web-page metaphor. Our textbase can be queried to draw together coherent document sub-sections from many documents at once, which can be presented to the user in various forms, such as a customised chronology or a synoptic view of relevant sections from the lives or works of many authors at once. For this reason we feel the creation of an independent delivery program is desirable.

A similar consideration operates with respect to linking within our textbase. We are implementing a much richer form of linking than the Web at present provides; a great deal of the linking which end users will be able to explore will be generated automatically from the carefully and consistently tagged text. Users who are viewing text of interest will be able to pursue that interest by traversing automatic links which will open up from our elaborately tagged text.

The server side of this architecture will make available our tagged textbase (as an XML document collection), and will respond to user queries by selecting and sending XML documents to the client program. We have explored various technologies to provide this searching and delivery on the back end, including Java and CGI implementations (using both Perl and SGREP to handle the searching).

The obvious advantage of this approach is that the server can be implemented in more than one way (and be revised and extended as new technologies appear), while the front-end client program remains the same (or is extended and improved on an independent trajectory). We are making extensive use of standard technologies, such as XML, XSL, and HTTP (for the communication between client and server). This will aid the process of generalising this software to meet the needs of other users who wish to present SGML or XML text to users without "rendering it down" to display-only formats like HTML.
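
The automatically generated linking described in this section can be pictured with a small sketch. The Python below is illustrative only (the prototype client itself is written in Java): it indexes every tagged personal name by the document and section in which it occurs, so that a reader viewing one occurrence can be offered links to all the others. The document identifiers, the element names (DOCUMENT, SECTION, NAME), the ID attribute, and both function names are assumptions made for the example, and the second document's text is a placeholder.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents and markup: the identifiers, the DOCUMENT, SECTION
# and NAME elements, and the ID attribute are illustrative assumptions, not
# the Orlando Project's actual tag set.
DOCUMENTS = {
    "doc-a": """<DOCUMENT><SECTION ID="s1"><NAME>Florence Nightingale</NAME>
        corresponded with <NAME>Harriet Martineau</NAME>.</SECTION></DOCUMENT>""",
    "doc-b": """<DOCUMENT><SECTION ID="s1">Placeholder text mentioning
        <NAME>Harriet Martineau</NAME>.</SECTION></DOCUMENT>""",
}


def build_link_index(documents):
    """Map each tagged name to the (document id, section id) pairs citing it."""
    index = defaultdict(list)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for section in root.iter("SECTION"):
            target = (doc_id, section.get("ID"))
            for name in section.iter("NAME"):
                # Normalise internal whitespace so line breaks do not split keys.
                key = " ".join(name.text.split())
                if target not in index[key]:
                    index[key].append(target)
    return dict(index)


def links_from(index, name, current):
    """Return every other place where the name is tagged, excluding 'current'."""
    return [target for target in index.get(name, []) if target != current]


if __name__ == "__main__":
    index = build_link_index(DOCUMENTS)
    print(links_from(index, "Harriet Martineau", ("doc-a", "s1")))
```

No link is authored by hand in this sketch; the links fall out of consistent tagging, which is why tagging consistency matters so much for this style of delivery.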

Issues

XML is an emerging standard. The software support for XML is beginning to appear; our strategy will be more effective as XML becomes ubiquitous and a variety of robust XML-capable tools emerge.

The current effort is a "prototype"; the exercise of deploying it will have both successes and failures, from which we will learn.

We have been very careful to avoid using the "Web" metaphor - our textbase can be delivered in ways which are much more dynamic and more informative than a Web delivery metaphor would imply. This ambition is to some extent undercut by the expectations of our Pilot Users Group, who came to the material with "Web on the brain". A classic case of this was the specific comment that we ought not to use a certain shade of blue for text if it was not a link, because "blue means link".

References

Thomas Greenbaum. The Handbook for Focus Group Research. Second edition. Sage Publications, 1997.

Patrick Jordan et al. Usability Evaluation in Industry. Taylor & Francis, 1996.

Steve McConnell. Rapid Development. Microsoft Press, 1996.

Terry Butler and Sue Fisher. Orlando Project: Issues when Moving from SGML to XML for Delivery of Content-Rich Encoded Text. Presentation at Markup Technologies '98, Chicago, Nov. 12-13, 1998.

Terry Butler. Can a Team Tag Consistently? Experiences on the Orlando Project. Presentation given at ACH-ALLC 1999, Charlottesville VA, June 1999.