SMART Project: Methods for Computer-based Research of Premodern Chinese Texts

Christian Wittern
Chung-Hwa Institute of Buddhist Studies, Taiwan

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt
Keywords: Stylistics

This presentation will start with a look at some of the problems encountered so
far in a number of projects that tried to apply TEI [TEIP3] markup to premodern
Chinese Buddhist texts. I have been working with the TEI Guidelines for more
than seven years and published the first text, rather heavily marked up in TEI
fashion, in 1995 (the Chan-Buddhist genealogical history Wudeng Huiyuan, first
printed in 1253, on the ZenBase1 CD-ROM; see [App et al 95]). Since then I have
become involved with some
other projects digitizing Chinese Buddhist texts, most prominently the work of
the Chinese Buddhist Electronic Text Association (CBETA; the project website,
mostly in Chinese, is at <>). We now have about 200 MB of texts with basic
markup according to the Guidelines (this basic markup follows the general ideas
outlined in [Wit96]).

All of these projects worked from printed editions published 80-100 years ago.
One of the most obvious problems we encountered is the large number of
non-standard characters found in these texts. TEI, and SGML in general, can
handle this quite elegantly; nevertheless, there are some important details
that should be noted. (I will not go into detail for this audience, but some
references to these problems can be found in the work by the Chinese Character
Analysis Group. More recently, we have based our efforts on the work done by
the Mojikyo Font Institute in Japan <>.) Some of the more subtle
problems involve structural elements specific to texts of the sphere of Chinese
cultural influence. One example is the notion of a scroll, which is carried
over from the time when the documents were actually written on scrolls but
still marks divisions in the printed editions. Being based on the physical
medium, scroll divisions fall into a similar category as the LB, PB and
MILESTONE elements in TEI, but they are usually associated with some other
heading-like text, colophons and the like. While this could be handled in some
way with the FW element, we decided to introduce a new element, JUAN (Chinese
for scroll), and encode the information therein. Other structural elements that
presented difficulties include sound glosses in the text, and colophons or
other backmatter-like text that occurs at the end of a scroll but in the middle
of a DIV element that continues on the next scroll.

A second part of this presentation will give an overview of the recent
developments in the SMART (System for Markup and Retrieval of Texts)
project (the project website is at <>). This
project aims at providing a working environment for research and markup on East
Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other
international, open standards. The environment aims to enable network-based
collaboration and layered, private markup added to a central repository of
texts, but it is also intended to be usable on stand-alone machines
without a live connection to the Internet. So far, the basic framework has been
outlined and some of the utilities built. Originally, the plan was to develop
this into a collection of open modules that can interact through an open
protocol, in the spirit of presentations at ACH/ALLC 1999 by Michael
Sperberg-McQueen, Jon Bradley and others. However, since such a protocol
specification is far from being finalized, I found that I would rather have a
concrete implementation to experiment with and to iron out problems. I therefore
recently decided to build the tools I would need on top of the Zope
Web application platform (for more information on Zope see <>). This is an
open-source project built mainly in Python, implementing an object-oriented database and a
complete framework for developing dynamic Web applications. It has strong
support for XML and related standards and thus seems especially suited to the
purpose at hand. All methods are exposed through a URL-based interface and are
also callable through XML-RPC.

The presentation in the context of the ALLC/ACH conference aims at contributing
to a discussion of how such an open framework can be implemented, while at the
same time showing some of the problems that arise when dealing with East Asian
languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally
mark word boundaries, and even the definition of a word is highly disputed
among linguists. In this situation, a list of all occurring words in the manner
of a word-wheel cannot readily be compiled. Additionally, the texts used here contain
markup of textual variants, which complicates the creation of an index.
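
The word-boundary problem described above can be illustrated with a small
sketch. One widely used workaround for text without word boundaries is to index
overlapping character n-grams instead of words. This is an assumption made here
for illustration only and is not necessarily the method developed for SMART;
the function names, the sample strings and the identifiers "t1" and "t2" are
all hypothetical.

```python
from collections import defaultdict


def build_ngram_index(texts, n=2):
    """Map every overlapping n-character sequence to its (text id, offset) positions."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((text_id, offset))
    return index


def search(index, query, n=2):
    """Return sorted (text id, offset) pairs where the whole query string occurs."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    # Candidate start positions come from the first n-gram of the query.
    candidates = set(index.get(query[:n], []))
    # Every later n-gram must occur at the matching relative offset.
    for shift in range(1, len(query) - n + 1):
        positions = set(index.get(query[shift:shift + n], []))
        candidates = {(t, o) for (t, o) in candidates if (t, o + shift) in positions}
    return sorted(candidates)


# Hypothetical sample data: two short strings keyed by made-up identifiers.
texts = {"t1": "五燈會元卷第一", "t2": "景德傳燈錄卷第一"}
index = build_ngram_index(texts)
hits = search(index, "卷第一")
```

With this sample data, hits is [("t1", 4), ("t2", 5)]: the query is found
without any prior decision about where words begin or end. The sketch does not
address textual variants or differing character encodings.
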
Furthermore, different representations of the same character in machine-readable
encodings have to be accounted for. An indexing method that takes these problems
into account and also provides an abstraction from indexing of actual low-level
locations in the text has been developed (more information can be
found in [Wit99]).

The SMART project will be utilized in two different contexts:

1. As a retrieval and interface engine for the Buddhist text database
produced by the Chinese Buddhist Electronic Text Association. SMART will
allow for retrieval with enhanced queries and for adding markup based on these
queries, thus providing a powerful way to gradually enrich the
markup.

2. As the central research platform for a research project on texts of
the Chan school of Chinese Buddhism. Here, a smaller corpus of texts is
used for building not only texts with rich markup but also supporting
databases of proper names, sites and historical dates to allow for
knowledge-base centered retrieval of the texts.

A demonstration of both applications will be given in this presentation.

References

Urs App, Christian Wittern. A New Strategy for Dealing with Missing Chinese
Characters. Humanities and Information Processing 10, pp. 52-59. February 1996.

Urs App, Fujimoto Kumiko, Christian Wittern. ZenBase CD1. Kyoto: International
Institute for Zen Buddhism, 1995.

Chinese Character Analysis Group. Chinese Character Code for Information
Interchange, Vol. I-III. Taipeh, 1980, 1982, 1985.

Nicola Calzolari, Antonio Zampolli. Lexical Databases and Textual Corpora: A
Trend of Convergence between Computational Linguistics and Literary and
Linguistic Computing. Research in Humanities Computing 1. Oxford: Clarendon,
1991, pp. 273-307.

Ian Lancashire. The Humanities Computing Yearbook 1989-90: A Comprehensive
Guide to Software and other Resources. Oxford: Clarendon, 1991.

Hans-Walter Latz. Entwurf eines Modells der Verarbeitung von SGML-Dokumenten in
versionsorientierten Hypertext-Systemen: Das HyperSGML Konzept. Diss.,
Technische Universität Berlin, 1992.

Michael Neuman. You Can't Always Get What You Want: Deep Encoding of
Manuscripts and the Limits of Retrieval. Research in Humanities Computing 5.
Oxford: Clarendon, 1996, pp. 209-219.

Peter M. W. Robinson. Collate: A program for Interactive Collation of Large
Textual Traditions. Research in Humanities Computing 3. Oxford: Clarendon,
1994, pp. 32-45.

C. Michael Sperberg-McQueen. Text Encoding and Enrichment. In: Ian Lancashire,
The Humanities Computing Yearbook 1989-90: A Comprehensive Guide to Software
and other Resources. Oxford: Clarendon Press, 1991, pp. 503f.

C. Michael Sperberg-McQueen, Lou Burnard. Guidelines for Electronic Text
Encoding and Interchange. Chicago, Oxford, 1994.

Christian Wittern. Chinese Character Encoding. The Electronic Bodhidharma
No. 3, pp. 44-47. July 1993.

Christian Wittern. Code und Struktur: Einige vorläufige Überlegungen zum Aufbau
chinesischer Volltextdatenbanken. Chinesisch und Computer No. 9, pp. 15-21.
April 1994.

Christian Wittern. The IRIZ KanjiBase. The Electronic Bodhidharma No. 4,
pp. 58-62. June 1995.

Christian Wittern. Chinese character codes: an update. The Electronic
Bodhidharma No. 4, pp. 63-65. June 1995.

Christian Wittern. Minimal Markup and More - Some Requirements for Public
Texts. Conference presentation at the 3rd EBTI meeting on April 7th, 1996 in
Taipei, Taiwan. 1996.

Christian Wittern. SMART: Format of the Index Files. 1999. Technical note
published on the Internet at <>. (First published July 20th, 1999, last revised
January 10th, 2000)

Koichi Yasuoka, Yasuko Yasuoka. Kanjibukuro. Kyoto, 1996. <>