Mots15 - An interactive concordance system (built from mostly
off-the-shelf parts)

Paul Meurer, University of Bergen, paul.meurer@hit.uib.no
Michael Sperberg-McQueen, World Wide Web Consortium, cmsmcq@acm.org

ALLC/ACH 2002, University of Tübingen, Tübingen, 2002
Editor: Harald Fuchs
Encoder: Sara A. Schmidt

Mots 15 is an interactive Web-based concordance or full-text retrieval system
built mostly out of off-the-shelf software.

The goals of the Mots 15 project are:

- to build a reasonably capable full-text retrieval system, with
  functionality generally similar to Tact, ARRAS, and the like, but
  with better markup awareness
- to keep the minimum investment low for both implementors and users
- to allow experimentation with interesting parts of the query system

From these design goals follow several design principles:

- simplicity of implementation
- use of off-the-shelf components wherever possible
- modularity, loose coupling among modules using predefined
  interfaces wherever possible

1. Basic interfaces in a query system

1.1. Monoliths

At a very simple level, an interactive query system simply accepts
queries from a user and returns responses drawn from the data. In systems
like ARRAS and Tact, a single monolithic software package controls
everything in the diagram.

Image 1: A monolithic query system

1.2. Web interface

With the advent of graphical browsers for the World Wide Web, however, it
is possible to provide a fairly attractive interface at a much lower
cost than would otherwise be possible. It may still make sense to devise
special-purpose user interface software for specific purposes, but we
can go a long way without it, just relying on the users to have chosen a
Web browser they like reasonably well. The Web, that is, exposes an
interface between the user interface and the data in the back end.

Image 2: A Web-based query system

This interface sets certain limits on our freedom: we must now use HTML
to describe what the user sees (or arbitrary XML, if we are willing to
require that the user have an XML-capable browser), and the user's
interactions with the server are limited to what can be done using HTML
forms (and possibly a browser scripting language like JavaScript).
Within those limits, however, we can develop better user interfaces at a
lower cost than if we were building from scratch.

Even more important, we can now swap front ends and back ends in and out. We
can experiment with different user interfaces by writing different
front-end forms and HTML style sheets. In theory, we can also experiment
with different back ends by substituting one for the other and using the
same front end; in practice, the existing systems built on this model
don't easily allow for swapping different back ends in and out, because
the interface between the front end and the back end varies with the
specific product used as the back end. Because different commercial
products rarely support identical interfaces, it is rarely possible to
swap in a new back end with minimal effort.

1.3. Mots 15

The Mots 15 system differs from the generic Web-based system primarily by
exposing a generic query interface in front of the back-end-specific
query interface, in order to buffer the front end and back end from each
other.

Image 3: Basic plan of MOTS query system

Ideally, this generic query interface should follow some open
specification; ideally, it should provide all the functionality we want
(to keep life simple), and no more (so that it is easy to build back
ends if we want to do it ourselves); the exact choice depends on the
tradeoff between these incompatible goals.

Assuming that we have some suitable query language, and a way to
translate from it into the query language of the back end, any XML
query engine may be used as the back end.

A SQL DBMS may be the most flexible back end. The design sketched by
Michael Sperberg-McQueen (MSM) for this would involve a few lightweight
scripts running on top of the SQL database system. In this design, the
SQL system itself would return not elements but element pointers, which
would be used to extract the elements from a saved copy of the XML file.

The task of translating from the open query language to the proprietary
back-end query language is, of course, simplified if the back end
accepts the open query language itself.

Another alternative would be to use the corpus management system Corpus
WorkBench (IMS, Stuttgart) as the MOTS back end, possibly combined with
an XML query facility. This would allow for more advanced linguistic
queries.

2. Pieces of Mots 15

Mots 15 is designed to make it relatively simple to specify and implement
each piece of the system. The better we succeed in this goal, the easier it
will be for us to experiment with different parts of the system, and the
easier it will be for eventual users to customize it for their own purposes.
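
As a rough illustration of what loosely coupled pieces can look like, the
following Python sketch chains a form-to-query translator, a back end, a
wrapper, and an XML-to-HTML translator behind a small transaction manager.
Everything in it is an assumption made for illustration and not part of
Mots 15 itself: the function names, the Query class standing in for the
open query language, the in-memory sample document, and the plain result
and hit element names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from html import escape
from typing import Callable, Dict, List

# Hypothetical sample text; a real system would read a stored XML file.
SAMPLE = "<poem><l>a rose is a rose</l><l>no flowers here</l></poem>"


@dataclass
class Query:
    """Stand-in for a query in the open query language (assumed shape)."""
    element: str
    word: str


# Any function with this signature can serve as a back end.
BackEnd = Callable[[Query], List[ET.Element]]


def form_to_query(form: Dict[str, str]) -> Query:
    # Form-to-query translator: HTML form fields -> open query.
    return Query(element=form["element"], word=form["word"])


def etree_back_end(query: Query) -> List[ET.Element]:
    # Toy back end that accepts the open query directly, so no separate
    # query-to-query translation step is needed in this sketch.
    root = ET.fromstring(SAMPLE)
    return [e for e in root.iter(query.element)
            if query.word in "".join(e.itertext()).split()]


def wrap(query: Query, hits: List[ET.Element]) -> ET.Element:
    # Wrapper: one outer element for the result, one element around each hit.
    result = ET.Element("result", {"word": query.word, "hits": str(len(hits))})
    for n, hit in enumerate(hits, start=1):
        ET.SubElement(result, "hit", {"n": str(n)}).append(hit)
    return result


def to_html(result: ET.Element) -> str:
    # XML-to-HTML translator, written by hand here to keep the sketch small.
    items = "".join(f"<li>{escape(''.join(h.itertext()))}</li>"
                    for h in result.iter("hit"))
    return f"<ul>{items}</ul>"


def run(form: Dict[str, str], back_end: BackEnd = etree_back_end) -> str:
    # Transaction manager: calls the other pieces in order.
    query = form_to_query(form)
    return to_html(wrap(query, back_end(query)))


print(run({"element": "l", "word": "rose"}))
# prints <ul><li>a rose is a rose</li></ul>
```

Because run depends only on the back end's call signature, a different
back end can be passed in (for example run(form, back_end=other_back_end),
where other_back_end is any function of the same shape) without changing
the other pieces.
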
Eventually, the designers hope that Mots 15 will grow into a library of
reusable and customizable pieces, which individuals and small projects can
modify to make useful special-purpose systems.

The Mots 15 design requires the following pieces of software:

- browser: an off-the-shelf Web browser; this handles the actual
  display of results on the user's screen and interaction with the
  user
- forms: one or more HTML forms which allow the user to specify
  searches; these produce an HTML-forms data stream which the browser
  submits to an appropriate CGI script
- form-to-query translator: a program to translate the forms data
  into a query, expressed in the open query language
- query-to-query translator: a program to translate the query from
  the open query language into the query language supported by the
  back end
- back end: a program which accepts queries in some (possibly
  proprietary) query language and returns results, e.g. a set of
  XML elements
- wrapper: a program which takes the results and places them in a
  two-level wrapper: (a) an outermost mots:result XML element and (b)
  an element, depending on the hit type, wrapped around each hit, each
  with attributes providing useful information about the query and its
  results
- XML-to-HTML translator: a program which takes the wrapped results
  and translates them into HTML suitable for display in the user's
  off-the-shelf browser
- transaction manager: a CGI script to manage the query/response
  transaction, by calling (or incorporating) the various other
  programs in this list; it may also be responsible for session
  management

3. Open problems and opportunities

The existing implementation of Mots 15 (as of November 2001) is a minimal
system with:

- a choice of several simple Web interfaces
- support for straightforward XML documents only
- XSLT stylesheets for XML-to-HTML translation
- a limited (XPath-based) query language, extended with word
  frequency queries

There are several obvious challenges for the future development of Mots
15:

- a serious Web interface (room for experiment)
- XML++ support:
  - display of parallel versions, textual variation
  - external and user-supplied annotation
  - proximity searching
- exploiting grammatical annotation of texts
- supporting documents with overlap (i.e. TexMECS)
- allowing users to search as if the text were marked up more simply
  than it is (e.g. with a uniform chapter/section/paragraph/sentence
  hierarchy)
- supporting more powerful back ends (either by means of wormholes
  in the open query language, or by means of a second interface)
- managing selection of texts from a corpus or collection; federated
searches