Multi Dimensions Concept-Based Information Retrieval System

Zainal A. Hasibuan
University of Indonesia, Indonesia

ALLC/ACH 2000, University of Glasgow, Glasgow, 2000
Editors: Jean Anderson, Amal Chatterjee, Christian J. Kay, Margaret Scott
Encoder: Sara A. Schmidt
Computational / Corpus Linguistics

Introduction

Most of the problems in information retrieval systems arise from three
sources: imprecise query and document representations, the user's changing
state of mind when judging document relevance, and the inability of the
retrieval technique to match the query to the information need. The traditional
approach to information retrieval systems, such as the Boolean-based retrieval
technique, cannot solve these problems (Belkin and Croft, 1987). According
to McCain (1989) and Pao and Worthen (1989), retrieval by keywords and
retrieval by cited documents yield different sets of documents: some
relevant documents were retrieved by keywords but not by references,
and vice versa. Hence, a new approach to retrieval is needed.

This study is intended to build a new approach to information retrieval
systems by employing the inherent structure of a document collection in an
effort to learn more about document components that might improve
information retrieval performance. The document components examined are the
pattern of keywords, citing documents, and cited documents. Three
independent variables were studied: co-keyword, co-citing document, and
co-cited document (Hasibuan, 1995). These three variables constitute the
multi-dimensions concept-based information retrieval system. Providing
these variables as entry points for searching relevant documents makes
the system more natural to use and better able to accommodate users'
information needs.

Methodology

A test collection was constructed from a collection of research articles
published by the National Atomic Research Agency (BATAN), covering the period
1985 through 1998. A program was written to automatically build indexes of
keywords, citing documents, and cited documents for each document. The
relationship that may occur between two documents can be depicted as in
Figure 1. Pair-wise document similarity is calculated on those three
variables. As in Figure 1, the similarities between documents A and B can be
viewed in terms of documents Q and S, which cite documents A and B
(co-cited), and documents Y and Z, which are cited by A and B (co-citing).
In addition, the similarity of documents A and B can be measured by
the number of shared terms (co-keyword).

Figure 1. Relationship of Two Documents (A and B)

Document similarity is measured by using the simple matching coefficient
proposed by Van Rijsbergen (1979) and Salton (1989). The similarity of
documents A and B is calculated as follows:

Similarity(A, B) = |X ∩ Y|

The variables X and Y represent the sets of index terms occurring in the two
documents. Hence, the similarity of document A and B is measured in terms of
the number of shared index terms. These sets of index terms can be expanded
to include values for citing documents and cited documents. The retrieval
technique used is based on this document similarity.

Results

The preliminary results of the research showed that the relevant documents
retrieved by one component did not always agree with those retrieved by the
other components (see Figure 1). Figure 1 shows that document 0376 and
document 0419 have 11 shared index terms, two shared cited documents, and
three shared citing documents.
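
The three frequencies reported for a document pair can be sketched as set intersections, following the simple matching coefficient given in the Methodology. The sketch below is only an illustration under stated assumptions: the `Document` record, the `cited_by_index` and `similarity` helpers, and the toy keywords are hypothetical and are not taken from the system itself. The toy collection mirrors the relationship among documents A, B, Q, S, Y, and Z described in the Methodology, and the labels `co_cited` and `co_citing` follow the definitions given there.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    """Hypothetical record: index terms plus outgoing references."""
    doc_id: str
    keywords: frozenset = field(default_factory=frozenset)
    references: frozenset = field(default_factory=frozenset)  # documents this one cites


def cited_by_index(docs):
    """Invert the reference lists: doc_id -> set of documents that cite it."""
    index = {d.doc_id: set() for d in docs}
    for d in docs:
        for ref in d.references:
            index.setdefault(ref, set()).add(d.doc_id)
    return index


def similarity(a, b, cited_by):
    """Simple matching |X ∩ Y|, computed separately on each dimension."""
    return {
        "co_keyword": len(a.keywords & b.keywords),
        # A and B are co-cited when the same documents cite both of them.
        "co_cited": len(cited_by[a.doc_id] & cited_by[b.doc_id]),
        # A and B are co-citing when both cite the same documents.
        "co_citing": len(a.references & b.references),
    }


# Toy collection (assumed data): Q and S cite A and B; A and B cite Y and Z.
docs = [
    Document("A", frozenset({"reactor", "isotope", "dosimetry"}), frozenset({"Y", "Z"})),
    Document("B", frozenset({"reactor", "isotope", "shielding"}), frozenset({"Y", "Z"})),
    Document("Q", frozenset({"reactor"}), frozenset({"A", "B"})),
    Document("S", frozenset({"isotope"}), frozenset({"A", "B"})),
    Document("Y", frozenset({"dosimetry"})),
    Document("Z", frozenset({"shielding"})),
]
by_id = {d.doc_id: d for d in docs}
cited_by = cited_by_index(docs)

print(similarity(by_id["A"], by_id["B"], cited_by))
```

For this toy collection, each of the three counts for the pair A and B is 2. A pair with no common citing or cited documents simply scores 0 on those dimensions.
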
As expected, most document pairs have zero shared cited documents and
zero shared citing documents. This finding is in line with our previous
research (Hasibuan, 1995). These results suggest building a
multi-dimension concept-based retrieval system. The system built provides
users with a facility to search using several interrelated search
strategies. With this facility, a user has more flexibility: he/she can
start a search with one of the dimensions, say keyword, and then navigate
the system using the other dimensions of document similarity. At any
moment, the documents retrieved can be viewed, evaluated, and judged for
their relevance.

Figure 1. A Portion of Search Results of the Multi-dimension Concept-based Information Retrieval System

The search results shown in Figure 1 are presented in hypertext form, so that
a user can continue browsing without interruption and extend his/her
search to retrieve other possibly relevant documents. For each pair of
documents retrieved, the system provides the frequency of its
co-keywords, co-cited documents, and co-citing documents. Furthermore,
each non-zero frequency in the co-citing and co-cited columns becomes an
active, clickable icon. If we click 2 in the co-cited column, we can see
the documents that are co-cited by documents 0376 and 0419 (see Figure 2):
documents 0523 and 0531. The abstract of each document retrieved can be
viewed by clicking the document number (see Figure 3).

Figure 2. An Example of Co-cited Documents

Figure 3. An Example of Abstract Document Retrieved

Conclusion

The multi-dimensions concept-based retrieval system provides a flexible
facility for searching information by utilizing the components of
documents: co-keywords, co-cited documents, and co-citing documents. With
this facility, the system can accommodate a wide range of user search
strategies in information seeking. The drawback of the system compared to
the traditional
system is that the new approach needs more computer storage to
accommodate all of its index files. Furthermore, it can slow down the
search process. However, this trade-off is compensated for by the
flexibility of the system, which provides more search strategies, more
comprehensive retrieval of relevant documents, and easy browsing from one
document to another. Ultimately, by utilizing these three
components of a document, the system can reduce the possibility of low
retrieval performance due to imprecise query and document
representations, changing relevance judgements, and a poor match
between query and document.

References

Belkin, Nicholas J. and Croft, W. Bruce (1987). Retrieval Techniques. Annual Review of Information Science and Technology, 22: 109-145.

Hasibuan, Zainal A. (1995). Document Similarity and Structure: Using Bibliometric Methods and Index Terms As Approaches to Improving Information Retrieval Performance. Doctoral Dissertation, School of Library, Archive and Information Science, Indiana University, Bloomington, Indiana.

McCain, Katherine W. (1989). Descriptor and Citation Retrieval in the Medical Behavioral Sciences Literature: Retrieval Overlaps and Novelty. Journal of the American Society for Information Science, 40(2): 110-114.

Pao, Miranda L. and Worthen, Dennis B. (1989). Retrieval Effectiveness by Semantic and Citation Searching. Journal of the American Society for Information Science, 40(4): 226-235.

Salton, Gerard (1989). Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. New York: Addison-Wesley Publishing Company.

Van Rijsbergen, S.R. (1979). Information Retrieval. London: Butterworth.