| Inference Networks for Document Retrieval | | BIBA | 1-24 | |
| Howard Turtle; W. Bruce Croft | |||
| The use of inference networks to support document retrieval is introduced. A network-based retrieval model is described and compared to conventional probabilistic and Boolean models. | |||
| A Retrieval Model Based on an Extended Modal Logic and its Application to the RIME Experimental Approach | | BIBA | 25-43 | |
| Yves Chiaramella; Jianyun Nie | |||
| This paper focuses on the query processing module of RIME, an experimental prototype of an intelligent information retrieval system designed to handle high-precision queries on a corpus of medical reports. Though highly specific, this particular corpus is representative of an important class of applications: information retrieval among full-text specialized documents, which constitute critical sources of information in several organizations (medicine, law, the space industry, etc.). This experience allowed us to design and implement an elaborate model for the semantic content of documents which is an extension of the Conceptual Dependency approach. The underlying retrieval model is inspired by the logic model proposed by C. J. van Rijsbergen, considerably refined here using an Extended Modal Logic. After presenting the context of the RIME project, we briefly describe the models designed for the internal representation of medical reports and queries. The main part of the paper is then devoted to the retrieval model and its application to the query processing module of RIME, which has a natural language interface. Processing a query involves two main phases: interpretation, which transforms the natural language query into a search expression, and evaluation, which retrieves the corresponding medical reports. We focus here on the evaluation phase and show its relationship with the underlying retrieval model. Results from practical experiments are also given, with indications about current developments of the project. | |||
| Probabilistic Document Indexing from Relevance Feedback Data | | BIBA | 45-61 | |
| Norbert Fuhr; Chris Buckley | |||
| Based on the binary independence indexing model, we apply three new concepts
for probabilistic document indexing from relevance feedback data:
1. Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information for parameter estimation.
2. Flexibility of the representation, which allows the integration of new text analysis and knowledge-based methods into our approach, as well as the consideration of more complex document structures or different types of terms (e.g., single words and noun phrases).
3. Probabilistic learning or classification methods for the estimation of the indexing weights, making better use of the available relevance information.
We give experimental results for five test collections which show improvements over other indexing methods. | |||
| EXPRESS: An Experimental Interface for Factual Information Retrieval | | BIBA | 63-81 | |
| Heinz Ulrich Hoppe; Karin Ammersbach; Barbara Lutes-Schaab; Gaby Zinssmeister | |||
| The EXPRESS system has been designed and implemented in order to explore methods for user assistance in accessing complexly structured factual databases, e.g. relational product databases. Terminological support in this area has to take into account that different controlled vocabularies may be used in a variety of attributes spread over several relations. In our approach, traditional thesaurus structures are extended in order to cope with these problems and to encode further domain-specific knowledge. User support in query reformulation is based on this enriched thesaurus as well as on the local evaluation of the retrieved data sets. Concepts for the representation of retrieval strategies in the form of plans and their potential use in future systems are discussed. | |||
| Hypertext, Full Text, and Automatic Linking | | BIBA | 83-98 | |
| James H. Coombs | |||
| Current computing systems typically support only mid-century information structures: simple hierarchies. Hypertext technologies enable users to impose many structures on document sets and, consequently, provide many paths to desired information, but they require that users work their way through some structure. Full-text search eliminates this requirement by ignoring structure altogether. The search strategy can also be restricted to work within specified contexts. The architecture provided for search readily supports automatic linking. These ideas have been tested in IRIS Intermedia. | |||
| Machine Learning and Vectorial Matching for an Image Retrieval Model: EXPRIM and the System RIVAGE | | BIB | 99-114 | |
| G. Halin; M. Crehange; P. Kerekes | |||
| Online Query Refinement on Information Retrieval Systems: A Process Model of Searcher/System Interactions | | BIBA | 115-133 | |
| Hsinchun Chen; Vasant Dhar | |||
| This article reports the findings of empirical research that investigated information searchers' online query refinement process. Prior studies have recognized the information specialist's role in helping searchers articulate and refine queries. Using a semantic network and a Problem Behavior Graph to represent the online search process, our study revealed that searchers also refined their own queries in an online task environment. The information retrieval system played a passive role in assisting online query refinement, a finding that nonetheless confirmed Taylor's four-level query formulation model. Based on our empirical findings, we propose using a process model to facilitate and improve query refinement in an online environment. We believe that incorporating this model into retrieval systems can result in the design of more "intelligent" and useful information retrieval systems. | |||
| A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query | | BIBA | 135-150 | |
| Peter G. Anick; Jeffrey D. Brennan; Rex A. Flynn; David R. Hanssen; Bryan Alvey; Jeffrey M. Robbins | |||
| This paper describes the design of a direct manipulation user interface for Boolean information retrieval. Intended to overcome the difficulties of manipulating explicit Boolean queries as well as the "black box" drawbacks of so-called natural language query systems, the interface presents a two-dimensional graphical representation of a user's natural language query which not only exposes heuristic query transformations performed by the system, but also supports query reformulation by the user via direct manipulation of the representation. The paper illustrates the operation of the interface as implemented in the AI-STARS full-text information retrieval system. | |||
| Determining the Functionality and Features of an Intelligent Interface to an Information Retrieval System | | BIBA | 151-177 | |
| Nicholas J. Belkin; Pier Giorgio Marchetti | |||
| In this paper, we propose a method for specifying the functionality of an intelligent interface to large-scale information retrieval systems and for implementing those functions in an operational environment. The method is based on: a progressive, three-stage model of intelligent information support; a high-level cognitive task analysis of the information retrieval problem; a low-level specification of the host system functionality; and the derivation of explicit relations between the system functions and the cognitive tasks. This method is applied, by example, in the context of the European Space Agency Information Retrieval Service, with some specific suggestions for the implementation of a stage-one intelligent interface to that system. | |||
| Using Syntactic Analysis in a Document Retrieval System that Uses Signature Files | | BIBA | 179-192 | |
| Ron Sacks-Davis; Peter Wallis; Ross Wilkinson | |||
| Our work involves the study of the extent to which natural language processing techniques aid the automatic indexing and retrieval of documents. In this paper we describe the use of signature files in large text retrieval systems. We show that good performance can be obtained without the significant overheads required by the inverted file technique. We examine the use of syntactic analysis of the text at all stages of retrieval and argue that an initial Boolean query should be performed to provide a subset of documents, which are then ranked. We then give an algorithm for generating such queries, taking into account the syntactic structure of the queries. | |||
| A Dynamic Signature Technique for Multimedia Databases | | BIBA | 193-210 | |
| F. Rabitti; P. Zezula | |||
| A signature file acts as a filtering mechanism to reduce the amount of data
that needs to be searched during query evaluation. Even though several
techniques for organizing and searching signature files have been proposed in
the literature, they have serious limitations when applied to multimedia
databases, where integrated access methods to text and image content are needed.
A new signature technique, called Quick Filter, is proposed in this paper. According to this technique, signatures are divided into partitions, each of which holds signatures sharing the same characteristic key. As a result, it is possible to determine whether the signatures in a partition can satisfy a query by merely examining the key; partitions not matching the key need not be searched. The method is based on dynamic hashing: signatures are hashed into partitions according to the keys and the file size, computed algorithmically from the signatures. The implementation of this technique is illustrated with an example and verified by analytical performance evaluation. The result is a signature technique which satisfies the requirements for access methods in multimedia databases: dynamicity with respect to insertions and updates, and good query-processing performance on large databases for high-weight queries. | |||
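The partition-level test that Quick Filter relies on (a whole partition can be rejected by examining only its key) can be illustrated with a minimal sketch. The 2-bit key, the superimposed-coding match rule, and all names here are illustrative assumptions, not details taken from the paper:

```python
# Toy sketch of key-partitioned signature filtering, Quick Filter style.
# Assumption: superimposed coding, so a document signature matches a
# query signature q iff (sig & q) == q; the partition key is the low
# KEY_BITS of the signature.
KEY_BITS = 2

def key_of(sig):
    return sig & ((1 << KEY_BITS) - 1)

def build(signatures):
    parts = {}
    for sig in signatures:
        parts.setdefault(key_of(sig), []).append(sig)
    return parts

def search(parts, qsig):
    qkey = key_of(qsig)
    hits = []
    for key, sigs in parts.items():
        if key & qkey != qkey:
            continue  # key alone proves no signature here can match
        hits.extend(s for s in sigs if s & qsig == qsig)
    return hits
```

Because the key bits of any matching signature must cover the query's key bits, one comparison rejects an entire partition; the paper's dynamic-hashing scheme additionally grows the number of key bits with the file size.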
| Surrogate Subsets: A Free Space Management Strategy for the Index of a Text Retrieval System | | BIBA | 211-226 | |
| F. J. Burkowski | |||
| This paper presents a new data structure and an associated strategy to be utilized by indexing facilities for text retrieval systems. The paper starts by reviewing some of the goals that may be considered when designing such an index and continues with a small survey of various current strategies. It then presents an indexing strategy referred to as surrogate subsets discussing its appropriateness in the light of the specified goals. Various design issues and implementation details are discussed. Our strategy requires that a surrogate file be divided into a large number of subsets separated by free space which will allow the index to expand when new material is appended to the database. Experimental results report on the utilization of free space when the database is enlarged. | |||
| Construction of a Dynamic Thesaurus and Its Use for Associated Information Retrieval | | BIBAK | 227-240 | |
| Haruo Kimoto; Toshiaki Iwadera | |||
| An information retrieval system based on a dynamic thesaurus was developed
using the connectionist approach. The dynamic thesaurus consists of nodes,
which represent the terms of a thesaurus, and links, which represent the
connections between nodes. Term information that is automatically extracted
from a user's relevant documents is used to change node weights and generate
links. Node weights and links thus reflect a user's particular interests. A
document retrieval experiment was conducted in which both a high recall rate
and a high precision rate were achieved.
Keywords: Connectionist model, Automatic indexing, Information retrieval, Thesaurus | |||
| Knowledge Based Retrieval of Office Documents | | BIBA | 241-253 | |
| Augusto Celentano; Maria Grazia Fugini; Silvano Pozzi | |||
| Document classification and retrieval systems for office applications require knowledge management to describe the semantics of documents with respect to procedural and domain-dependent aspects of office work: operational dependencies, document relationships, and references to regulations and laws are concepts which are not explicitly stated in the text. This paper presents a semantic model for office document classification and retrieval based on knowledge representation of the office procedural environment and of the application domain. Navigation along knowledge networks for document retrieval and browsing is described. | |||
| Evaluation of an Expert System for Searching in Full Text | | BIBA | 255-277 | |
| Susan Gauch | |||
| This paper presents a prototype expert system which provides online search
assistance. The expert system automatically reformulates queries, using an
online thesaurus as the source of domain knowledge, and a knowledge base of
domain-independent search tactics. The expert system works with a full-text
database which requires no syntactic or semantic pre-processing. In addition,
the expert system ranks the retrieved passages in decreasing order of probable
relevance.
Users' search performance with the expert system was compared with their performance searching on their own and with their performance using the online thesaurus. The following conclusions were reached: 1) The expert system significantly reduced the number of queries necessary to find relevant passages, compared with the user searching alone or with the thesaurus. 2) The expert system produced marginally significant improvements in precision compared with users searching on their own. There was no significant difference in the recall achieved by the three system configurations. 3) Overall, the expert system ranked relevant passages above irrelevant passages. | |||
| Order Preserving Minimal Perfect Hash Functions and Information Retrieval | | BIBA | 279-311 | |
| Edward A. Fox; Qi Fan Chen; Amjad M. Daoud; Lenwood S. Heath | |||
| Rapid access to information is essential for a wide variety of retrieval systems and applications. Hashing has long been used when the fastest possible direct search is desired, but is generally not appropriate when sequential or range searches are also required. This paper describes a hashing method, developed for collections that are relatively static, that supports both direct and sequential access. Indeed, the algorithm described gives hash functions that are optimal in terms of time and hash table space utilization, and that preserve any a priori ordering desired. Furthermore, the resulting order preserving minimal perfect hash functions (OPMPHFs) can be found using space and time that is on average linear in the number of keys involved. | |||
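The contract of an OPMPHF (a minimal range 0..n-1, no collisions, and preservation of a chosen a priori order) can be shown with a lookup-table stand-in. This is illustrative only; the construction described in the paper computes such a function in expected linear time via graph algorithms, without storing an explicit per-key table:

```python
def opmphf_stand_in(keys):
    """Stand-in exhibiting the OPMPHF contract: the i-th key, in the
    desired a priori order, hashes to i. Illustrative only; it is not
    the paper's space-efficient construction."""
    table = {k: i for i, k in enumerate(keys)}
    def h(key):
        return table[key]  # minimal, perfect, and order preserving
    return h
```

For the key set `["apple", "kiwi", "pear"]`, the hash values are exactly 0, 1, 2 and follow the given order, which is what makes both direct and sequential/range access possible.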
| On the Interrelationship of Dictionary Size and Completeness | | BIBA | 313-325 | |
| Hubert Huther | |||
| When dictionaries for specific applications or subject fields are derived
from a text collection, the frequency distribution of the terms in the
collection gives information about the expected completeness of the dictionary.
If only a subset of the terms in the collection is to be included in the
dictionary, the completeness of the dictionary can be optimized with respect to
dictionary size.
In this paper, formulas for the relationship between the frequency distribution of the terms in the collection and the expected dictionary completeness are derived. First we consider one-dimensional dictionaries, where the (non-trivial) terms occurring in the texts are to be included in the dictionary. Then we describe the case of two-dimensional dictionaries, which are needed, for example, for automatic indexing with a controlled vocabulary; here, relationships between text terms and descriptors from the prescribed vocabulary have to be stored in the dictionary. For both cases, formulas for interpolation and extrapolation with respect to different collection sizes are derived. We give experimental results for one-dimensional dictionaries and show how completeness can be estimated and optimized. | |||
| Construction of Optimal Graphs for Bit-Vector Compression | | BIBA | 327-342 | |
| Abraham Bookstein; Shmuel T. Klein | |||
| Bitmaps are data structures that occur often in information retrieval. They are useful; they are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them, and these techniques are most effective for sparse bitmaps. We propose a preprocessing stage in which bitmaps are first clustered, and the clusters are used to transform their member bitmaps into sparser ones that can be compressed more effectively. The clustering method efficiently generates a graph structure on the bitmaps. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings. | |||
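The sparsifying transform the abstract alludes to can be sketched as XOR-ing each member of a cluster against a representative bitmap, so similar bitmaps become near-empty difference maps. The representative choice and all names are illustrative assumptions, not the paper's graph-based clustering:

```python
def sparsify(cluster):
    # XOR each member against a representative; for similar bitmaps the
    # deltas have far fewer set bits, so they compress better.
    rep = cluster[0]  # toy choice of representative, not the paper's
    return rep, [b ^ rep for b in cluster]

def restore(rep, deltas):
    return [d ^ rep for d in deltas]  # XOR is its own inverse: lossless

def popcount(b):
    return bin(b).count("1")  # number of set bits
```

With the cluster `[0b1110, 0b1100, 0b1111]` the total set-bit count drops from 9 to 2 while the transform remains perfectly reversible.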
| Hypertext: "Growing Up?" | | BIBA | 343-347 | |
| M. E. Frisse; Maristella Agosti; Marie-France Bruandet; Udo Hahn; Stephen F. Weiss | |||
| This panel will employ two different interpretations of the phrase "growing
up" to address areas of common interest between hypertext and information
retrieval researchers. First, the panelists will question whether or not
hypertext is "growing up" as a scientific discipline; they will discuss
characteristics that separate hypertext research from other related
disciplines. Second, the panelists will discuss the problems encountered when
a hypertext system "grows up" in size and complexity; they will discuss the
very real problems expected when representing and integrating large knowledge
bases, accommodating multiple users, and distributing single logical hypertexts
across multiple physical sites.
The panelists will not lecture, but they will advance a number of themes including "the Myth of Modularity" (Frisse), "New Architectures Employing Hyperconcept Databases" (Agosti), "Hypertext in Software Engineering" (Bruandet), "Automatic Hypertext Generation" (Hahn), and "Large-Scale Hypertexts" (Weiss). | |||
| Experiments with Query Acquisition and Use in Document Retrieval Systems | | BIBA | 349-368 | |
| W. Bruce Croft; Raj Das | |||
| In some recent experimental document retrieval systems, emphasis has been placed on the acquisition of a detailed model of the information need through interaction with the user. It has been argued that these "enhanced" queries, in combination with relevance feedback, will improve retrieval performance. In this paper, we describe a study with the aim of evaluating how easily enhanced queries can be acquired from users and how effectively this additional knowledge can be used in retrieval. The results indicate that significant effectiveness benefits can be obtained through the acquisition of domain concepts related to query concepts, together with their level of importance to the information need. | |||
| The Automatic Generation of Extended Queries | | BIBA | 369-383 | |
| Carolyn J. Crouch; Donald B. Crouch; Krishna R. Nareddy | |||
| In the extended vector space model, each document vector consists of a set of subvectors representing the multiple concepts or concept classes present in the document. Typical information concepts, in addition to the usual content terms or descriptors, include author names, bibliographic links, etc. The extended vector space model is known to improve retrieval effectiveness. However, a major impediment to the use of the extended model is the construction of an extended query. In this paper, we describe a method for automatically extending a query containing only content terms (a single concept class) to a representation containing multiple concept classes. No relevance feedback is involved. Experiments using the CACM collection resulted in an average precision 34% better than that obtained using the standard single-concept term vector model. | |||
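An extended query of the kind described is matched by scoring each concept class against its own subvector and combining the results. This inner-product sketch, with made-up class names and weights, only illustrates the idea of per-class subvectors, not the paper's extension method:

```python
def extended_sim(doc, query, class_weights):
    # doc and query: {concept_class: {term: weight}}, e.g. classes for
    # content terms, author names, bibliographic links. The overall
    # similarity is a weighted sum of per-class inner products.
    total = 0.0
    for cls, w in class_weights.items():
        d, q = doc.get(cls, {}), query.get(cls, {})
        total += w * sum(d.get(t, 0.0) * wt for t, wt in q.items())
    return total
```

A query holding only content terms contributes nothing from the other classes, which is why automatically populating those classes (as the paper does) can improve ranking.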
| Term Clustering of Syntactic Phrases | | BIBA | 385-404 | |
| David D. Lewis; W. Bruce Croft | |||
| Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations. In this paper we discuss our implementation of a syntactic phrase generator, as well as our preliminary experiments with producing phrase clusters. These experiments show small improvements in retrieval effectiveness resulting from the use of phrase clusters, but it is clear that corpora much larger than standard information retrieval test collections will be required to thoroughly evaluate the use of this technique. | |||
| Optimizations for Dynamic Inverted Index Maintenance | | BIBA | 405-411 | |
| Doug Cutting; Jan Pedersen | |||
| For free-text search over rapidly evolving corpora, dynamic update of inverted indices is a basic requirement, and B-trees are an effective tool for implementing such indices. The Zipfian distribution of postings suggests space and time optimizations unique to this task. In particular, we present two novel optimizations: merge update, which performs better than straightforward block update, and pulsing, which significantly reduces space requirements without sacrificing performance. | |||
| Partitioned Posting Files: A Parallel Inverted File Structure for Information Retrieval | | BIBA | 413-428 | |
| Craig Stanfill | |||
| This paper describes algorithms and data structures for applying a parallel computer to information retrieval. Previous work has described an implementation based on overlap encoded signatures. That system was limited by 1) the necessity of keeping the signatures in primary memory, and 2) the difficulties involved in implementing document-term weighting. Overcoming these limitations requires adapting the inverted index techniques used on serial machines. The most obvious adaptation, also previously described, suffers from the fact that data must be sent between processors at query-time. Since interprocessor communication is generally slower than local computation, this suggests that an algorithm which does not perform such communication might be faster. This paper presents a data structure, called a partitioned posting file, in which the interprocessor communication takes place at database-construction time, so that no data movement is needed at query-time. Algorithms for constructing the data structure are also described. Performance characteristics and storage overhead are established by benchmarking against a synthetic database. | |||
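The key property of a partitioned posting file (each processor holds the complete postings for its own documents, so a query is merely broadcast and scored locally) can be sketched as follows. The modulo document assignment and all names are illustrative assumptions, not the paper's construction algorithm:

```python
def build_shards(docs, n_procs):
    # docs: {doc_id: [terms]}. Each shard receives the full postings
    # for its own documents, so query evaluation moves no posting data
    # between processors.
    shards = [{} for _ in range(n_procs)]
    for doc_id, terms in docs.items():
        shard = shards[doc_id % n_procs]  # toy assignment policy
        for t in terms:
            shard.setdefault(t, []).append(doc_id)
    return shards

def query(shards, term):
    hits = []
    for shard in shards:  # each iteration is local work on one processor
        hits.extend(shard.get(term, []))
    return sorted(hits)
```

All interprocessor movement happens once, at database-construction time; at query time only the (small) query and the per-shard results cross processor boundaries.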
| Parallel Text Searching in Serial Files Using a Processor Farm | | BIBA | 429-453 | |
| Janey K. Cringean; Roger England; Gordon A. Manson; Peter Willett | |||
| This paper discusses the implementation of a parallel text retrieval system using a microprocessor network. The system is designed to allow fast searching in document databases organised using the serial file structure, with a very rapid initial text signature search being followed by a more detailed, but more time-consuming, pattern matching search. The network is built from transputers, high performance microprocessors developed specifically for the construction of highly parallel computing systems, which are linked together in a processor farm. The paper discusses the design and implementation of processor farms, and then reports our initial studies of the efficiency of searching that can be achieved using this approach to text retrieval from serial files. | |||
| An Architecture for Probabilistic Concept-Based Information Retrieval | | BIBA | 455-467 | |
| Robert M. Fung; Stuart L. Crawford; Lee A. Appelbaum; Richard M. Tong | |||
| While concept-based methods for information retrieval can provide improved
performance over more conventional techniques, they require large amounts of
effort to acquire the concepts and their qualitative and quantitative
relationships.
This paper discusses an architecture for probabilistic concept-based information retrieval which addresses the knowledge acquisition problem. The architecture makes use of the probabilistic networks technology for representing and reasoning about concepts and includes a knowledge acquisition component which partially automates the construction of concept knowledge bases from data. We describe two experiments that apply the architecture to the task of retrieving documents about terrorism from a set of documents from the Reuters news service. The experiments provide positive evidence that the architecture design is feasible and that there are advantages to concept-based methods. | |||
| A New Method for Information Retrieval, Based on the Theory of Relative Concentration | | BIBA | 469-493 | |
| L. Egghe | |||
| This paper introduces a new method for information retrieval of documents
that are represented by vectors. The novelty of the algorithm lies in the
fact that no (generalized) p-norms are used as a matching function between the
query and the document (as is done, e.g., by Salton and others), but rather a
function that measures the relative dispersion of the terms between a document
and a query. This function originates from an earlier paper by the author,
where a good measure of relative concentration was introduced and used in
informetrics to measure the degree of specialization of a journal with respect
to its entire subject.
This new information retrieval algorithm is shown to have many desirable properties (in the sense of the new Cater-Kraft wish list), including those of the original cosine matching function of Salton. In addition, the property of the cosine matching function that, if one uses only the weights 0 and 1, one is reduced to Boolean IR is refined, in the sense that the broadness or specialization of a document and a query is taken into consideration. Our new matching function satisfies these additional properties. | |||
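For reference, the cosine matching function of Salton, whose desirable properties the new measure is required to retain, is simply a normalized inner product over term weights. This sketch is the textbook formula, not Egghe's relative-concentration measure:

```python
from math import sqrt

def cosine(doc, query):
    # doc, query: {term: weight}; the classical Salton cosine match
    dot = sum(w * query.get(t, 0.0) for t, w in doc.items())
    nd = sqrt(sum(w * w for w in doc.values()))
    nq = sqrt(sum(w * w for w in query.values()))
    return dot / (nd * nq) if nd and nq else 0.0
```

With binary (0/1) weights this reduces to a set-overlap score, which is the Boolean-IR degeneration the abstract says the new dispersion-based function refines.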
| Extended Boolean Retrieval: A Heuristic Approach? | | BIBA | 495-508 | |
| Ronald Rousseau | |||
| We show that the similarity measures for p-norm retrieval, as defined by Salton, Fox, and Wu, have some undesirable mathematical properties. We propose a new function that remedies some of these drawbacks. Still, even for this new similarity measure, the extended Boolean model has some properties which can only be described as 'heuristic'. | |||
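The p-norm measures under discussion interpolate between vector-style averaging (p = 1) and strict Boolean logic (p → ∞). This unweighted sketch of the Salton-Fox-Wu formulas is included for orientation only; it is not Rousseau's proposed replacement function:

```python
def p_or(weights, p):
    # similarity of a document (term weights in [0, 1]) to an OR clause:
    # distance from the all-zeros point, normalized
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def p_and(weights, p):
    # similarity to an AND clause: one minus the normalized distance
    # from the all-ones point
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)
```

At p = 1 both collapse to the arithmetic mean of the weights; as p grows, `p_or` approaches max(weights) and `p_and` approaches min(weights), recovering strict Boolean behavior.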