
Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: W. Bruce Croft; C. J. van Rijsbergen
Location: Dublin, Ireland
Dates: 1994-Jul-03 to 1994-Jul-06
Standard No: ISBN 0-387-19889-X; ACM Order Number 606940; hcibib: IR94
  1. Text Categorisation
  2. Indexing
  3. User Modelling
  4. Theory and Logic
  5. Natural Language Processing
  6. Statistical Models
  7. Performance Evaluation
  8. Probabilistic Models
  9. Interfaces
  10. Routing
  11. Passage Retrieval
  12. Implementation
  13. Panel Sessions

Text Categorisation

A Sequential Algorithm for Training Text Classifiers 3-12
  David D. Lewis; William A. Gale
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
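A minimal sketch of the selection step behind uncertainty sampling, assuming the classifier outputs a probability of class membership for each unlabelled item. The function name, the 0.5 threshold for "most uncertain", and the sample scores are illustrative assumptions, not details taken from the paper.

```python
def uncertainty_sample(probabilities, batch_size):
    """Return the indices of the items the classifier is least sure about.

    Assumption: uncertainty is measured as the distance of the predicted
    probability from 0.5; the abstract does not specify the exact measure.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical classifier scores for five unlabelled documents.
scores = [0.02, 0.48, 0.91, 0.55, 0.30]
selected = uncertainty_sample(scores, 2)
```

Only the selected items would be passed to a human for labelling before the classifier is retrained, which is where the saving in manual classification effort comes from.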
Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval 13-22
  Yiming Yang
Expert Network (ExpNet) is our new approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection of the MEDLINE database, and observed a performance in recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than other methods tested. Computationally, ExpNet has an O(N log N) time complexity which is much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
Towards Language Independent Automated Learning of Text Categorisation Methods 23-30
  Chidanand Apte; Fred Damerau; Sholom M. Weiss
We describe the results of extensive machine learning experiments on large collections of Reuters' English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used for assignment of topics to the individual newswires. Our results with the English newswire collection show a very large gain in performance as compared to published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences in results that we have obtained for the two collections.
Using IR Techniques for Text Classification in Document Analysis 31-40
  Rainer Hoch
This paper presents the INFOCLAS system applying statistical methods of information retrieval for the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards the understanding of documents proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources including a letter database, word frequency statistics for German, lists of message type specific words, morphological knowledge as well as the underlying document structure. As output, the system evaluates a set of weighted hypotheses about the type of the actual letter. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.

Indexing

An Evaluation Method for Stemming Algorithms 42-50
  Chris D. Paice
The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections. This however does not provide any insights which might help in stemmer optimisation. This paper describes a method in which stemming performance is assessed against predefined concept groups in samples of words. This enables various indices of stemming performance and weight to be computed. Results are reported for three stemming algorithms. The validity and usefulness of the approach, and the problems of conceptual grouping, are discussed, and directions for further research are identified.
On the Measurement of Inter-Linker Consistency and Retrieval Effectiveness in Hypertext Databases 51-60
  David Ellis; Jonathan Furner-Hines; Peter Willett
An important stage in the process of retrieval of objects from a hypertext database is the creation of a set of inter-nodal links that are intended to represent the relationships existing between objects; this operation is often undertaken manually, just as index terms are often manually assigned to documents in a conventional retrieval system. In this paper, a study is reported in which several different sets of links were inserted, each by a different person, between the paragraphs of each of a number of full-text documents. The degree of similarity between the members of each pair of link-sets (i.e., the degree of inter-linker consistency) was then evaluated. The results indicated that little similarity existed amongst the link-sets, a finding that is comparable with those of studies of inter-indexer consistency, which suggest that there is generally only a low level of agreement between the sets of index terms assigned to a document by different indexers. These latter studies have historically been considered significant on account of their common assumption that there exists a positive relationship between recorded levels of inter-indexer consistency and the levels of retrieval effectiveness that may be achieved by the systems studied. In order to test the validity of making a similar assumption in the context of link-assignment, the paper continues with a description of an investigation into the nature of the relationship existing between (i) the levels of inter-linker consistency obtaining among the group of hypertext databases used in our earlier experiments and (ii) the levels of effectiveness of a number of searches carried out in those databases. An account is given of the implementation of the searches and of the methods used in the calculation of numerical values expressing their effectiveness, and conclusions are drawn regarding the consistency-effectiveness relationship.
Query Expansion Using Lexical-Semantic Relations 61-69
  Ellen M. Voorhees
Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand picked synonym sets has yet to be devised, and expanding by the synonym sets that are automatically generated can degrade retrieval performance.

User Modelling

Perceptual Speed, Learning and Information Retrieval Performance 71-80
  Bryce Allen
Although the cognitive ability "perceptual speed" is known to influence search performance by end-users, previous research has not established the mechanism by which this influence occurred. Results from educational psychology suggest that learning that occurs during searching is likely to be influenced by perceptual speed. An experiment was designed to test how this cognitive ability would interact with a system feature designed to enhance learning of search vocabulary, specifically, presenting subject descriptors as the first element in the display of a reference. Results showed significant interactions between perceptual speed and the order of presentation of data elements in predicting both vocabulary learning and search performance. Those results indicate that searchers with higher levels of perceptual speed will learn additional search vocabulary, and use that vocabulary to complete higher quality searches, when they use a system designed to optimize scanning of subject descriptors. This outcome supports the idea that cognitive abilities influence information system usability, and that usability is determined by interactions between characteristics of users and system features. The findings also suggest that system features that enhance the learning of search vocabulary, such as query expansion mechanisms, can have a significant positive effect on the quality of end-user searching.
Term Relevance Feedback and Query Expansion: Relation to Design 81-90
  Amanda Spink
To improve information retrieval effectiveness, research in both the algorithmic and human approach to query expansion is required. This paper uses the human approach to examine the selection and effectiveness of search term sources for query expansion. The results show that the most effective sources were the user's written question statement, user terms derived during the interaction, and terms selected from particular database fields. These findings indicate the need for the design and testing of automatic relevance feedback techniques that place greater emphasis on these sources.
Modelling Information Retrieval Agents with Belief Revision 91-100
  Brian Logan; Steven Reece; Karen Sparck Jones
This paper describes the development and computational testing of a model of the information intermediary based on an AI theory of belief revision. We describe the theoretical foundations of the work in a general account of the way an agent's beliefs and intentions are formed and modified, and in an analysis of the functional tasks an intermediary has to carry out; we indicate the specific developments required to automate and integrate both aspects of intermediary behaviour, as determinants of interactive dialogue with the user; and report, with illustrations, on tests and findings. The research shows that such approaches can be implemented in an essentially principled manner, though there are many large problems still to be overcome, and our experiments are only the first, extremely simple, trials of the basic strategy for intermediary simulation.
Polyrepresentation of Information Needs and Semantic Entities: Elements of a Cognitive Theory for Information Retrieval Interaction 101-110
  Peter Ingwersen
The paper outlines the principles underlying the theory of polyrepresentation applied to the user's cognitive space and the information space of IR systems, set in a cognitive framework. By means of polyrepresentation it is suggested to represent the current user's information need, problem state, and domain work task or interest in a structure of causality as well as to embody semantic full-text entities by means of the principle of 'intentional redundancy'. In IR systems this principle implies simultaneously applying different methods of representation and a variety of IR techniques of different cognitive origin to each entity. The objective is to bring text retrieval as close as possible to retrieval of information in a cognitive sense.

Theory and Logic

Investigating Aboutness Axioms using Information Fields 112-121
  P. D. Bruza; T. W. C. Huibers
This article proposes a framework, a so-called information field, which allows information retrieval mechanisms to be compared inductively instead of experimentally. Such a comparison occurs as follows: Both retrieval mechanisms are first mapped to an associated information field. Within the field, the axioms that drive the retrieval process can be filtered out. In this way, the implicit assumptions governing an information retrieval mechanism can be brought to light. The retrieval mechanisms can then be compared according to which axioms they are governed by. Using this method it is shown that Boolean retrieval is more powerful than a strict form of coordinate retrieval. The salient point is not this result in itself, but how the result was achieved.
A Probabilistic Terminological Logic for Modelling Information Retrieval 122-130
  Fabrizio Sebastiani
Some researchers have recently argued that the task of Information Retrieval (IR) may successfully be described by means of mathematical logic; accordingly, the relevance of a given document to a given information need should be assessed by checking the validity of the logical formula d -> n, where d is the representation of the document, n is the representation of the information need and "->" is the conditional connective of the logic in question. In a recent paper we have proposed Terminological Logics (TLs) as suitable logics for modelling IR within the paradigm described above. This proposal, however, while making a step towards adequately modeling IR in a logical way, does not account for the fact that the relevance of a document to an information need can only be assessed up to a limited degree of certainty. In this work, we try to overcome this limitation by introducing a model of IR based on a Probabilistic TL, i.e. a logic allowing the expression of real-valued terms representing probability values and possibly involving expressions of a TL. Two different types of probabilistic information, i.e. statistical information and information about degrees of belief, can be accounted for in this logic. The paper presents a formal syntax and a denotational (possible-worlds) semantics for this logic, and discusses, by means of a number of examples, its adequacy as a formal tool for describing IR.

Natural Language Processing

Retrieving Terms and their Variants in a Lexicalized Unification-Based Framework 132-141
  Christian Jacquemin; Jean Royaute
Term extraction is a major concern for information retrieval. Terms are not fixed forms and their variations prevent them from being identified by a match with their initial string or inflection. We show that a local syntactic approach to this problem can give good results for both the quality of identification and parsing time.
   A specific tool, FASTR, is developed which handles an identification of basic terms and a parser of their variations as well. Terms are described by logic rules automatically generated from terms and their categorial structure. Variations are represented by metarules. The parser efficiently processes large size corpora with big dictionaries and mixes lexical identification with local syntactic analysis. We evaluate the accuracy of results produced by these metarules and improve these results with filtering metarules.
Word Sense Disambiguation and Information Retrieval 142-151
  Mark Sanderson
It has often been thought that word sense ambiguity is a cause of poor performance in Information Retrieval (IR) systems. The belief is that if ambiguous words can be correctly disambiguated, IR performance will increase. However, recent research into the application of a word sense disambiguator to an IR system failed to show any performance increase. From these results it has become clear that more basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and IR.
   Using a technique that introduces additional sense ambiguity into a collection, this paper presents research that goes beyond previous work in this field to reveal the influence that ambiguity and disambiguation have on a probabilistic IR system. We conclude that word sense ambiguity is only problematic to an IR system when it is retrieving from very short queries. In addition we argue that if a word sense disambiguator is to be of any use to an IR system, the disambiguator must be able to resolve word senses to a high degree of accuracy.
A Full-Text Retrieval System with a Dynamic Abstract Generation Function 152-161
  Seiji Miike; Etsuo Itoh; Kenji Ono; Kazuo Sumita
We have developed a Japanese full-text retrieval system named BREVIDOC* that enables the user to specify an area within a text for abstraction and to control the volume of the abstract interactively. This system analyzes a document structure using linguistic knowledge only and thus is domain-independent. In its text structure analysis, the system determines relations among paragraphs and sentences, based on linguistic clues such as connectives, anaphoric expressions, and idiomatic expressions. The system analyzes and stores the text structure in advance so that it can generate an abstract in real time by selecting sentences according to the relative importance of rhetorical relations among the sentences. The retrieval system works on an engineering workstation.
   *Broadcatching System with an Essence Viewer for Retrieved Documents

Statistical Models

A Document Retrieval Model Based on Term Frequency Ranks 163-172
  IJsbrand Jan Aalbersberg
This paper introduces a new full-text document retrieval model that is based on comparing occurrence frequency rank numbers of terms in queries and documents.
   More precisely, to compute the similarity between a query and a document, this new model first ranks the terms in the query and in the document on decreasing occurrence frequency. Next, for each term, it computes a local similarity between the query and the document, by calculating a weighted difference between the term's rank number in the query and its rank number in the document. Finally, it collects all those local similarities and unifies them into one global similarity between the query and the document.
   In this paper we also demonstrate that the effectiveness of this new full-text document retrieval model is comparable with that of the standard vector-space retrieval model.
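The three steps described above (rank terms by frequency, score each term by its rank difference, combine the local scores) can be sketched as follows. The abstract does not give the weighting or the combination rule, so the local score 1 / (1 + |rank difference|), the averaging over query terms, and all function names are assumptions made purely for illustration.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each term to its rank, where rank 1 is the most frequent term.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: rank for rank, term in enumerate(ordered, start=1)}


def rank_similarity(query_tokens, doc_tokens):
    """Global similarity built from per-term rank differences.

    Assumed local score: 1 / (1 + |query rank - document rank|) for terms
    present in both; terms missing from the document contribute 0.
    Assumed combination: the mean local score over distinct query terms.
    """
    q_ranks = frequency_ranks(query_tokens)
    d_ranks = frequency_ranks(doc_tokens)
    if not q_ranks:
        return 0.0
    shared = set(q_ranks) & set(d_ranks)
    local = [1.0 / (1 + abs(q_ranks[t] - d_ranks[t])) for t in shared]
    return sum(local) / len(q_ranks)


same_order = rank_similarity(["a", "a", "b"], ["a", "a", "a", "b", "c"])
no_overlap = rank_similarity(["x"], ["y"])
```

Under these assumptions a document whose shared terms appear in the same frequency order as the query scores highest, and a document sharing no terms scores zero.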
Automatic Combination of Multiple Ranked Retrieval Systems 173-181
  Brian T. Bartell; Garrison W. Cottrell; Richard K. Belew
Retrieval performance can often be improved significantly by using a number of different retrieval algorithms and combining the results, in contrast to using just a single retrieval algorithm. This is because different retrieval algorithms, or retrieval experts, often emphasize different document and query features when determining relevance and therefore retrieve different sets of documents. However, it is unclear how the different experts are to be combined, in general, to yield a superior overall estimate. We propose a method by which the relevance estimates made by different experts can be automatically combined to result in superior retrieval performance. We apply the method to two expert combination tasks. The applications demonstrate that the method can identify high performance combinations of experts and also is a novel means for determining the combined effectiveness of experts.
Properties of Extended Boolean Models in Information Retrieval 182-190
  Joon Ho Lee
The conventional boolean retrieval system does not provide ranked retrieval output because it cannot compute similarity coefficients between queries and documents. Extended boolean models such as fuzzy set, Waller-Kraft, Paice, P-Norm and Infinite-One have been proposed in the past to support a ranking facility for the boolean retrieval system. In this paper, we analyze the behavioural aspects of the previous extended boolean models and address important mathematical properties that affect retrieval effectiveness. We concentrate our description on evaluation formulas for AND and OR operations and query weights. Our analyses show that P-Norm is the most suitable for achieving high retrieval effectiveness.

Performance Evaluation

OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research 192-201
  William Hersh; Chris Buckley; T. J. Leone; David Hickman
A series of information retrieval experiments was carried out with a computer installed in a medical practice setting for relatively inexperienced physician end-users. Using a commercial MEDLINE product based on the vector space model, these physicians searched just as effectively as more experienced searchers using Boolean searching. The results of this experiment were subsequently used to create a new large medical test collection, which was used in experiments with the SMART retrieval system to obtain baseline performance data as well as compare SMART with the other searchers.
Results of Applying Probabilistic IR to OCR Text 202-211
  Kazem Taghva; Julie Borsack; Allen Condit
Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In the broader sense, another fundamental measure of an OCR's goodness is whether its generated text is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system. We compare these retrieval results to their manually corrected equivalent. We show there is no statistical difference in precision and recall using graded accuracy levels from three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model. In particular, we found individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.
Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance 212-220
  Howard Turtle
The results of experiments comparing the relative performance of natural language and Boolean query formulations are presented. The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials. Methodological issues are reviewed and the effect of database size on query formulation strategy is discussed.

Probabilistic Models

Inferring Probability of Relevance Using the Method of Logistic Regression 222-231
  Fredric C. Gey
This research evaluates a model for probabilistic text and document retrieval; the model utilizes the technique of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections, with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the particular vector space model of retrieval which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The differences in performances of the two models were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.
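The standardization step mentioned in the abstract (rescaling each statistical clue to zero mean and unit standard deviation) can be illustrated with a short sketch. The function name, the use of the population standard deviation, and the sample values are assumptions; the abstract does not say how the paper estimates the mean and deviation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Assumption: the population standard deviation is used. A constant
    input has no spread, so it is mapped to all zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values of one clue across eight documents.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Because every collection's clues are put on the same scale, coefficients fitted on one collection can in principle be reused on another, which is the transfer property the abstract describes.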
Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval 232-241
  S. E. Robertson; S. Walker
The 2-Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are within-document term frequency, document length, and within-query term frequency. Simple weighting functions are developed, and tested on the TREC test collection. Considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated.
The Formalism of Probability Theory in IR: A Foundation or an Encumbrance? 242-247
  Wm. S. Cooper
Probabilistic theories of retrieval bring to bear on the information search problem a high degree of theoretical coherence and deductive power. In principle, this power ought to be an invaluable asset. In practice, it has turned out to be a mixed blessing. The question considered here is whether the trappings of the probabilistic formalism strengthen or encumber IR research on balance.
Note: Triennial ACM-SIGIR Award Paper

Interfaces

LyberWorld -- A Visualization User Interface Supporting Fulltext Retrieval 249-259
  Matthias Hemmje; Clemens Kunkel; Alexander Willett
LyberWorld is a prototype IR user interface. It implements visualizations of an abstract information space -- fulltext. The paper derives a model for such visualizations and an exemplar user interface design is implemented for the probabilistic fulltext retrieval system INQUERY. Visualizations are used to communicate information search and browsing activities in a natural way by applying metaphors of spatial navigation in abstract information spaces. Visualization tools for exploring information spaces and judging relevance of information items are introduced and an example session demonstrates the prototype. The presence of a spatial model in the user's mind and interaction with a system's corresponding display methods is regarded as an essential contribution towards natural interaction and reduction of cognitive costs during e.g. query construction, orientation within the database content, relevance judgement and orientation within the retrieval context.
A System for Discovering Relationships by Feature Extraction from Text Databases 260-270
  Jack G. Conrad; Mary Hunter Utt
A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. The series of tests run on the Associations System demonstrate that feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures of recall and precision, evaluation measures are currently being studied which will indicate the usefulness of the relationships identified, in various domain-specific contexts.

Routing

Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval 272-281
  Masahiro Morita; Yoichi Shinoda
Information filtering systems have potential power that may provide an efficient means of navigating through large and diverse data space. However, current information filtering technology heavily depends on a user's active participation for describing the user's interest in information items, forcing the user to accept extra load to overcome the already loaded situation. Furthermore, because the user's interests are often expressed in discrete format such as a set of keywords sometimes augmented with if-then rules, it is difficult to express ambiguous interests, which users often want to do. We propose a technique that uses user behavior monitoring to transparently capture the user's interest in information, and a technique to use this interest to filter incoming information in a very efficient way. The proposed techniques were verified to perform very well in a field experiment and a series of simulations.
Improving Text Retrieval for the Routing Problem using Latent Semantic Indexing 282-291
  David Hull
Latent Semantic Indexing (LSI) is a novel approach to information retrieval that attempts to model the underlying structure of term associations by transforming the traditional representation of documents as vectors of weighted term frequencies to a new coordinate space where both documents and terms are represented as linear combinations of underlying semantic factors. In previous research, LSI has produced a small improvement in retrieval performance. In this paper, we apply LSI to the routing task, which operates under the assumption that a sample of relevant and non-relevant documents is available to use in constructing the query. Once again, LSI slightly improves performance. However, when LSI is used in conjunction with statistical classification, there is a dramatic improvement in performance.
The Effect of Adding Relevance Information in a Relevance Feedback Environment 292-300
  Chris Buckley; Gerard Salton; James Allan
The effects of adding information from relevant documents are examined in the TREC routing environment. A modified Rocchio relevance feedback approach is used, with a varying number of relevant documents retrieved by an initial SMART search, and a varying number of terms from those relevant documents used to expand the initial query. Recall-precision evaluation reveals that as the amount of expansion of the query due to adding terms from relevant documents increases, so does the effectiveness. There appears to be a linear relationship between the log of the number of terms added and the recall-precision effectiveness. There also appears to be a linear relationship between the log of the number of known relevant documents and the recall-precision effectiveness.

Passage Retrieval

Passage-Level Evidence in Document Retrieval 302-310
  James P. Callan
The increasing lengths of documents in full-text collections encourage renewed interest in the ranking and retrieval of document passages. Past research showed that evidence from passages can improve retrieval results, but it also raised questions about how passages are defined, how they can be ranked efficiently, and what their proper role is in long, structured documents.
   This paper reports on experiments with passages in INQUERY, a probabilistic information retrieval system. Experiments were conducted with passages based on paragraphs, and with passages based on text windows of various sizes. Experimental results are given for three homogeneous and two heterogeneous document collections, ranging in size from three megabytes to two gigabytes.
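To make the notion of window-based passages concrete, the following is a minimal sketch of splitting a tokenized document into fixed-size text windows. The function name `text_windows` and the `step` parameter (which allows overlapping windows) are assumptions for illustration; the abstract does not say how INQUERY constructs its windows.

```python
def text_windows(tokens, size, step=None):
    """Split a token list into windows of at most `size` tokens.

    step -- distance between window starts (assumed parameter);
            defaults to `size`, which gives non-overlapping windows.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if step is None:
        step = size
    if step <= 0:
        raise ValueError("step must be positive")
    # The final window may be shorter than `size`.
    return [tokens[start:start + size] for start in range(0, len(tokens), step)]
```

Each window can then be scored against the query like a small document, and the best window score used as passage-level evidence for the document.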
Effective Retrieval of Structured Documents BIBAPDF 311-317
  Ross Wilkinson
Information systems usually retrieve whole documents as answers to queries. However, it may in some circumstances be more appropriate to retrieve parts of documents. We consider formulas for retrieving whole documents and parts of documents from a large structured document collection. We consider what information is needed to retrieve effectively and show that knowledge of the structure of documents can lead to improved retrieval performance.
Document and Passage Retrieval Based on Hidden Markov Models BIBAPDF 318-327
  Elke Mittendorf; Peter Schauble
Introduced is a new approach to Information Retrieval developed on the basis of Hidden Markov Models (HMMs). HMMs are shown to provide a mathematically sound framework for retrieving documents -- documents with predefined boundaries and also entities of information that are of arbitrary lengths and formats (passage retrieval). Our retrieval model is shown to encompass promising capabilities: First, the position of occurrences of indexing features can be used for indexing. Positional information is essential, for instance, when considering phrases, negation, and the proximity of features. Second, from training collections we can derive automatically optimal weights for arbitrary features. Third, a query dependent structure can be determined for every document by segmenting the documents into passages that are either relevant or irrelevant to the query. The theoretical analysis of our retrieval model is complemented by the results of preliminary experiments.

Implementation

Synthetic Workload Performance Analysis of Incremental Updates BIBAPDF 329-338
  Kurt Shoens; Anthony Tomasic; Hector Garcia-Molina
Declining disk and CPU costs have kindled a renewed interest in efficient document indexing techniques. In this paper, the problem of incremental updates of inverted lists is addressed using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. The behavior of this index is studied with the use of a synthetically-generated document collection and a simulation model of the algorithm. The index structure is shown to support rapid insertion of documents, fast queries, and to scale well to large document collections and many disks.
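The idea of keeping short and long inverted lists in separate structures can be sketched in memory as follows. This is only an illustration under stated assumptions: the class name `DualStructureIndex`, the fixed `threshold`, and the use of Python dicts are hypothetical, and the on-disk layout and update policies studied in the paper are not modeled.

```python
class DualStructureIndex:
    """Toy inverted index that separates short and long posting lists."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # assumed cutoff between short and long lists
        self.short = {}  # term -> posting list, for infrequent terms
        self.long = {}   # term -> posting list, for frequent terms

    def add_document(self, doc_id, terms):
        """Incrementally add one document to the index."""
        for term in set(terms):
            if term in self.long:
                self.long[term].append(doc_id)
                continue
            postings = self.short.setdefault(term, [])
            postings.append(doc_id)
            # Migrate a list to the long structure once it outgrows the cutoff.
            if len(postings) > self.threshold:
                self.long[term] = self.short.pop(term)

    def postings(self, term):
        """Return the posting list for a term, wherever it is stored."""
        if term in self.long:
            return self.long[term]
        return self.short.get(term, [])
```

Separating the two kinds of list lets each be handled by a policy suited to its size, which is the property the abstract attributes to the dual-structure index.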
Document Filtering for Fast Ranking BIBAPDF 339-348
  Michael Persin
Ranking techniques are effective for finding answers in document collections but the cost of evaluation of ranked queries can be unacceptably high. We propose an evaluation technique that reduces both main memory usage and query evaluation time based on early recognition of which documents are likely to be highly ranked. Our experiments show that, for our test data, the proposed technique evaluates queries in 20% of the time and 2% of the memory taken by the standard inverted file implementation, without degradation in retrieval effectiveness.
Adapting a Full-Text Information Retrieval System to the Computer Troubleshooting Domain BIBAPDF 349-358
  Peter G. Anick
There has been much research in full-text information retrieval on automated and semi-automated methods of query expansion to improve the effectiveness of user queries. In this paper we consider the challenges of tuning an IR system to the domain of computer troubleshooting, where user queries tend to be very short and natural language query terms are intermixed with terminology from a variety of technical sublanguages. A number of heuristic techniques for domain knowledge acquisition are described in which the complementary contributions of query log data and corpus analysis are exploited. We discuss the implications of sublanguage domain tuning for run-time query expansion tools and document indexing, arguing that the conventional devices for more purely "natural language" domains may be inadequate.

Panel Sessions

Integration of Information Retrieval and Database Systems BIBAPDF 360
  Norbert Fuhr; Ray R. Larson; Peter Schauble; Joachim W. Schmidt; Ulrich Thiel
The panelists will report on their current work as well as on the experience they have gained in the following projects.
  • Lassell/SEQUOIA 2000
  • MIND
  • Tycoon
Among others, the following issues will be discussed.
  1. Extensible database systems supporting text retrieval by means of user-defined functions and/or triggers.
  2. Data models and retrieval models for semistructured data incorporating best-match retrieval and exact-match retrieval as special cases.
Evaluating Interactive Retrieval Systems BIBAPDF 361
  Susan Dumais; Nicholas Belkin; Christine L. Borgman; Micheline Hancock-Beaulieu
Most current information retrieval systems are highly interactive. Users ask queries, get immediate feedback, refine their queries, and so on. Methods for evaluating these dynamic systems have not kept pace with the rapid advances in system design. It is no longer enough to use the standard precision-recall measures to evaluate and to improve interactive retrieval systems. There is often no single final query to evaluate, with useful information being gathered from many different queries along the way. In addition, interfaces play a critical role in building effective retrieval systems. The best retrieval algorithm can be rendered functionally useless if the interface to it is unusable. Conversely, of course, the spiffiest new interface is not worth much without a good retrieval engine behind it. It would be easy if one could study interfaces and retrieval engines separately and take the best of both worlds. Unfortunately, there are important interactions that cannot be evaluated by studying components in isolation -- e.g., how do you incorporate ranking or relevance feedback for a Boolean retrieval engine, or how do you highlight matching terms if complex syntactic and semantic processing of queries is used? The design of effective interactive retrieval environments will require careful attention to the larger human - interface - retrieval-engine system.
   Systematic, generalizable evaluations of these larger interactive systems are possible both in the laboratory and in the field. The panelists will describe interactive retrieval experiments and experiences, focusing on: a) why it is important to study interactions, b) how interactive retrieval performance should be measured, and c) how the methods for evaluation and findings generalize to other systems. Belkin will begin with an overview of some of the problems in evaluating interactive retrieval systems and will present a new framework characterizing IR as interaction with text. The remaining talks will describe end-user experiments involving highly interactive retrieval systems. The focus of these talks will be on the approaches, methods and instruments used to evaluate retrieval effectiveness and ease of use as well as the relationship between system functionality and the interface. Dumais will describe the importance of interfaces in retrieval, and will present examples of successful iterative interface design with the SuperBook and X-LSI systems. Hancock-Beaulieu will review a series of experiments on the Okapi system to systematically examine the effectiveness of different retrieval aids. Borgman will describe the multiple evaluation methods employed to study children's information-seeking behavior using the Science Library Catalog, a graphical browsing system supplemented by keyword searching tailored to children's skills, and attempts to generalize these evaluation methods to other IR environments.