Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Nicholas J. Belkin; A. Desai Narasimhalu; Peter Willett
Location: Philadelphia, Pennsylvania
Dates: 1997-Jul-27 to 1997-Jul-31
Standard No: ISBN 0-89791-836-3; ACM Order Number 606970; hcibib: IR97
  1. Gerard Salton Award for Excellence in Research in Information Retrieval: Acceptance Speech
  2. Keynote Address
  3. Relevance Feedback
  4. Chinese Language Retrieval
  5. Classification Methods
  6. Cross-Language Retrieval
  7. Formal Models
  8. Natural Language Processing
  9. Text Structures
  10. User Issues 1
  11. Asian Languages
  12. User Issues 2
  13. Combination Techniques
  14. Image Retrieval
  15. Query Expansion
  16. Panel Session
  17. Posters
  18. Tutorials: Descriptions
  19. Pre-Conference Workshop
  20. Post-Conference Workshops

Gerard Salton Award for Excellence in Research in Information Retrieval: Acceptance Speech

USERS LOST: Reflections on the Past, Future and Limits of Information Science (pp. 1-2)
  Tefko Saracevic

Keynote Address

Information Retrieval with a Dictionary (p. 3)
  George A. Miller

Relevance Feedback

Fast and Effective Query Refinement (pp. 6-15)
  Bienvenido Velez; Ron Weiss; Mark A. Sheldon; David K. Gifford
Query Refinement is an essential information retrieval tool that interactively recommends new terms related to a particular query. This paper introduces concept recall, an experimental measure of an algorithm's ability to suggest terms humans have judged to be semantically related to an information need. This study uses precision improvement experiments to measure the ability of an algorithm to produce single term query modifications that predict a user's information need as partially encoded by the query. An oracle algorithm produces ideal query modifications, providing a meaningful context for interpreting precision improvement results.
   This study also introduces RMAP, a fast and practical query refinement algorithm that refines multiple term queries by dynamically combining precomputed suggestions for single term queries. RMAP achieves accuracy comparable to a much slower algorithm, although both RMAP and the slower algorithm lag behind the best possible term suggestions offered by the oracle. We believe RMAP is fast enough to be integrated into present day Internet search engines: RMAP computes 100 term suggestions for a 160,000 document collection in 15 ms on a low-end PC.
On Relevance Weights with Little Relevance Information (pp. 16-24)
  S. E. Robertson; S. Walker
The relationship between the Robertson/Sparck Jones relevance weighting formula and the Croft/Harper version for no relevance information is discussed. A method of avoiding the negative weights sometimes implied by the Croft/Harper version is proposed, which turns out to involve a return to the original Sparck Jones inverse collection frequency weight. The paper then goes on to propose a new way of using small amounts of relevance information in the estimation of relevance weights. Some experiments using TREC data are reported.
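The negative-weight issue mentioned in this abstract can be illustrated with a small sketch. It assumes the commonly cited form of the Robertson/Sparck Jones weight with a 0.5 correction; the function names and the example collection sizes are hypothetical, and the paper's own formulation and proposed estimation method may differ. With no relevance information (R = r = 0), the corrected weight reduces to log((N - n + 0.5) / (n + 0.5)), which goes negative for terms occurring in more than half the collection, whereas the plain inverse collection frequency weight log(N / n) does not.

```python
import math

def rsj_weight(N, n, R=0, r=0):
    """Robertson/Sparck Jones relevance weight with the usual 0.5 correction.

    N: documents in the collection, n: documents containing the term,
    R: known relevant documents, r: relevant documents containing the term.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def icf_weight(N, n):
    """Inverse collection frequency weight log(N / n); non-negative for n <= N."""
    return math.log(N / n)

# Hypothetical collection of 1000 documents, no relevance information.
N = 1000
for n in (10, 800):
    print(n, round(rsj_weight(N, n), 3), round(icf_weight(N, n), 3))
```

For the frequent term (n = 800) the first weight is negative while the second stays positive, which is the behaviour the abstract refers to.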
Learning Routing Queries in a Query Zone (pp. 25-32)
  Amit Singhal; Mandar Mitra; Chris Buckley
Word usage is domain dependent. A common word in one domain can be quite infrequent in another. In this study we exploit this property of word usage to improve document routing. We show that routing queries (profiles) learned only from the documents in a query domain are better than the routing profiles learned when query domains are not used. We approximate a query domain by a query zone. Experiments show that routing profiles learned from a query zone are 8-12% more effective than the profiles generated when no query zoning is used.

Chinese Language Retrieval

Comparing Representations in Chinese Information Retrieval (pp. 34-41)
  K. L. Kwok
Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, with 28 queries that are long and rich in wording. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevants retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach.
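As a rough illustration of the 1-gram and overlapping bigram representations described above, the following sketch extracts character n-grams from a string. The function name and the sample text are assumptions for illustration only; the paper's indexing pipeline (stop lists, handling of punctuation, short-word segmentation) is not reproduced here.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

sample = "信息检索"  # hypothetical example string ("information retrieval")
print(char_ngrams(sample, 1))  # single-character index terms
print(char_ngrams(sample, 2))  # overlapping bigram index terms
```

A string of k characters yields k single-character terms but only k - 1 bigram occurrences; the larger index term space noted in the abstract comes from the much larger number of distinct bigrams across a collection.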
Chinese Text Retrieval Without Using a Dictionary (pp. 42-49)
  Aitao Chen; Jianzhang He; Liangjie Xu; Fredric C. Gey; Jason Meggs
It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that it is essential to have a comprehensive Chinese dictionary or lexicon for Chinese text retrieval systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds word-based indexes needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods to study the effect on retrieval effectiveness of different segmentation methods using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, the simple bigram indexing and the purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
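The dictionary-based maximum matching baseline named in this abstract can be sketched as a greedy forward scan. This is a minimal illustration under stated assumptions: the function name, the `max_len` limit, the single-character fallback, and the toy dictionary are all hypothetical, and the authors' implementation may differ in direction of scan and handling of unknown strings.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dictionary = {"信息", "检索", "系统"}  # hypothetical three-word lexicon
print(max_match("信息检索系统", toy_dictionary))
```

Because the method depends entirely on dictionary coverage, any word missing from the lexicon is split into single characters, which is one reason dictionary-free alternatives are of interest.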
PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval (pp. 50-58)
  Lee-Feng Chien
Considering the urgent need to promote Chinese Information Retrieval, in this paper we highlight the significance of keyword extraction using a new PAT-tree-based approach, which efficiently extracts keywords automatically from a set of relevant Chinese documents. This approach has been successfully applied in several areas of IR research, such as document classification, book indexing and relevance feedback. Many Chinese language processing applications can therefore advance from the character level to the word/phrase level.

Classification Methods

Almost-Constant-Time Clustering of Arbitrary Corpus Subsets (pp. 60-66)
  Craig Silverstein; Jan O. Pedersen
Methods exist for constant-time clustering of corpus subsets selected via Scatter/Gather browsing [3]. In this paper we expand on those techniques, giving an algorithm for almost-constant-time clustering of arbitrary corpus subsets. This algorithm is never slower than clustering the document set from scratch, and for medium-sized and large sets it is significantly faster. This algorithm is useful for clustering arbitrary subsets of large corpora -- obtained, for instance, by a boolean search -- quickly enough to be useful in an interactive setting.
Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization (pp. 67-73)
  Hwee Tou Ng; Wei Boon Goh; Kok Leong Low
In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection. In particular, our new feature selection method yields considerable improvement.
   We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to a rule-based, expert system approach that uses a text categorization shell built by Carnegie Group. Although our automated learning approach still gives a lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined semi-automated approach yields accuracy close to the rule-based approach.
Projections for Efficient Document Clustering (pp. 74-81)
  Hinrich Schütze; Craig Silverstein
Clustering is increasing in importance but linear- and even constant-time clustering algorithms are often too slow for real-time applications. A simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. We study two techniques for improving the cost of distance calculations, LSI and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. We find that the speed increase is significant while -- surprisingly -- the quality of clustering is not adversely affected. We conclude that truncation yields clusters as good as those produced by full-profile clustering while offering a significant speed advantage.

Cross-Language Retrieval

Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval (pp. 84-91)
  Lisa Ballesteros; W. Bruce Croft
Dictionary methods for cross-language information retrieval give performance below that for mono-lingual retrieval. Failure to translate multi-term phrases has been shown to be one of the factors responsible for the errors associated with dictionary methods. First, we study the importance of phrasal translation for this approach. Second, we explore the role of phrases in query expansion via local context analysis and local feedback and show how they can be used to significantly reduce the error associated with automatic dictionary translation.
QUILT: Implementing a Large-Scale Cross-Language Text Retrieval System (pp. 92-98)
  Mark W. Davis; William C. Ogden
QUILT (Query User Interface with Light Translations) is a prototype implementation of a complete cross-language text retrieval system that takes English queries and produces English gloss translations of Spanish documents. The system indexes the Spanish documents in Spanish, but converts the English query into a Spanish equivalent set through a novel combination of lexical methods and parallel-corpus disambiguation. Similar methods are applied to the returned document to produce a simple translation that can be examined by non-Spanish speakers to gauge the relevance of the document to the original English query. The system integrates traditional, glossary-based machine translation technology with information retrieval approaches and demonstrates that relatively simple term substitution and disambiguation approaches can be viable for cross-language text retrieval.
Cross-Language Speech Retrieval: Establishing a Baseline Performance (pp. 99-108)
  Páraic Sheridan; Martin Wechsler; Peter Schäuble
We present here the realisation of a cross-language speech retrieval system which retrieves German speech documents in response to user queries specified as French text. This has been achieved through the integration of two existing modules of the SPIDER information retrieval system, namely the query pseudo-translation module and the speech retrieval module. Our approach to cross-language retrieval uses an automatically constructed corpus-based information structure called a similarity thesaurus. A similarity thesaurus can be constructed over any loosely comparable corpus -- a parallel corpus is not necessary. The similarity thesaurus used here was constructed over a 330 MByte corpus of comparable German and French news stories. Our speech retrieval module is based on a speaker-independent phoneme recognizer and it indexes speech documents by N-grams of phonemic features. The speech retrieval module includes an additional probabilistic matching technique designed to aid retrieval from erroneous data such as the phonemic output of the speech recognition process. We have evaluated our cross-language speech retrieval system over a collection of 30 hours (3.4 GBytes) of German speech, comparing the effectiveness of French queries (cross-language) against performance on equivalent German queries (mono-lingual). It must be stressed that this work represents our first step in the direction of cross-language speech retrieval. Our aim here is to establish a baseline of performance on this task, against which we can then measure the success of our continuing research in this area.

Formal Models

Dempster-Shafer's Theory of Evidence Applied to Structured Documents: Modelling Uncertainty (pp. 110-118)
  Mounia Lalmas
Documents often display a structure determined by the author, e.g., several chapters, each with several sub-chapters and so on. Taking into account the structure of a document allows the retrieval process to focus on those parts of the documents that are most relevant to an information need. Chiaramella et al. advanced a model for indexing and retrieving structured documents. Their aim was to express the model within a framework based on formal logics with associated theories. They developed the logical formalism of the model. This paper adds to this model a theory of uncertainty, the Dempster-Shafer theory of evidence. It is shown that the theory provides a rule, Dempster's combination rule, that allows the expression of the uncertainty with respect to parts of a document, and that is compatible with the logical model developed by Chiaramella et al.
Computationally Tractable Probabilistic Modelling of Boolean Operators (pp. 119-128)
  Warren R. Greiff; W. Bruce Croft; Howard Turtle
The inference network model of information retrieval allows for a probabilistic interpretation of Boolean query operators. Prior work has shown, however, that these operators do not perform as well as the pnorm operators developed in the context of the vector space model. The design of alternative operators in the inference network framework must contend with the issue of computational tractability. We define a flexible class of link matrices that are natural candidates for the implementation of Boolean operators and an O(n^2) algorithm for the computation of probabilities involving link matrices of this class. We present experimental results indicating that Boolean operators implemented in terms of link matrices from this class perform as well as pnorm operators.
A Method for Monolingual Thesauri Merging (pp. 129-138)
  Marios Sintichakis; Panos Constantopoulos
Thesauri merging is the activity of consolidating a set of thesauri into a thesaurus which accommodates the vocabularies and the structure of all thesauri being merged. In this paper, we introduce a general framework for monolingual thesauri merging. We also present a domain independent set-theoretic model for the representation of terms, relationships, and integrity constraints. Finally, we present a method for the merging of monolingual thesauri focusing on its mechanisms for the detection of equivalent terms among the thesauri being merged. Our method expands previous work on the problem; we introduce equivalence assumptions that express similarity between terms and we propose a term distance model which can be used to guide the confirmation or rejection of equivalence assumptions.

Natural Language Processing

Textual Context Analysis for Information Retrieval (pp. 140-147)
  Mark A. Stairmand
We describe four applications of QUESCOT, a program which analyses and quantifies textual contexts in documents with reference to the WordNet database, and hence ascertains the dominance of topics in a document. Our analysis is based on previous work in lexical cohesion, a feature of texts which contributes to their functioning as a coherent unit. The applications are diverse, but all pertain to information retrieval. Whilst our results suggest that QUESCOT is not well suited to word sense disambiguation and text segmentation, our experimental IR system using QUESCOT as an indexing component produces promising results. We also used QUESCOT representations to automatically generate a resource to supplement WordNet, based on collocational relations between concepts in a document collection. We conclude that QUESCOT is suited to applications based on document-level descriptions, where the degree of granularity allows inaccuracies to be smoothed out.
Effective Use of Natural Language Processing Techniques for Automatic Conflation of Multi-Word Terms: The Role of Derivational Morphology, Part of Speech Tagging, and Shallow Parsing (pp. 148-155)
  Evelyne Tzoukermann; Judith L. Klavans; Christian Jacquemin
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the problems of degradation typically associated with derivational analysis and generation. The successful expansion, and thus conflation, of terms increases indexing coverage by up to 30%, with precision of nearly 90% for correct identification of related terms. The fully implemented system is described with particular attention to the role of derivational morphology and phrasal relations. Results and evaluation are presented in terms of precision and recall, with an analysis and discussion of errors. This paper illustrates how natural language processing (NLP) tools, when combined effectively for tasks to which they are especially suited, show the potential of NLP techniques in information retrieval (IR).
Guessing Morphology from Terms and Corpora (pp. 156-165)
  Christian Jacquemin
This study proposes an algorithm for automatically acquiring morphological links between words. This algorithm relies on the concurrent use of a corpus and a list of multi-word terms, and does not require any prior linguistic knowledge. The four steps of the algorithm are (1) single-word truncation, (2) conflation of multi-word terms, (3) classification and filtering, and (4) clustering of conflation classes. At each step a precise evaluation is performed in order to choose the optimal parameters. The final results indicate a clustering of 45% of the classes with a precision of 87%. The derivational knowledge acquired through this method can be used for conceiving a domain-oriented stemmer for scientific and technical corpora.

Text Structures

Optimal Demand-Oriented Topology for Hypertext Systems (pp. 168-177)
  Scott Aaronson
This paper proposes an algorithm to aid in the design of hypertext systems. A numerical index is presented for rating the organizational efficiency of hypertexts based on (1) user demand for pages, (2) the relevance of pages to one another, and (3) the probability that users can navigate along hypertext paths without getting lost. Maximizing this index under constraints on the number of links is proven NP-complete, and a genetic algorithm is used to search for the optimal link topology. An experiment with computer users provides evidence that a numerical measure of hypertext efficiency might have practical value.
Passage Retrieval Revisited (pp. 178-185)
  Marcin Kaszkiel; Justin Zobel
Ranking based on passages addresses some of the shortcomings of whole-document ranking. It provides convenient units of text to return to the user, avoids the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material amongst otherwise irrelevant text. In this paper we explore the potential of passage retrieval, based on an experimental evaluation of the ability of passages to identify relevant documents. We compare our scheme of arbitrary passage retrieval to several other document retrieval and passage retrieval methods; we show experimentally that, compared to these methods, ranking via fixed-length passages is robust and effective. Our experiments also show that, compared to whole-document ranking, ranking via fixed-length arbitrary passages significantly improves retrieval effectiveness, by 8% for TREC disks 2 and 4 and by 18%-37% for the Federal Register collection.
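The idea of ranking by fixed-length passages that may start at any word can be sketched with a sliding window. This is only an illustration under simplifying assumptions: the function name is hypothetical, and the score used here is a plain count of query-term occurrences, a stand-in for whatever similarity function the paper actually applies to each passage.

```python
def best_passage(doc_words, query_terms, length):
    """Score every window of `length` consecutive words by the number of
    query-term occurrences it contains; return (best score, start index)."""
    query = set(query_terms)
    hits = [1 if w in query else 0 for w in doc_words]
    window = sum(hits[:length])
    best_score, best_start = window, 0
    # Slide the window one word at a time, updating the count incrementally.
    for start in range(1, max(1, len(hits) - length + 1)):
        window += hits[start + length - 1] - hits[start - 1]
        if window > best_score:
            best_score, best_start = window, start
    return best_score, best_start

doc = "the cat sat passage retrieval works well here".split()
print(best_passage(doc, ["passage", "retrieval"], 3))
```

A document can then be ranked by the score of its best passage, so that a short relevant block inside a long, mostly irrelevant document is not diluted by the surrounding text.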
Exploration of Text Collections with Hierarchical Feature Maps (pp. 186-195)
  Dieter Merkl
Document classification is one of the central issues in information retrieval research. The aim is to uncover similarities between text documents. In other words, classification techniques are used to gain insight into the structure of the various data items contained in the text archive. In this paper we show the results from using a hierarchy of self-organizing maps to perform the text classification task. Each of the individual self-organizing maps is trained independently and gets specialized to a subset of the input data. As a consequence, the choice of this particular artificial neural network model enables the true establishment of a document taxonomy. The benefit of this approach is a straightforward representation of document similarities combined with dramatically reduced training time. In particular, the hierarchical representation of document collections is appealing because it is the underlying organizational principle in use by librarians, providing the necessary familiarity for the user. The massive reduction in the time needed to train the artificial neural network, together with its highly accurate clustering results, makes it a challenging alternative to conventional approaches.

User Issues 1

Users' Perception of the Performance of a Filtering System (pp. 198-205)
  Raya Fidel; Michael Crandall
Although filtering of electronic information is spreading rapidly, very few studies have examined users' perceptions about the success of filtering. Users at the Boeing Company participated in a study which collected data through observation, verbal protocols, questionnaire, and interviews. Data analysis used four levels of relevance to assess the importance, and frequency of use, of thirteen criteria for relevance, and fourteen for non-relevance, that are not topics or subject matters. Results showed that perceived precision ratios for filtered information were higher than the ratios for non-filtered information, but not significantly so, and could still be improved even though most respondents were satisfied with these ratios. Developing methods to create and maintain useful profiles, and finding ways to incorporate relevance as well as non-relevance criteria into profiles, are necessary to improve the performance of filtering mechanisms.
Time, Relevance and Interaction Modelling for Information Retrieval (pp. 206-213)
  M. D. Dunlop
The most common method for assessing the worth of an information retrieval (IR) system is through precision and recall graphs. These graphs show how precise an IR engine is when working at fixed levels of recall. This paper introduces number-to-view graphs, a new graphing method based on an early evaluation measure, which supplement precision-recall graphs by plotting the number of relevant documents a user wishes against the number of documents they would have to view to encounter them. The paper also proposes a step forward from number-to-view graphs that directly includes presentation, interface and temporal issues within the same framework as engine effectiveness: time-to-view graphs. Taken together, these graphs and models introduce a new evaluation approach called Expected Search Duration.
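The data behind a number-to-view graph, as described in this abstract, can be sketched for a single ranked list. The function name and the sample relevance judgements are hypothetical, and the paper may aggregate over queries or treat unreachable targets differently; this only shows the basic relationship being plotted.

```python
def number_to_view(ranked_relevance):
    """For a ranked list of relevance judgements (True = relevant), return
    (k, rank) pairs: the user must view `rank` documents to see k relevant ones."""
    points = []
    found = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            found += 1
            points.append((found, rank))
    return points

# Hypothetical ranking: relevant documents at ranks 1, 4 and 5.
print(number_to_view([True, False, False, True, True]))
```

Plotting the first element of each pair against the second gives the number of documents a user would have to view to encounter a wanted number of relevant documents.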

Asian Languages

How to Read Less and Know More -- Approximate OCR for Thai (pp. 216-225)
  Doug Cooper
A large alphabet of similar letters and marks, wide and inconsistent variation in fonts and handwriting, and the absence of spaces between words all frustrate standard methods and applications for Thai-language OCR. We consider an alternative approach aimed at building information recognition and retrieval systems, rather than using OCR as a substitute for character-by-character data entry. Instead of trying to identify individual symbols, we define an approximation alphabet of similar shapes and clusters, targeted to the predicted lower-bound accuracy of existing OCR. We test the effectiveness of approximation alphabets of 3, 7, 9, and 27 symbols for two tasks: discriminating between ambiguous input or queries (as from handwritten or pen-based input), and indexing scanned documents (as the basis of document-based IR systems).
Overlapping Statistical Word Indexing: A New Indexing Method for Japanese Text (pp. 226-234)
  Yasushi Ogawa; Toru Matsuda
Because word boundaries are not explicitly indicated in Asian languages, including Japanese, word indexing cannot simply be applied. Although dictionary-based text segmentation techniques enable word indexing, they have some problems, such as dictionary maintenance. N-gram indexing, another conventional indexing method, suffers from an increase in index size. This paper proposes a new statistical indexing method. We first propose a segmentation method for Japanese text which uses statistical information about characters. It needs only a small amount of statistical information and computation, and does not need constant maintenance. Second, we propose a new indexing strategy which extracts some overlapping segments in addition to the segments extracted using the existing strategy, and thus increases the effectiveness of retrieval.

User Issues 2

Effectiveness of a Graphical Display of Retrieval Results (pp. 236-245)
  Aravindan Veerasamy; Russell Heikes
We present the design of a visualization tool that graphically displays the strength of query concepts in the retrieved documents. Graphically displaying document surrogate information enables set-at-a-time perusal of documents, rather than document-at-a-time perusal of textual displays. By providing additional relevance information about the retrieved documents, the tool aids the user in accurately identifying relevant documents. Results of an experiment evaluating the tool show that when users have the tool they are able to identify relevant documents in a shorter period of time than without the tool, and with increased accuracy. We have evidence to believe that appropriately designed graphical displays can enable users to better interact with the system.
Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results Using a Large Category Hierarchy (pp. 246-255)
  Marti A. Hearst; Chandu Karadi
This paper introduces a novel user interface that integrates search and browsing of very large category hierarchies with their associated text collections. A key component is the separate but simultaneous display of the representations of the categories and the retrieved documents. Another key component is the display of multiple selected categories simultaneously, complete with their hierarchical context. The prototype implementation uses animation and a three-dimensional graphical workspace to accommodate the category hierarchy and to store intermediate search results. Query specification in this 3D environment is accomplished via a novel method for painting Boolean queries over a combination of category labels and free text. Examples are shown on a collection of medical text.

Combination Techniques

A Probabilistic Model for Distributed Information Retrieval (pp. 258-266)
  Christoph Baumgarten
This paper describes a model for optimum information retrieval over a distributed document collection. The model stems from the Probability Ranking Principle: Having computed individual document rankings correlated to different subcollections, these local rankings are stepwise merged into a final ranking list where the documents are ordered according to their probability of relevance. Here, a full dissemination of subcollection-wide information is not required. The documents of different subcollections are assumed to be indexed using different indexing vocabularies. Moreover, local rankings may be computed by individual probabilistic retrieval methods. The underlying data volume is arbitrarily scalable. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model.
Analyses of Multiple Evidence Combination (pp. 267-276)
  Joon Ho Lee
It has been known that different representations of a query retrieve different sets of documents. Recent work suggests that significant improvements in retrieval performance can be achieved by combining multiple representations of an information need. However, little effort has been made to understand the reason why combining multiple sources of evidence improves retrieval effectiveness. In this paper we analyze why improvements can be achieved with evidence combination, and investigate how evidence should be combined. We describe a rationale for multiple evidence combination, and propose a combining method whose properties coincide with the rationale. We also investigate the effect of using rank instead of similarity on retrieval effectiveness.
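The contrast between combining similarity values and combining ranks can be sketched generically. This is not the combining method proposed in the paper; it is a plain sum of per-run evidence, with min-max normalisation and a linear rank transform chosen as assumptions for illustration. All function names and the two sample runs are hypothetical.

```python
def normalize(scores):
    """Min-max normalise similarity scores of one run to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def rank_scores(scores):
    """Replace similarity scores with rank-based scores (top document = 1.0)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {d: 1.0 - i / n for i, d in enumerate(ordered)}

def combine(runs, transform):
    """Sum the transformed scores of each document across runs and rank by total."""
    total = {}
    for run in runs:
        for doc, s in transform(run).items():
            total[doc] = total.get(doc, 0.0) + s
    return sorted(total, key=total.get, reverse=True)

run_a = {"d1": 0.9, "d2": 0.5, "d3": 0.1}  # hypothetical run, scores in [0, 1]
run_b = {"d2": 8.0, "d3": 6.0, "d4": 2.0}  # hypothetical run, different scale
print(combine([run_a, run_b], normalize))
print(combine([run_a, run_b], rank_scores))
```

In both variants a document retrieved by several runs accumulates evidence from each, so the document found by both sample runs with high scores rises to the top.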

Image Retrieval

Image Retrieval by Appearance (pp. 278-285)
  S. Ravela; R. Manmatha
A system to retrieve images using a syntactic description of appearance is presented. A multi-scale invariant vector representation is obtained by first filtering images in the database with Gaussian derivative filters at several scales and then computing low order differential invariants. The multi-scale representation is indexed for rapid retrieval. Queries are designed by the users from an example image by selecting appropriate regions. The invariant vectors corresponding to these regions are matched with those in the database both in feature space as well as in coordinate space and a match score is obtained for each image. The results are then displayed to the user sorted by the match score. From experiments conducted with over 1500 images it is shown that images similar in appearance and whose viewpoint is within 25 degrees of the query image can be retrieved with an average precision of 57%.
Using Semantic Contents and WordNet in Image Retrieval (pp. 286-295)
  Y. Alp Aslandogan; Chuck Thier; Clement T. Yu; Jon Zou; Naphtali Rishe
Image retrieval based on semantic contents involves extraction, modelling and indexing of content information. While extraction of abstract contents is a hard problem, it is only part of the bigger picture. In this paper we use knowledge about the semantic contents of images to improve retrieval effectiveness. In particular we use WordNet, an electronic lexical system for query and database expansion. Our content model facilitates novel uses of WordNet. We also propose a new normalization formula, an object significance scheme and evaluate their effectiveness with real user experiments. We describe the experiment setup and provide quantitative evaluation of each technique.
Image Retrieval by Hypertext Links (pp. 296-303)
  V. Harmandas; M. Sanderson; M. D. Dunlop
This paper presents a model for retrieval of images from a large World Wide Web based collection. Rather than considering complex visual recognition algorithms, the model presented is based on combining evidence of the text content and hypertext structure of the Web. The paper shows that certain types of query are amply served by this form of representation. It also presents a novel means of gathering relevance judgements.

Query Expansion

Automatic Feedback Using Past Queries: Social Searching? (pp. 306-313)
  Larry Fitzpatrick; Mei Dent
The effect of using past queries to improve automatic query expansion was examined in the TREC environment. Automatic feedback of documents identified from similar past queries was compared with standard top-document feedback and with no feedback. A new query similarity metric was used, based on comparing result lists and using probability of relevance. Our top-document feedback method showed small improvements over the no-feedback method, consistent with past studies. On recall-precision and average precision measures, past query feedback yielded performance superior to that of top-document feedback. The past query feedback method also lends itself to tunable thresholds such that better performance can be obtained by automatically deciding when, and when not, to apply the expansion. Automatic past-query feedback actually improved top document precision in this experiment.
Exploiting Clustering and Phrases for Context-Based Information Retrieval BIBAPDF 314-323
  Peter Anick; Shivakumar Vaithyanathan
This paper explores how the synergy between document clustering and phrasal analysis can be exploited to automatically construct a context-based retrieval system. A context consists of two components -- a cluster of logically related articles (its extension) and a small set of salient concepts, represented by words and phrases and organized by the cluster's key terms (its intension). At run-time, the system presents contexts that best match the result list of a user's natural language query. The user can then choose a context and manipulate the intensional component to both browse the context's extension and launch new searches over the entire database. We argue that the focused relevance feedback provided by contexts, at a level of abstraction higher than individual documents and lower than the database as a whole, provides a natural way for users to refine vague information needs and helps to blur the distinction between searching and browsing. The Paraphrase interface, running over a database of business-related news articles, is used to illustrate the advantages of such a context-based retrieval paradigm.
The Potential and Actual Effectiveness of Interactive Query Expansion BIBAPDF 324-332
  Mark Magennis; Cornelis J. van Rijsbergen
In query expansion, terms from a source such as relevance feedback are added to the query. This often improves retrieval effectiveness but results are variable across queries. In interactive query expansion (IQE) the automatically-derived terms are instead offered as suggestions to the searcher, who decides which to add. There is little evidence of whether IQE is likely to be effective over multiple iterations in a large scale retrieval context, or whether inexperienced users can achieve this effectiveness in practice. These experiments address these two questions. A small but significant improvement in potential retrieval effectiveness is found. This is consistent across a range of topics. Inexperienced users' term selections consistently fail to improve on automatic query expansion, however. It is concluded that interactive query expansion has good potential, particularly for term sources that are poorer than relevance feedback. But it may be difficult for searchers to realise this potential without experience or training in term selection and free-text search strategies.

Panel Session

Real Life Information Retrieval: Commercial Search Engines BIBAPDF 333
  Michael Lesk; Doug Cutting; Jan Pedersen; Terry Noreault; Matt Koll
The world of commercial search engines is surprisingly little covered in the literature, e.g. the devices needed to cope with searching for extremely frequent terms, efficient coding for partial word searching, and fielded searching. This panel is made up of people who have written large search systems and will give them (and the audience) a chance to discuss the technical and practical details of large search systems.
   Content representation in large systems is now a major topic as we try to move beyond simple word matching for searching. As the different Web searching systems compete, they try to offer better search quality through a variety of techniques. This includes concept searching using both manual aids and automatic detection. The panelists will be able to discuss how this is done and what kinds of progress are being made. This panel will provide insight into the practical situation in the industry today, and what kinds of technology are used in searching and why.

Posters


ACHIRA: Automatic Construction of Hypertexts for Information Retrieval Applications BIBA 335
  M. Agosti; L. Benfante; M. Melucci
The issue of automatic construction has been addressed by researchers since the early days of hypertexts. The increasing availability of on-line textual document collections, whose size is too large to permit manual authoring and construction of the hypertext, is the main reason why fully or partially automatic techniques are currently being studied and implemented.
   We have addressed the problem of the automatic construction of hypertexts for IR and now we are addressing the design and implementation of a digital library of which the ACHIRA project can be considered as an important building block.
   Main objectives of ACHIRA are:
  • Capability of managing different types of objects; an OODBMS is adopted for storing different types of objects that need to be managed in a digital library. Most relevant types are multimedia documents and metadata.
  • Integration of different retrieval approaches, such as querying and browsing: the result of such an integration enables the user to actively interact with the system.
  • Capabilities of interfacing heterogeneous data sources that have to be all accessible through a standard Web client.
  • Implementation of relevance feedback techniques using one of the constrained spreading activation methods.
   The ACHIRA architecture is centred around an interface for distributed objects connecting different types of software tools both on the server and client side. The server side is a set of service and data providers: an IR object class library supplies objects to design and implement IR applications, and the OODBMS manages a set of distributed object bases storing multimedia data and metadata. The client side includes Web clients used to interact with IR applications: a querying tool, an IR object builder, and a hypertext author.
   At present, the ACHIRA prototype implements integration of querying and browsing, management of textual data and metadata, and active users/system interaction by the use of hypertexts that are constructed "on the fly" answering the user query.
    Semantic Search and Semantic Categorization BIBA 335
      Hsinchun Chen; Andrea L. Houston; Robin R. Sewell; Bruce R. Schatz
    The Internet provides an exceptional testbed for developing algorithms to improve browsing and searching large information spaces. Our research focused on two algorithms, a Kohonen category map for browsing, and an automatically generated concept space for searching.
       Our results indicated that a Kohonen SOM-based algorithm can successfully categorize large and eclectic information spaces (Yahoo!'s Entertainment sub-category) into manageable sub-spaces which can be successfully navigated. The SOM algorithm worked best with broad browsing tasks and tasks where subjects skipped around between categories. Subjects especially liked the SOM's visual and graphical features. Subjects who tried directed searching and who tried using familiar browsing mental models (alphabetic or hierarchical) were less successful.
       Concept space results were especially encouraging. There were no significant differences among document precision for subject-suggested, thesaurus-suggested, and combined subject- and thesaurus-suggested terms. Document recall measures indicated that combined subject- and thesaurus-suggested terms were significantly better than subject-suggested terms. Retrieved homepage analysis indicated limited overlap between subject-suggested and thesaurus-suggested terms, suggesting that an automatically generated concept space can enhance a keyword-based search. Subjects especially liked the level of searching control, and the fact that thesaurus-suggested terms were "real" (i.e., originating in the homepages) and therefore successful retrieval was guaranteed.
    Visual SOM BIBA 336
      Hsinchun Chen; Marshall Ramsey; Terry R. Smith
    With the increasing complexity of user queries and the online information available, the need for information systems to effectively and efficiently bring conceptually relevant information to users becomes more pressing. Based on the concept space approach developed by the Illinois Digital Library Initiative (DLI) project and the Alexandria georeferenced collections, this research proposes to develop knowledge representations and structures to capture concepts of relevance to spatial and multimedia information (natural language phrases and geo-related textures). Selected machine learning techniques and general Artificial Intelligence (AI) graph traversal algorithms will also be adopted to assist in semantic, concept-based spreading activation in integrated knowledge networks.
       The current version of the Visual Thesaurus subdivides aerial or satellite photographs into 128 x 128 image tiles. Using the photographic coordinates and a digital gazetteer (GNIS), feature names are applied to the tiles, and a Jaccard's score is assigned. Feature analysis is performed using a bank of Gabor filters, and a normalized feature file is created. This information is used to generate a self-organizing map able to cluster visually similar tiles together, allowing access by visual feature and gazetteer feature name. It is expected that the scope of the applicability of the proposed research will extend to generic textual and multimedia digital libraries, and that the research will inspire the development of techniques to facilitate efficient and effective semantic access.
    Hypertext vs. Boolean-Based Searching in a Bibliographic Database Environment: A Direct Comparison of Searcher Performance BIBA 336
      Alexandra Dimitroff; Dietmar Wolfram
    The purpose of the present study was to carry out a direct comparison of a hypertext-based retrieval system with a traditional Boolean-based retrieval system using the same bibliographic database.
       A total of 60 novice and experienced searchers were assigned to either a prototype hypertext system called HyperLynx, or to a traditional Boolean-based system. Searchers were asked to perform five retrieval tasks on a subset of the NTIS database consisting of approximately 3,000 records. Retrieval tasks represented both specific and general subject searches with a small number and large number of potentially relevant records. Usage and performance measures collected included: time taken, record pages visited, recall, precision and success. Data were analyzed to determine if any significant differences in usage and performance existed between the searcher experience levels and the systems used.
       Findings of the study have implications for the design of future retrieval systems that take advantage of the best features of both approaches for more effective and efficient retrieval of highly structured data, such as those found in bibliographic databases.
    Searching Behavior in the GIRAFFE Ranked Retrieval System BIBA 336
      Efthimis N. Efthimiadis
    The searching behavior and retrieval effectiveness of users of GIRAFFE, a partial match ranked retrieval system, are reported in this study. GIRAFFE is an X-Windows based interface to OKAPI that has been developed for conducting information retrieval experiments.
       The issues studied included how end-users searched in a partial match environment, what steps they followed, what difficulties (conceptual or technical) they encountered during the search, and whether searchers with a library and information science background had different search behavior and performance from users without that background.
       The Wall Street Journal and San Jose Mercury News databases of the TREC test collection and twenty-six queries from the TREC query set were used. Fifty searchers were divided into two groups, the LIS-group and the NON-LIS-group, with 25 searchers in each. The LIS-group consisted of graduate students, faculty and professional librarians from UCLA with a library and information science background. The NON-LIS-group was a mix of undergraduate and graduate students, faculty and researchers from different UCLA departments without a library and information science background. Data were collected from 100 searches via questionnaires, structured interviews, participant observation and transaction logs. The results are analyzed both quantitatively and qualitatively.
    Query Improvement for Information Retrieval Using Niching Genetic Algorithm BIBA 337
      Giovanni Fanduiz; Manjula Krishnan; Yaneth Prada; B. Buckles; F. Petry; D. Kraft
    Text processing and information filtering have become topics of great interest over the past several years. The development of computer networks and information services promises to intensify the need for effective and efficient information retrieval mechanisms. One major aspect of text processing is information filtering, the determination of which of a set of documents or records should be retrieved in response to a user query for information.
       The study is focused specifically on queries that will be used repeatedly for weeks or months by clients who daily or weekly seek similar information and who will be willing to spend considerable time developing a good query, interacting with the system in order to provide relevance feedback. In our previous work we began the investigation of applying genetic algorithms to a fuzzy information retrieval system in order to improve the formulation of weighted Boolean queries by means of relevance feedback. A weighted Boolean query was viewed as a parse tree, which serves as a chromosome in terms of a genetic algorithm. Through the mechanisms of genetic programming, the weighted query was modified in order to improve precision and recall. Relevance feedback was incorporated, in part via user defined measures over a trial set of records. The fitness of a candidate query can be expressed directly as a function of the perceived relevance of the retrieved set.
       The research is an extension of our previous work to a vector space model for the query and the development of a new genetic algorithm approach using niching. This allows a more suitable solution to be evolved for the query by permitting niches to be formed corresponding to the disjunctive (OR) parts of a query. We will discuss the development of an experimental test bed of documents, creation of a retrieval system using a simple GA and development of a GA with fitness sharing as an alternative solution, in which the population is divided into niches.
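       Fitness sharing, as mentioned above, is commonly formulated by dividing each individual's raw fitness by a niche count that grows with the number of nearby individuals. The Python sketch below shows that textbook formulation only; it is not the authors' system. The triangular sharing function, the Euclidean distance over query weight vectors, the value of `sigma_share`, and all names and sample values are assumptions made for illustration.

```python
import math


def sharing(distance, sigma_share, alpha=1.0):
    """Sharing function: 1 at distance 0, falling to 0 at sigma_share."""
    if distance < sigma_share:
        return 1.0 - (distance / sigma_share) ** alpha
    return 0.0


def shared_fitness(population, raw_fitness, sigma_share, alpha=1.0):
    """Divide each raw fitness by the individual's niche count."""
    result = []
    for i, individual in enumerate(population):
        # The niche count includes the individual itself (distance 0),
        # so it is always at least 1.
        niche_count = sum(
            sharing(math.dist(individual, other), sigma_share, alpha)
            for other in population
        )
        result.append(raw_fitness[i] / niche_count)
    return result


# Hypothetical query weight vectors: two identical queries and one distinct.
population = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
raw = [0.8, 0.8, 0.6]
print(shared_fitness(population, raw, sigma_share=0.5))
```

       In this toy population the two identical queries split their fitness while the isolated query keeps its own, so selection pressure no longer favours the crowded niche alone.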
    An Exploratory Study of IR Interaction for User Interface Design BIBA 337
      Preben Hansen
    Information seeking is a dynamic and interactive process. Factors like users' information needs, individual differences, goals and tasks, knowledge and cognitive abilities etc. influence the information seeking process, and need to be identified and supported in the user interface design. We adopt a user-centered approach to establish a link between research within the IR interaction perspective and the methods in HCI on how to evaluate information seeking interaction in a hypertext IR system (Dienst). Our purpose with this exploratory study is to identify, describe and acquire knowledge of characteristics of the user population, and finally, to make suggestions for supporting users in user interface design.
       For the evaluation task, we have applied HCI evaluation techniques to our IR evaluation to make a connection between the traditional IR evaluation and HCI evaluation, combining different qualitative and quantitative data collection and analyzing methods, implemented in an experimental real-world online WWW setting. This methodology combined online (WWW-based) questionnaires and database log statistics.
       Preliminary results revealed several "hidden" realities: a mismatch between what people said they wanted to do as opposed to what they actually did. We also observed that people initially expected a specific function, but when using the system, they did not use it. Finally, we established some group differences concerning variables like previous experience searching information in hypertext systems, IR knowledge and browsing/searching strategies.
    Lessons Learned in an Informal Usability Study BIBA 337
      Dawn M. Hoffman; Laura L. Downey
    This poster examines the challenges involved in conducting an informal usability study based on the introduction of a new information retrieval system to experienced users. The specific goals of the project were to examine the usability of the new ZPRISE interface and to identify problems our users were having with the assigned task (topic development for TREC-5). A summary of activities that were performed during the usability testing and a description of the analysis methodology are presented, along with lessons learned about both the users and the testing techniques. The methodology for analyzing the results of the usability study incorporates several grouping and prioritizing methods which provide one of the major contributions of the work. For example, problem trends among users were identified by grouping observations by interface windows. These problems were then categorized into high, medium, and low priorities and turned into a set of action items based on priority category and cost/benefit analysis. Some TREC-specific lessons were also learned and have led to recommendations for changes in the TREC topic development and assessment tasks. One of these lessons was the need for two specialized task-specific interfaces (i.e., topic development and relevance assessment) and a revised training program.
    Integrating a Thesaurus for Rule Induction in Text Classification BIBA 338
      Markus Junker; Andreas Abecker
    Rule learning algorithms show encouraging results on text classification tasks. They produce understandable results, they allow easy integration of background knowledge, and they can be extended to work with complex document representations.
       A serious problem in text classification is the skewed distribution of feature values. Many highly descriptive words and word patterns only occur very rarely. Unfortunately, a standard learning approach cannot decide whether rare features are relevant or caused by chance. One method to tackle this problem is to cluster rare features based on the presented training examples.
       In contrast, clustering can be done by exploiting background knowledge. We investigated this type of clustering in the framework of a separate-and-conquer rule learning algorithm. Documents in this algorithm are represented by words and pairs of adjacent words. In addition to conjunction as the basic refinement operator for rule hypotheses, we introduced thesaurus-based operators. They allow generalization from words to more abstract concepts. The operators are based on hyponymy and meronymy hierarchies as represented by the electronic thesaurus WordNet.
    Non-Linear Information Retrieval in Multidimensional Computer-Assisted Learning Environments BIBA 338
      Slava Kalyuga
    Training for complex activities assumes acquisition (including information retrieval, schema composing, etc.) of multidimensional knowledge structures. For example, several dimensions can be identified for representing complex technical knowledge. The first dimension is oriented towards the context of the whole human activity and includes three main components: subject and object of activity, and mediating tools. The second dimension includes functional, processual and structural aspects of a component's description. Finally, each of the above-mentioned aspects of a component could be described either in general terms or in more detail. The interconnected components, aspects and levels of description represent a general model of multidimensional knowledge structure. It is suggested that the process of knowledge acquisition can be facilitated if learners are allowed independent traversing of any dimension and immediate retrieval of information about any component, aspect or level of description (according to their needs, levels of understanding and preliminary knowledge). Considering the complex, undistinctive and highly individualized character of internal human structures of knowledge, and the necessity of its adaptation to a specific situation, such a non-linear way of learning could be more efficient than a traditional prescribed linear sequence of learning. A prototype demonstrating the described cognitively oriented approach to the retrieval of instructional information has been built using a computer-based hypermedia learning environment.
    An Approach for Text Information Retrieval, Browsing, and Extraction Using Discourse Level Structure BIBA 338
      Noriko Kando
    This paper describes an approach for textual information retrieval and passage extraction using discourse-level structure. The set of typical functional components of the text type of research papers was delineated and the automatic detection of these components was tested with several corpora from various fields. Tags representing these components were embedded into the text, and then converted into a structure-tagged fulltext database. These tags can be used in searching articles and passage extraction with a structure-sensitive search engine.
       The results of our previous experiments have shown that searches using discourse-level structure are more effective than ones which do not, by distinguishing the role or function each concept plays in the text or the relationship among them and by detecting the central theme in a text. Tags are used as a kind of "role indicator" which can be assigned automatically. The passages extracted across different texts were helpful for users to compare or to summarize their content.
       However, the construction of a search statement using these tags is rather complicated for users. Therefore the paper suggests three models of search statement construction and discusses the advantages and disadvantages of each. The paper also discusses the importance of a content-based approach like this in a network environment where various types of electronic texts are available.
    Representing Search Results in Three Dimensions with Local Latent Semantic Indexing BIBA 338-339
      Michael H. Miller
    Aside from the fact that users find 3-D interfaces visually appealing, there are strong practical reasons for developing methods for visualizing search results. Traditional information retrieval systems present results in ordered lists which are difficult to browse, and do not contain any information about the relationships between documents in the list. The method described here employs Local Latent Semantic Indexing (LLSI) to create meaningful local dimensions in which to visualize hundreds of document objects. In this model, the top one hundred ranked documents from the result of a search are used to create a three dimensional LLSI document index, which can be represented in three dimensional space. Similar documents tend to form clusters, based on their location in the graph. A hierarchical clustering algorithm is used to partition the documents into four or five clusters based on the location of a document in the LLSI space. An implemented system is described which utilizes Virtual Reality Modeling Language (VRML) to display documents and their titles. Preliminary tests with a small collection of MEDLINE articles indicate that, on average, 73% of the relevant documents tend to fall into a single cluster, and over 40% of the articles within this cluster are relevant to the query. The most relevant cluster is often easily identifiable by quickly scanning a few of the document titles in each cluster.
    Exploring the Similarity Space BIBA 339
      Alistair Moffat; Justin Zobel
    Many different similarity measures have been proposed and tested during the several decades of active research into information retrieval techniques. As a result of experimentation a great deal is known about what combinations lead to "good" heuristics, and a small number of variants have been shown to consistently work well on current data sets such as TREC. However, the number and diversity of possible variants of similarity measures raises important questions: How thoroughly has the space of measures been explored? Is there, possibly, some unknown combination that significantly improves on the current best measure? What weighting schemes improve performance? And how robust are the most successful combinations?
       We addressed these questions by gathering a large number of different similarity measures and representing them in a uniform way, factoring them into eight orthogonal components that can be varied independently. We have partially explored the resulting "similarity space" to test whether particular formulations for some components work well regardless of the combination in which they are used, and whether there are new effective combinations.
       We expected in this research to confirm that standard formulations of similarity measures are effective. Indeed, this is what occurred, but close attention to inverse document frequency, document length, and within-document weighting can yield significant performance improvements. Large improvements are not available with the formulations we tried, but small improvements are. Moreover, the successful formulations are surprisingly non-portable. Different collections, different query types (for example, long or short), and different retrieval levels all affect the list of "best" measures.
    An Investigation of Subword Unit Representations for Spoken Document Retrieval BIBA 339
      Kenney Ng; Victor W. Zue
    This study investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. In this study, we examine a range of subword units of varying complexity derived from phonetic transcriptions. The basic underlying unit is the phone; more and less complex units are derived by varying the level of detail and the length of sequences of the phonetic units. We measure the ability of the different subword units to effectively index and retrieve a large collection of recorded speech messages. We also compare their performance when the underlying phonetic transcriptions are perfect and when they contain recognition errors. We find that with the appropriate subword units it is possible to achieve performance comparable to that of text-based word units if the underlying phonetic units are recognized correctly. In the presence of recognition errors, performance degrades but many subword units can still achieve reasonable performance.
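       One simple way to derive subword units of varying length from a phone sequence is to take overlapping phone n-grams. The Python sketch below illustrates only that idea and is not the authors' implementation: the function name `phone_ngrams` and the sample transcription are hypothetical, and overlapping n-grams are just one possible family of units.

```python
from collections import Counter


def phone_ngrams(phones, n):
    """Return the overlapping phone n-grams of a phone sequence."""
    return [" ".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def index_terms(phones, n):
    """Count n-gram units so they can serve as indexing terms."""
    return Counter(phone_ngrams(phones, n))


# Hypothetical phonetic transcription of one spoken word.
phones = ["r", "ih", "t", "r", "iy", "v", "ax", "l"]
for n in (1, 2, 3):
    print(n, phone_ngrams(phones, n))
```

       Varying `n` trades specificity against robustness: longer units are more discriminating, but a single misrecognized phone corrupts every unit that spans it.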
    Task-Based Training and User Performance on Full Text Retrieval BIBA 339-340
      Mary Ellen Okurowski; Kevin Ward Drummey; Ellen Powell; Shannon Williams Cobb; Andrew McCabe; Jacklyn R. Kennedy; David S. Lucas
    This study evaluated the effect of task-based training on search performance of trained, informally trained, and untrained users. Training at the U.S. Department of Defense on an information retrieval system focused on using actual job tasks and modeling best practices. Exact binomial tests on three groups of users indicate that the formally trained used the system in proportions much greater than their expected population (p<.01). Monitoring of the help desk for a 21-day period revealed that trained users are more likely to seek assistance and more fully exploit advanced system features. Randomly selected queries from the trained, informally trained, and untrained users over a two-month period were automatically scored and analyzed for query complexity. Non-parametric analysis of variance (ANOVA) methods were employed to test for significant differences between the three training groups. First, Kruskal-Wallis tests showed a significant difference among the three groups (p<.01); formally trained user queries were shown to be more complex. Moreover, Friedman two-factor ANOVA tests showed that training had more effect on query complexity for users with less technical experience (p<.10). On-the-job assessment of precision for the three groups is in progress.
    Dynamic Organization of Search Results Using a Taxonomic Domain Model BIBA 340
      Wanda Pratt; Larry Fagan; Marti Hearst
    When using search tools to find answers to a general question, people can become overwhelmed by the large number of documents retrieved. Query refinement can be used to focus the search, but in many cases there are dozens or hundreds of documents that are truly relevant to the user's information need. In this situation, tools are needed to help users explore and understand the results rather than eliminate documents.
       Our solution is to automatically group the results of a broad search into a set of hierarchically-organized categories. The approach incorporates the main advantage of clustering techniques (deriving the organization from the retrieved documents) with the main advantage of classification techniques (assigning meaningful labels to the categories). The approach uses two kinds of knowledge: query type and taxonomic domain knowledge. For each type of query, we represent taxonomic constraints that must be met for a document to belong to categories and a function for selecting category labels. The final organization of categories is determined from the hierarchy of terms in the domain model and a breadth threshold. We have implemented a prototype of this approach on medical text using the Unified Medical Language System (UMLS) as the taxonomic domain model.
    An Investigation of Mental Models and Information Seeking Behavior in a Novel Task BIBA 340
      Pamela Savage; Nicholas Belkin; Colleen Cool; Hong Xie
    Mental models are representations of objects, events, and processes that people construct through interaction with their environments. As people interact with an information retrieval (IR) system, they infer how that system works and they develop a set of expectations that they use to guide their future interactions with that system. Often these models may be applied to other IR systems, domains, and novel tasks to varying degrees of success.
       We present data that describe the interactive searching behavior of twelve searchers using the INQUERY retrieval engine in the context of the TREC-5 interactive task. Our pre-search interview, in which participants described the methods that they would use when conducting an online search in order to identify as many "aspects" as possible for a topic, was designed to elicit users' mental models. Based upon a content analysis of the responses, we were able to derive a classification scheme comprising three mental models employed by our participants: 1) start with anything, evaluate results, then plan the search, 2) start with general concepts then go to specific terms, and 3) identify keywords and try them one at a time. We discuss the mental models held by our experienced searchers, how they corresponded to actual searching behavior, the extent to which users' models "fit" the novel aspects task, the relative benefits and limitations of the models' impact on retrieval performance, and implications for system design.
    Efficient Multiple Database Search through the Optimal Use of a Multiprocessor BIBA 340
      Toru Takaki; Tsuyoshi Kitani
    A full text search system usually has multiple databases due to the necessity of physical segmentation to maintain massive amounts of data, different data sources and differences in their compilation times, and the existence of frequently searched portions of the database.
       In this poster, we propose an efficient search method for searching multiple databases using a SMP (Symmetric Multiprocessor) server. When all databases are on a single server machine, the search is performed in each process associated with each database simultaneously. Then, results from the multiple databases are merged and returned to the client. With a conventional method, the search time for the multiple databases, the time for a client to receive a search result, is primarily determined by the slowest search process.
       Our proposed method assigns the execution priority of each search process according to the estimated search time, which is largely affected by the size of the database. By assigning appropriate processor resources to each search process, the search times of all processes are averaged, so that the overall search time can be reduced. This method improves on the conventional method when the number of databases searched is greater than the number of processors of the SMP server. The experimental results, using databases containing 480,000 Japanese patent documents, proved that the proposed method can shorten the overall search time by 10% or more.
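       As a rough illustration of why ordering work by estimated search time can shorten the overall time when there are more databases than processors, the following Python sketch compares two schedules. It is not the authors' priority mechanism: it substitutes simple longest-estimated-first list scheduling, and the function names and the toy time estimates are hypothetical.

```python
import heapq
from typing import List


def makespan(estimated_times: List[float], processors: int) -> float:
    """Overall search time when jobs start in the given order.

    Each search job is handed to whichever processor becomes free first
    (non-preemptive list scheduling). The result is the finish time of
    the slowest processor, i.e. when the client could receive the
    merged result.
    """
    finish = [0.0] * processors          # current finish time per processor
    heapq.heapify(finish)
    for t in estimated_times:
        earliest = heapq.heappop(finish)  # processor that frees up first
        heapq.heappush(finish, earliest + t)
    return max(finish)


def makespan_longest_first(estimated_times: List[float], processors: int) -> float:
    """Same scheduler, but the jobs with the largest estimates start first."""
    return makespan(sorted(estimated_times, reverse=True), processors)


# Hypothetical estimates (arbitrary units) for four databases on two processors.
estimates = [2.0, 3.0, 2.0, 8.0]
arrival_order_time = makespan(estimates, processors=2)
longest_first_time = makespan_longest_first(estimates, processors=2)
```

       In this toy case the large database dominates the schedule when it starts last, while starting it first lets the smaller searches fill the other processor.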
    When Does It Make Sense to Linearly Combine Relevance Scores? BIBA 341
      Christopher C. Vogt
    One of the simplest ways to combine multiple IR systems is to take a linear combination of their relevance scores, but when is this approach most appropriate?
       Two sets of simulations are used to examine this question. A large number of "experts" (lists of 1000 documents and their relevance scores for a single query) are used. One set of simulations uses randomly generated experts, the other uses the submissions from last year's TREC conference. For every possible pair of experts, the best linear combination is estimated by raster-scanning the ratio of multiplicative weights to find the one resulting in the highest average precision. A number of measures are made of both experts: average precision, J (a measure of rank correlation between the relevance scores and the user's relevance assessments), Guttman's Point Alienation (another measure of rank correlation) between the two systems, and a modified GPA wherein only relevant documents are used.
       The simulations show that whereas the degree of improvement cannot be completely predicted using only the above measures, they are nevertheless useful, and that the best time to linearly combine experts is when both have reasonable performance of similar magnitude, but do not rank documents in a similar fashion.
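       The procedure described above can be sketched in Python under some stated assumptions. This is an illustrative sketch, not the author's code: the weights are parameterized as w and 1 - w (one way of scanning the ratio of two non-negative weights), the scan uses an evenly spaced grid, and all function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of a ranked list of document ids."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def combine(a: Dict[str, float], b: Dict[str, float], w: float) -> List[str]:
    """Rank documents by w * score_a + (1 - w) * score_b, highest first.

    Documents missing from one expert get a score of 0.0 from it
    (an assumption of this sketch); ties are broken by document id.
    """
    docs = set(a) | set(b)
    score = {d: w * a.get(d, 0.0) + (1.0 - w) * b.get(d, 0.0) for d in docs}
    return sorted(docs, key=lambda d: (-score[d], d))


def best_weight(a: Dict[str, float], b: Dict[str, float],
                relevant: Set[str], steps: int = 21) -> Tuple[float, float]:
    """Scan w over an even grid in [0, 1]; return (w, average precision)."""
    best_w, best_ap = 0.0, -1.0
    for i in range(steps):
        w = i / (steps - 1)
        ap = average_precision(combine(a, b, w), relevant)
        if ap > best_ap:
            best_w, best_ap = w, ap
    return best_w, best_ap


# Two toy "experts" that each rank one relevant document well.
expert_a = {"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.1}
expert_b = {"d1": 0.3, "d2": 0.1, "d3": 0.9, "d4": 0.8}
relevant_docs = {"d1", "d3"}
w, ap = best_weight(expert_a, expert_b, relevant_docs)
```

       The toy experts have equal individual average precision but rank the documents differently, which is the situation the abstract identifies as most favorable for linear combination.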

    Tutorials: Descriptions

    Algorithmic and Cognitive Approaches for Information Retrieval BIBA 343
      Peter Ingwersen; Peter Willett
    This tutorial will start with an introduction to IR systems and will discuss their principal components, such as documents, queries and relevance assessments, inter alia. It will then summarise the main features of algorithmic and cognitive approaches to IR, thus providing attendees with background for the research presentations later in the conference. The algorithmic area focuses principally on the algorithms and data structures that are needed to maximise retrieval effectiveness whilst maintaining a reasonable level of retrieval efficiency. The cognitive section summarises a range of communicative and psycho-sociological studies of IR systems focusing on user-centered approaches to information systems design.
    Multimedia Information Retrieval BIBA 343
      Norbert Fuhr
    The aim of this tutorial is to survey the state of the art in multimedia IR. The focus is on indexing and retrieval methods for multimedia, whereas system-oriented aspects will not be addressed. More specifically, the following major concepts are to be taught in the tutorial:
  • basic properties of text, images, audio, video
  • views on media objects: physical (layout), structural (logical), symbolic, spatial, temporal, perceptive
  • modelling the structure of multimedia documents
  • feature-based and semantic indexing methods for text, images, speech, video
  • multimedia retrieval: classical IR models vs. logic-based approaches
  • retrieval of structured documents
       Additional information can be found at the Web page of the MMIS course at the University of Dortmund: http://ls6-www.informatik.unidortmund.de/ir/teaching/courses/mmis/
    Software Agents for Information Retrieval BIBA 343
      Tim Finin; James Mayfield; Charles Nicholas
    This tutorial will provide an introduction to software agents and their potential applications in IR systems. The tutorial will be divided into three sections of roughly one hour each, followed by a short conclusion. The first will present concepts which underlie the software agents paradigm and illustrate them with a range of example applications. The second part will cover agent software architectures, agent communication languages, and cooperation protocols. The third segment will present examples of agent-based IR systems and discuss the techniques used in them.
    Cross-Language Information Retrieval BIBA 343-344
      Douglas W. Oard
    Cross-language information retrieval techniques offer important functionality to multilingual systems by allowing queries formed in a single language to be used over the entire collection. Cross-language retrieval systems also offer monolingual users the potential to limit the expenditure of expensive translation resources to potentially useful documents that are identified with a cross-language selection interface. The tutorial will begin with descriptions of cross-language text retrieval applications and some examples of deployed systems. The capabilities and limitations of controlled vocabulary techniques will be described and used to motivate the subsequent discussion of free text techniques. Current research on knowledge-based approaches that exploit dictionaries will be presented in detail and research on techniques based on more sophisticated ontologies will be discussed briefly. Multilingual corpora provide another important source of information on the relationship between languages, and techniques based on both parallel and comparable corpora will be presented in detail. A description of cross-language selection interfaces will complete the discussion of current research on cross-language text retrieval. The tutorial will conclude with a brief discussion of the potential application of these techniques to cross-language speech retrieval, identification of open research topics on cross-language retrieval, and a brief summary of sponsored research opportunities in the United States and the European Community.
    Information Retrieval Systems: Research and Design Methods BIBA 344
      Raya Fidel; Philip J. Smith
    The objective of the course is to familiarize IR researchers and system developers with the design and evaluation of IR systems from a cognitive ergonomics perspective. The initial part of the course will be organized around a conceptual framework for pursuing a research project from discovery to validation. This introduction will include a discussion of alternative research approaches and the associated data collection methods. This initial section will conclude by focusing on the analysis of verbal protocols and discourse for model building and hypothesis testing. The second part of the course will discuss the adaptation and extension of the above research methods for system development purposes. First, the collection of verbal and behavioral data as part of usability studies will be considered. Then, a number of analytical techniques will be outlined for predicting the impact of design decisions on users' cognitive processes. Particular emphasis will be placed on the use of alternative representations to assist with the generation of a predictive cognitive task analysis. This second half of the course will be centered around a series of case studies.
    Machine Learning for Information Retrieval BIBA 344
      David D. Lewis
    This tutorial will discuss machine learning methods for IR tasks, including retrieval, categorization, and routing/filtering. The emphasis will be on supervised learning (i.e. learning from manually classified examples), with some attention to unsupervised methods (e.g. clustering, LSI) for representation change. The use of machine learning in commercial IR software will be touched upon, but the emphasis will be on research findings. The tutorial will attempt to clarify the links between important but sometimes confusing concepts from IR (e.g. term weighting, query expansion, relevance feedback, classification, etc.) and important but sometimes confusing concepts from machine learning (e.g. feature extraction, overfitting, generalization, classification, etc.).
    Evaluation of IR Systems BIBA 344
      William R. Hersh; Stephen E. Robertson
    The tutorial will provide an overview and critical assessment of information retrieval system evaluation. Until now the Cranfield approach to IR with recall and precision measures has dominated retrieval testing. Developments in end-user information systems such as CD-ROMs, hypertext public access systems, and the Internet are presenting new evaluation challenges. The tutorial will start with basic research concepts and their application in IR evaluation. Approaches adopted in classic retrieval experiments will be presented and their limitations will be discussed. More recent evaluative studies conducted at City University London, Oregon Health Sciences University, and TREC will be used to illustrate efforts towards more user-centered evaluation. The final discussion will consider future directions in accommodating both system- and user-oriented evaluation in IR.
    Implementation of High Performance Information Retrieval Systems BIBA 344
      Alistair Moffat; Justin Zobel
    Basic IR techniques, developed and refined over more than thirty years, are well known. However, it is only recently that these techniques have been applied to document collections containing gigabytes of text. This tutorial examines the practical problems of indexing, querying, storing, and updating gigabyte-sized text databases. It describes a variety of recently developed techniques for coping with the scale of modern text collections, including fast indexing methods, fast query evaluation strategies, and fast text and index compression mechanisms. The public-domain software system MG will be used as an example, and participants will be given guidance on the installation and use of MG. The tutorial will conclude with a description of other indexing methods, in particular signature files, and an evaluation of their usefulness.

    Pre-Conference Workshop

    Education and Curriculum Development for Multimedia, Hypertext, and Information Access: Focus on DL and IR BIBA 345
      Edward Fox
    This workshop is part of a series of meetings that began in 1995 to develop guidelines for curricula and courses in the broad area of "information" (Multimedia, Hypertext and Information Access). Attendees will help draft guidelines (similar to those by SIGGRAPH, SIGCHI) for curricula, courses and training programs in this area. Educators will present syllabi and describe courseware for courses or training programs about digital libraries or information retrieval. Employers will describe knowledge and skills they seek when recruiting in these areas. Researchers will explain testbeds that can be used by learners. Workshop results will be disseminated over WWW and later through ACM publications, and also will be made available through online courseware for undergraduate and graduate students.
    Note: Joint with DL '97

    Post-Conference Workshops

    Beyond Word Relations BIBA 345
      Beth Hetzler
    Many IR systems identify documents or provide a document visualization based on analysis of a particular relationship among documents -- that of similar content. But there may be layers of other less apparent and less traditional relationships that would potentially be useful to the user. Building a theoretical framework for this "other" information is the subject of this workshop. The focus will be on identifying non-traditional relationships which may be valuable to analysis, and on integrating the traditional and the non-traditional.
       The goal of the workshop is to enhance our understanding of the linkages and associations among documents by:
  • Identifying semantic relationships among documents. For example, some readily
       apparent relationships include documents with the same subject or theme,
       that share a property, that reference or quote one another, that share the
       same purpose, or that embody a cause-and-effect relationship.
  • Categorizing those relationships
  • Identifying attributes of the relationships
  • Identifying areas for follow-on research, such as visualization possibilities
    Networked Information Retrieval BIBA 345
      Jamie Callan; Chris Buckley; Norbert Fuhr
    The recent and rapid growth of the Internet and corporate intranets poses new problems for Information Retrieval. There is now a need for tools that help people navigate the network, select which collections to search, and fuse the results returned from searching multiple collections. These problems are being addressed by the international IR research community and a number of digital libraries projects around the world, such as the U.S. Digital Libraries projects, the ERCIM Digital Libraries projects and the German MEDOC project.
       The goal of this workshop is to bring together people from each of these areas to discuss their varying approaches to common problems. Researchers are invited to submit position papers or extended abstracts discussing novel approaches to the following problems:
  • Resource selection: selecting from among a set of collections or databases;
  • Data fusion: merging or fusing results from different collections or
  • Browsing, summarization and visualization of distributed resources;
  • Archival retrieval methods for heterogeneous objects;
  • Metaknowledge;
  • Consistency;
  • Multilingual environments;
  • User interfaces; and
  • Architectures for networked information retrieval
    Summarization and Visualization for IR: Reducing the Information Overload BIBA 346
      James Allan; Amit Singhal
    How can IR techniques be used to reduce the cliched "information overload" problem? Researchers have been using IR's statistical methods to provide "best sentence" summaries of documents for decades, but other types of summaries are needed to assimilate the piles of information available on world-wide networks. For example:
  • Grouping of retrieved texts into related classes.
  • Lists of main topics in a collection or in a retrieved set.
  • Summaries of non-textual material: is a thumbnail the only approach?
  • Displays of how retrieved material relates to other material (the collection,
       earlier results, etc.)
  • Query- or user-specific summaries.
  • Improving coherence and coverage of summaries.
  • Visuals that help a user understand why documents were retrieved or whether
       it is likely that the desired document is anywhere to be found.
  • Evaluation of summarization and visualization techniques
    Crosslingual Information Retrieval BIBA 346
      Jaime Carbonell; Yiming Yang
    Crosslingual Information Retrieval (aka "translingual" or "multilingual" IR) is a rapidly growing area of IR, driven in part by the ease of information access across national and linguistic boundaries afforded by the internet and the web. The 1996 crosslingual (CIR) SIGIR workshop helped establish this new field, and there has been considerable progress since then in the context of TREC and in a number of new CIR techniques and comparative evaluations.
       This workshop offers a forum for discussion of developments and emerging issues in CIR. In particular, we expect to address:
  • New methods for CIR (beyond dictionary-based query translation)
  • The role of query expansion in CIR
  • The role of bilingual corpora in CIR
  • Can MT help in CIR, and if so how?
  • How should CIR performance be evaluated?
  • Can we set some common benchmarks and/or corpora?
  • What message(s) should we carry to TREC wrt CIR?
  • What are the greatest challenges for CIR?