
Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the Eighteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Edward A. Fox; Peter Ingwersen; Raya Fidel
Location: Seattle, Washington
Dates: 1995-Jul-09 to 1995-Jul-13
Standard No: ISBN 0-89791-714-6; ACM Order Number 606950
  1. Keynote Address
  2. Distributed IR and the Internet
  3. Efficiency Techniques
  4. Advanced Systems
  5. Text Summarization
  6. Integrating Structured and Unstructured Information
  7. Natural Language Processing
  8. Keynote Address
  9. User Studies
  10. Fusion Strategies
  11. Search Interfaces
  12. Cognition and Association
  13. Automatic Classification
  14. Text Categorization
  15. Retrieval Logic
  16. Term Statistics
  17. Feedback Methods
  18. Panels
  19. Systems Demonstrations: Abstracts
  20. Posters: Abstracts
  21. Post-Conference Research Workshops

Keynote Address

Digital vs. Libraries: Bridging the Two Cultures BIB 2
  Terry Winograd

Distributed IR and the Internet

NetSerf: Using Semantic Knowledge to Find Internet Information Archives BIBA 4-11
  Anil S. Chakravarthy; Kenneth B. Haase
This paper describes the architecture, implementation and evaluation of NetSerf, a program for finding information archives on the Internet using natural language queries. NetSerf's query processor extracts structured, disambiguated representations from the queries. The query representations are matched to hand-coded representations of the archives using semantic knowledge from WordNet (a semantic thesaurus) and an on-line Webster's dictionary. NetSerf has been tested using a set of questions and answers developed independently for a game called Internet Hunt. The paper presents results comparing the performance of NetSerf and the standard IR system SMART on this set of queries.
Dissemination of Collection Wide Information in a Distributed Information Retrieval System BIBA 12-20
  Charles L. Viles; James C. French
We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocated among sites. Low dissemination is needed for random document allocation, but higher levels are needed when documents are allocated based on content. We define parameters to control dissemination and document allocation and present results from four test collections. We define the notion of iso-knowledge lines with respect to the number of sites and level of dissemination in the distributed archive, and show empirically that iso-knowledge lines are also iso-effectiveness lines when documents are randomly allocated.
Searching Distributed Collections with Inference Networks BIBA 21-28
  James P. Callan; Zhihong Lu; W. Bruce Croft
The use of information retrieval systems in networked environments raises a new set of issues that have received little attention. These issues include ranking document collections for relevance to a query, selecting the best set of collections from a ranked list, and merging the document rankings that are returned from a set of collections. This paper describes methods of addressing each issue in the inference network model, discusses their implementation in the INQUERY system, and presents experimental results demonstrating their effectiveness.
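The collection-ranking step can be illustrated with a simple df-based scoring sketch (a generic scheme for illustration only; the paper's actual inference network formulas in INQUERY differ):

```python
import math

def rank_collections(query_terms, coll_stats, n_colls):
    """Rank collections by a df.icf-style weight: a collection scores
    higher when it holds many documents containing a query term, and
    higher still when that term occurs in few collections.
    coll_stats: {collection_name: {term: document_frequency}}."""
    # Collection frequency: in how many collections does each term occur?
    cf = {t: sum(1 for s in coll_stats.values() if s.get(t, 0) > 0)
          for t in query_terms}
    scores = {}
    for coll, stats in coll_stats.items():
        scores[coll] = sum(
            math.log(1 + stats.get(t, 0)) * math.log(1 + n_colls / cf[t])
            for t in query_terms if cf[t] > 0)
    return sorted(scores, key=scores.get, reverse=True)
```

A broker would query only the top-ranked collections and then merge the document rankings they return.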

Efficiency Techniques

Fast Evaluation of Structured Queries for Information Retrieval BIBA 30-38
  Eric W. Brown
Information retrieval systems are being challenged to manage larger and larger document collections. In an effort to provide better retrieval performance on large collections, more sophisticated retrieval techniques have been developed that support rich, structured queries. Structured queries are not amenable to previously proposed optimization techniques. Optimizing execution, however, is even more important in the context of large document collections. We present a new structured query optimization technique which we have implemented in an inference network-based information retrieval system. Experimental results show that query evaluation time can be reduced by more than half with little impact on retrieval effectiveness.
Efficient Recompression Techniques for Dynamic Full-Text Retrieval Systems BIBA 39-47
  Shmuel T. Klein
An efficient variant of an optimal algorithm is presented, which, in the context of a large dynamic full-text information retrieval system, reorganizes data that has been compressed by an on-the-fly compression method based on LZ77, into a more compact form, without changing the decoding procedure. The algorithm accelerates a known technique based on a reduction to a graph-theoretic problem, by reducing the size of the graph, without affecting the optimality of the solution. The new method can thus effectively improve any dictionary compression scheme using a static encoding method.

Advanced Systems

Design of a Reusable IR Framework BIBA 49-57
  Gabriele Sonnenberger; Hans-Peter Frei
In this paper, we describe the design of a reusable IR framework, called FIRE, that is being implemented to facilitate the development of IR systems. In addition, FIRE is designed to support the experimental evaluation of both indexing and retrieval techniques. First, we discuss the development of reusable software in the IR domain and derive essential criteria for the design of an IR framework. Next, we sketch the object model developed for FIRE. We present the basic concepts and their modeling and show how the components interact when performing indexing and retrieval tasks.
Parallel Text Retrieval on a High Performance Supercomputer Using the Vector Space Model BIBA 58-66
  P. Efraimidis; C. Glymidakis; B. Mamalis; P. Spirakis; B. Tampakas
This paper discusses the efficiency of a parallel text retrieval system that is based on the Vector Space Model. Specifically, we describe a general parallel retrieval algorithm for use with this model, the application of the algorithm in the FIRE system [1], and its implementation on the high performance GCel3/512 Parsytec parallel machine [2]. The use of this machine's two-dimensional grid of processors provides an efficient basis for the virtual tree that lies at the heart of our retrieval algorithm. Analytical and experimental evidence is presented to demonstrate the efficiency of the algorithm.

Text Summarization

A Trainable Document Summarizer BIBAK 68-73
  Julian Kupiec; Jan Pedersen; Francine Chen
* To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original.
* This paper focuses on document extracts, a particular kind of computed document summary.
* Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.
* The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus.
* We have developed a trainable summarization program that is grounded in a sound statistical framework.
  Keywords: Summary sentence, Original documents, Summary pairs, Training corpus, Document extracts
Generating Summaries of Multiple News Articles BIBAK 74-82
  Kathleen McKeown; Dragomir R. Radev
We present a natural language system which summarizes a series of news articles on the same event. It uses summarization operators, identified through empirical analysis of a corpus of news summaries, to group together templates from the output of the systems developed for ARPA's Message Understanding Conferences. Depending on the available resources (e.g., space), summaries of different length can be produced. Our research also provides a methodological framework for future work on the summarization task and on the evaluation of news summarization systems.
  Keywords: Natural language summarization, Natural language generation, Summarization of multiple texts

Integrating Structured and Unstructured Information

Integrating IR and RDBMS Using Cooperative Indexing BIBA 84-92
  Samuel DeFazio; Amjad Daoud; Lisa Ann Smith; Jagannathan Srinivasan; Bruce Croft; Jamie Callan
The full integration of information retrieval (IR) features into a database management system (DBMS) has long been recognized as both a significant goal and a challenging undertaking. By full integration we mean: i) support for document storage, indexing, retrieval, and update; ii) transaction semantics, so that all database operations on documents have the ACID properties of atomicity, consistency, isolation, and durability; iii) concurrent addition, update, and retrieval of documents; and iv) database query language extensions to provide ranking for document retrieval operations. It is also necessary for the integrated offering to exhibit scalable performance for document indexing and retrieval processes. To identify the implementation requirements imposed by the desired level of integration, we layered a representative IR application on Oracle Rdb and then conducted a number of database load and document retrieval experiments. The results of these experiments suggest that infrastructural extensions are necessary to obtain both the desired level of IR integration and scalable performance. With the insight gained from our initial experiments, we developed an approach, called cooperative indexing, that provides a framework to achieve both scalability and full integration of IR and RDBMS technology. Prototype implementations of system-level extensions to support cooperative indexing were evaluated with a modified version of Oracle Rdb. Our experimental findings validate the cooperative indexing scheme and suggest alternatives to further improve performance.
A Language for Queries on Structure and Contents of Textual Databases BIBA 93-101
  Gonzalo Navarro; Ricardo Baeza-Yates
We present a model for querying textual databases by both the structure and contents of the text. Our goal is to obtain a query language which is expressive enough in practice while being efficiently implementable, features not present at the same time in previous work. We evaluate our model regarding expressivity and efficiency. The key idea of the model is that a set-oriented query language based on operations on nearby structure elements of one or more hierarchies is quite expressive and efficiently implementable, being a good tradeoff between both goals.
An NF² Relational Interface for Document Retrieval, Restructuring and Aggregation BIBA 102-110
  Kalervo Jarvelin; Timo Niemi
Complex documents are used in many environments, e.g., information retrieval (IR). Such documents contain subdocuments, which may contain further subdocuments, etc. Powerful tools are needed to facilitate their retrieval, restructuring, and analysis. Existing IR systems are poor at complex document restructuring and data aggregation. However, in practice, IR system users would often want to obtain aggregation information on subdocuments of complex documents. In this paper we address this problem and provide a truly declarative and powerful interface for the users. Our interface is based on the non-first-normal-form (NF²) relational model. It allows intuitive and systematic modeling of complex documents.

Natural Language Processing

Fast and Quasi-Natural Language Search for Gigabytes of Chinese Texts BIBA 112-120
  Lee-Feng Chien
This paper presents an efficient signature file approach for fast and intelligent retrieval of large Chinese full-text document databases. The proposed approach is an integrated and efficient text access method, which performs well both in exact match searching of Boolean queries and best match searching (ranking) of quasi-natural language queries. Using this approach, the inherent difficulties of Chinese word segmentation and proper noun identification can be effectively reduced, queries can be expressed with non-controlled vocabulary, and the ranking function can be easily implemented without demanding extra space overhead or affecting retrieval efficiency. The experimental results show that the proposed approach achieves good performance in many ways, especially in the reduction of false drops and space overhead, the speedup of retrieval time, and the capability of best match searching using quasi-natural language queries. In conclusion, the proposed approach is capable of retrieving gigabytes of Chinese texts very efficiently and intelligently.
A New Character-Based Indexing Method using Frequency Data for Japanese Documents BIBA 121-129
  Yasushi Ogawa; Masajirou Iwasaki
Character-based indexing is preferable for Japanese IR systems since Japanese text is not segmented into words. This paper proposes a new character indexing method that enhances our previous method, which divided character pair index entries into disjoint groups based on character classes. Since frequency data is used to determine hashed entries for character pairs and to establish a special string index, both search speed and precision are improved. Moreover, bit strings are managed using small and large blocks, so registration and retrieval are accelerated. Experiments using patent abstracts showed that these proposals are quite effective.
Little Words Can Make a Big Difference for Text Classification BIBA 130-136
  Ellen Riloff
Most information retrieval systems use stopword lists and stemming algorithms. However, we have found that recognizing singular and plural nouns, verb forms, negation, and prepositions can produce dramatically different text classification results. We present results from text classification experiments that compare relevancy signatures, which use local linguistic context, with corresponding indexing terms that do not. In two different domains, relevancy signatures produced better results than the simple indexing terms. These experiments suggest that stopword lists and stemming algorithms may remove or conflate many words that could be used to create more effective indexing terms.

Keynote Address

Evaluation of Evaluation in Information Retrieval BIBA 138-146
  Tefko Saracevic
Evaluation is a major force in research, development and applications related to information retrieval (IR). This paper is a critical and historical analysis of evaluations of IR systems and processes. Strengths and shortcomings of evaluation efforts and approaches are discussed, together with major challenges and questions. A limited comparison is made with evaluation in expert systems and Online Public Access Catalogs (OPACs). Evaluation is further analyzed in relation to the broad context and specific problems addressed. Levels of evaluation are identified and contrasted; most IR evaluations were concerned with the processing level, but others were conducted at the output, users and use, and social levels. A major problem is the isolation of evaluations at a given level. Issues related to systems under evaluation, and evaluation criteria, measures, measuring instruments, and methodologies are examined. A general point is also considered: IR is increasingly embedded into many other applications, such as the Internet or digital libraries. Little evaluation in the traditional IR sense is undertaken in relation to these applications. The challenges are to integrate IR evaluations from different levels and to incorporate evaluation in new applications.

User Studies

Searchers and Searchers: Differences between the Most and Least Consistent Searchers BIBA 149-157
  Mirja Iivonen
Differences between the most and least consistent searchers are considered. Attention is paid both to term consistency and concept consistency. The paper is based on an empirical study in which 32 searchers formulated query statements from 12 search requests. The searchers were also interviewed to obtain information about their experience. There was a statistically significant dependence between term consistency and the terminological styles of searchers on the one hand, and between concept consistency and searchers' search strategies on the other. There were also clear differences in the experience of the most and least consistent searchers, both in information storage and in information retrieval.
Information Processing in the Context of Medical Care BIBA 158-163
  Valerie Florance; Gary Marchionini
We report findings from an exploratory study whose overall goal was to design an online document surrogate for journal articles, customized for use in clinical problem solving. We describe two aspects of literature-based medical decision making. First, there are interaction effects among citations in a search output (or among articles in a stack of articles) that affect the physician's judgment of clinical applicability. Second, physicians select among different information processing strategies when attempting to use the literature to find an answer to a clinical question.
Towards New Measures of Information Retrieval Evaluation BIBA 164-170
  William R. Hersh; Diane L. Elliot; David H. Hickam; Stephanie L. Wolf; Anna Molnar; Christine Leichtenstien
All of the methods currently used to evaluate information retrieval (IR) systems have limitations in their ability to measure how well users are able to acquire information. We utilized an approach to assessing information obtained based on the user's ability to answer questions from a short-answer test. Senior medical students took the ten-question test and then searched one of two IR systems on the five questions for which they were least certain of their answers. Our results showed that pre-searching scores on the test were low but that searching yielded a high proportion of answers with both systems. These methods are able to measure information obtained, and will be used in subsequent studies to assess differences among IR systems.

Fusion Strategies

Learning Collection Fusion Strategies BIBA 172-179
  Ellen M. Voorhees; Narendra K. Gupta; Ben Johnson-Laird
Collection fusion is a data fusion problem in which the results of retrieval runs on separate, autonomous document collections must be merged to produce a single, effective result. This paper explores two collection fusion techniques that learn the number of documents to retrieve from each collection using only the ranked lists of documents returned in response to past queries and those documents' relevance judgements. Retrieval experiments using the TREC test collection demonstrate that the effectiveness of the fusion techniques is within 10% of the effectiveness of a run in which the entire set of documents is treated as a single collection.
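The core idea of learning per-collection retrieval cutoffs can be sketched as follows (hypothetical data structures for illustration; not the paper's actual techniques):

```python
def learn_fusion_weights(past_runs):
    """Estimate each collection's average yield of relevant documents
    from past queries. past_runs maps a collection name to a list of
    (ranked_doc_ids, relevant_doc_set) pairs for training queries."""
    avg_rel = {}
    for coll, runs in past_runs.items():
        counts = [sum(1 for d in ranking if d in rel)
                  for ranking, rel in runs]
        avg_rel[coll] = sum(counts) / len(counts)
    return avg_rel

def allocate(avg_rel, total):
    """Split a retrieval budget of `total` documents across collections
    in proportion to each collection's estimated relevant-document yield."""
    z = sum(avg_rel.values()) or 1.0
    return {coll: round(total * v / z) for coll, v in avg_rel.items()}
```

A new query would then retrieve the allocated number of documents from each collection before merging.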
Combining Multiple Evidence from Different Properties of Weighting Schemes BIBA 180-188
  Joon Ho Lee
It is well known that using different representations of queries or documents, or different retrieval techniques, retrieves different sets of documents. Recent work suggests that significant improvements in retrieval performance can be achieved by combining multiple representations or multiple retrieval techniques. In this paper we propose a simple method for retrieving different documents within a single query representation, a single document representation and a single retrieval technique. We classify the types of documents, and describe the properties of weighting schemes. Then, we explain that different properties of weighting schemes may retrieve different types of documents. Experimental results show that significant improvements can be obtained by combining the retrieval results from different properties of weighting schemes.
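Combining runs from different weighting schemes can be sketched by summing normalized scores, so that documents retrieved by several schemes accumulate evidence (an illustrative fusion rule, not necessarily the combination used in the paper):

```python
def minmax(scores):
    """Min-max normalize a {doc: score} run into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def combine(*runs):
    """Fuse several {doc: score} runs by summing normalized scores,
    then return documents in descending fused-score order."""
    fused = {}
    for run in runs:
        for d, s in minmax(run).items():
            fused[d] = fused.get(d, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)
```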
Efficient Processing of Vague Queries using a Data Stream Approach BIBA 189-197
  Ulrich Pfeifer; Norbert Fuhr
In this paper, we consider vague queries in text and fact databases. A vague query can be formulated as a combination of vague criteria. A single database object can meet a vague criterion to a certain degree. We confine ourselves to queries for which the answer can be computed efficiently by (perhaps repetitive) combination of rankings into new rankings. Since users usually will inspect only some of the best answer objects, the corresponding rankings need to be computed just as far as necessary to generate these first answer objects. In this contribution we describe an approach for estimating the number of elements needed from the basic rankings to compute a given number of elements of the resulting ranking. Experiments with a large text database demonstrate the applicability of our approach.
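The lazy-evaluation idea, pulling only as many elements from each basic ranking as the merged top-k requires, can be sketched with Python's heap-based merge (illustrative only; the paper's estimation method is more involved):

```python
import heapq
import itertools

def lazy_merge(streams, k):
    """k-way merge of score-descending (doc, score) streams. heapq.merge
    consumes the input iterators lazily, so each basic ranking is read
    only as far as needed to emit the top-k of the merged ranking."""
    merged = heapq.merge(*streams, key=lambda item: -item[1])
    return list(itertools.islice(merged, k))
```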

Search Interfaces

Document Analysis for Visualization BIBA 199-204
  David Dubin
An experimental term selection strategy for document visualization is described. Strong discriminators with few co-occurrences increase the clustering tendency of low-dimensional document browsing spaces. Clustering tendency is tested with diagnostic measures adapted from the field of cluster analysis, and confirmed using the VIBE visualization tool. This method supports browsing in high recall, low precision document retrieval and classification tasks.
Users' Model of the Information Space: The Case for Two Search Models BIBAK 205-210
  Sylvia Willie; Peter Bruza
Computerised information spaces evolved using the Boolean logic paradigm for retrieval of their stored information. While many studies have looked at ways to improve training to enable users to create queries appropriate to their information need, little attention has been paid to users' cognitive models of the information. This research shows that when people can illustrate their queries graphically, they have at least two quite different mental models. These incorporate the information space they are working with and the particular approach applied to their search. This paper presents the findings to date and indicates the additional avenues which we believe warrant investigation.
  Keywords: Venn metaphor, Information retrieval, Individual differences, Query languages, Novices

Cognition and Association

The Newspaper Image Database: Empirical Supported Analysis of Users' Typology and Word Association Clusters BIBA 212-218
  Susanne Ornager
This paper touches upon the problems arising in connection with indexing and retrieval for effective searching of digitized images. An empirical study, based on 13 newspaper archives, demonstrates that rules for indexing images can be formulated and that a user group typology can be established. An image user model is suggested based on word clusters. The empirical analysis demonstrates how the results of the word association tests can be used as the foundation for a semantic model.
Human Memory Models and Term Association BIBA 219-227
  Gerda Ruge
Results of cognitive psychology research are analysed to explain why it is difficult for retrieval system users to bring to mind alternative search terms. A human memory model is modified in such a way that it produces additional search terms instead of human associations. A small experiment shows that such a spreading activation network can find alternative terms -- with a performance similar to the normally used similarity measures.

Automatic Classification

A Comparison of Classifiers and Document Representations for the Routing Problem BIBA 229-237
  Hinrich Schutze; David A. Hull; Jan O. Pedersen
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
   Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
A Case-Based Approach to Intelligent Information Retrieval BIBA 238-245
  Jody J. Daniels; Edwina L. Rissland
We have built a hybrid Case-Based Reasoning (CBR) and Information Retrieval (IR) system that generates a query to the IR system by using information derived from CBR analysis of a problem situation. The query is automatically formed by submitting in text form a set of highly relevant cases, based on a CBR analysis, to a modified version of INQUERY's relevance feedback module. This approach extends the reach of CBR for retrieval purposes to much larger corpora and injects knowledge-based techniques into traditional IR.
Evaluating and Optimizing Autonomous Text Classification Systems BIBA 246-254
  David D. Lewis
Text retrieval systems typically produce a ranking of documents and let a user decide how far down that ranking to go. In contrast, programs that filter text streams, software that categorizes documents, agents which alert users, and many other IR systems must make decisions without human input or supervision. It is important to define what constitutes good effectiveness for these autonomous systems, tune the systems to achieve the highest possible effectiveness, and estimate how the effectiveness changes as new data is processed. We show how to do this for binary text classification systems, emphasizing that different goals for the system lead to different optimal behaviors. Optimizing and estimating effectiveness is greatly aided if classifiers that explicitly estimate the probability of class membership are used.
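Tuning a probability-estimating binary classifier to a goal can be sketched as a decision threshold derived from the four outcome utilities (a standard decision-theoretic construction, used here for illustration rather than as the paper's exact procedure):

```python
def decision_threshold(u_tp, u_fp, u_fn, u_tn):
    """Probability threshold above which assigning the class maximizes
    expected utility. Derived by solving the break-even condition
    p*u_tp + (1-p)*u_fp = p*u_fn + (1-p)*u_tn for p."""
    return (u_tn - u_fp) / ((u_tp - u_fn) + (u_tn - u_fp))

def classify(p_class, threshold):
    """Assign the class when the estimated membership probability
    meets the utility-derived threshold."""
    return p_class >= threshold
```

Different utility settings (e.g., penalizing false positives more heavily) move the threshold, which is exactly how "different goals lead to different optimal behaviors."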

Text Categorization

Noise Reduction in a Statistical Approach to Text Categorization BIBA 256-263
  Yiming Yang
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition to cut off noisy "latent semantic structures" during training; and the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without loss of categorization accuracy were evident in the testing results.
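The LLSF mapping itself can be sketched as an ordinary least-squares solve from a document-by-word matrix to a document-by-category matrix (toy data for illustration; the paper's noise reduction steps are omitted):

```python
import numpy as np

# Toy training data: rows are documents, columns are word counts.
X = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 0., 2.]])
# Category indicator matrix: one column per category.
Y = np.array([[1., 0.],
              [0., 1.],
              [1., 0.]])

# Word-to-category association matrix minimizing ||X @ W - Y||.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def categorize(doc_vec, threshold=0.5):
    """Map a word-count vector into category space and return the
    indices of categories scoring above the threshold."""
    scores = doc_vec @ W
    return [j for j, s in enumerate(scores) if s >= threshold]
```

The noise reduction strategies in the paper shrink X (word removal), replace it by a truncated SVD approximation, or sparsify W after training.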
Partial Orders for Document Representation: A New Methodology for Combining Document Features BIBA 264-272
  Steven Finch
This paper describes a novel paradigm for representing many types of information about documents in a manner particularly suited to text categorization by a trivial empirical rule induction system. It also has potential application to full-text retrieval paradigms.
   The paradigm allows many different types of document predicates to be combined together with logical dependencies being controlled for. This is shown to be justified by any reasonable model of descriptor inference, and the effect of increasing representation sophistication is shown for two corpora.
Cluster-Based Text Categorization: A Comparison of Category Search Strategies BIBA 273-280
  Makoto Iwayama; Takenobu Tokunaga
Text categorization can be viewed as a process of category search, in which one or more categories for a test document are searched for by using given training documents with known categories. In this paper a cluster-based search with a probabilistic clustering algorithm is proposed and evaluated on two data sets. The efficiency, effectiveness, and noise tolerance of this search strategy were confirmed to be better than those of a full search, a category-based search, and a cluster-based search with nonprobabilistic clustering.

Retrieval Logic

Probabilistic Datalog -- A Logic for Powerful Retrieval Methods BIBA 282-290
  Norbert Fuhr
In the logical approach to information retrieval, retrieval is considered as uncertain inference. Here we present a new, powerful inference method for this purpose which combines Datalog with probability theory on the basis of intensional semantics. We describe the syntax and semantics of probabilistic Datalog and also present an evaluation method and a prototype implementation. This approach allows for easy formulation of specific retrieval models for arbitrary applications, and classical probabilistic IR models can be implemented by specifying the appropriate rules. In comparison to other approaches, the possibility of recursive rules allows for more powerful inferences. Finally, probabilistic Datalog can be used as a query language for integrated information retrieval and database systems.
Probability Kinematics in Information Retrieval BIBA 291-299
  F. Crestani; C. J. van Rijsbergen
In this paper we discuss the dynamics of probabilistic term weights in different IR retrieval models. We present four different models based on different notions of retrieval. Two of these models are classical probabilistic models long in use in IR; the two others are based on a logical technique for evaluating the probability of a conditional, called Imaging, one being a generalisation of the other. We analyse the transfer of probabilities occurring in the representation space at retrieval time for these four models, compare their retrieval performance using classical test collections, and discuss the results.
An Image Retrieval Model Based on Classical Logic BIBA 300-308
  Carlo Meghini
Images are a communication medium, and hence objects of a linguistic nature having a form and a content. The form of an image is the image appearance and is understood as depicting a scene, the image content. The relationship between the form of an image and its content is established through a process of interpretation, capturing the meaning of the image form. Any information need on images can, and indeed has to, be seen as addressing either the image form, or its content, or the relationship between them. Consequently, any general, domain-independent image retrieval facility should be based on a model supporting all these aspects of images. An image retrieval model, based on classical logic, is proposed which fulfills this basic requirement.

Term Statistics

One Term or Two? BIBA 310-318
  Kenneth Ward Church
How effective is stemming? Text normalization? Stemming experiments test two hypotheses: one term (+stemmer) or two (-stemmer). The truth lies somewhere in between. The correlations, ρ, between a word and its variants (e.g., +s, +ly, +uppercase) tend to be small (refuting the one term hypothesis), but non-negligible (refuting the two term hypothesis). Moreover, ρ varies systematically depending on the words involved; it is relatively large for a good keyword, ρ(hostage, hostages) ≈ 0.5, and small for pairs with little content, ρ(anytime, Anytime) ≈ 0, or conflicting content, ρ(continental, Continental) ≈ 0.
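Such correlations can be computed from per-document occurrence indicators; a minimal sketch with toy data (not the paper's corpus or exact estimator):

```python
import math

def correlation(docs, w1, w2):
    """Pearson correlation between per-document occurrence indicators of
    two terms. Near 1 means the pair behaves like one term (stemming is
    safe); near 0 means it behaves like two distinct terms."""
    x = [1.0 if w1 in d else 0.0 for d in docs]
    y = [1.0 if w2 in d else 0.0 for d in docs]
    n = len(docs)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy) if sx and sy else 0.0
```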
    Detecting Content-Bearing Words by Serial Clustering BIBA 319-327
      A. Bookstein; S. T. Klein; T. Raita
    Information Retrieval Systems typically distinguish between content-bearing words and terms on a stop list. But "content-bearing" is relative to a collection. For optimal retrieval efficiency, it is desirable to have automated methods for custom-building a stop list. This paper defines the notion of serial clustering of words in text, and explores the value of such clustering as an indicator of a word bearing content. The numerical measures we propose may also be of value in assigning weights to terms in requests. Experimental support is obtained from natural text databases in three different languages.
    Note: Extended Abstract
    Applying Probabilistic Term Weighting to OCR Text in the Case of a Large Alphabetic Library Catalogue BIBA 328-335
      Elke Mittendorf; Peter Schauble; Paraic Sheridan
    We report on a probabilistic weighting approach to indexing the scanned images of very short documents. This fully automatic process copes with short and very noisy texts (67% word accuracy) derived from the images by Optical Character Recognition (OCR). The probabilistic term weighting approach is based on a theoretical proof explaining how the retrieval effectiveness is affected by recognition errors. We have evaluated our probabilistic weighting approach on a sample of index cards from an alphabetic library catalogue where, on the average, a card contains only 23 terms. We have demonstrated over 30% improvement in retrieval effectiveness over a conventional weighted retrieval method where the recognition errors are not taken into account. We also show how we can take advantage of the ordering information of the alphabetic library catalogue.
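    A highly simplified sketch of how recognition errors might be folded into term weighting (an assumption for illustration; the paper derives its probabilistic model formally): discount the observed term frequency by the word-level recognition accuracy before applying a tf-idf style weight.

```python
import math

def error_aware_weight(tf_observed, df, n_docs, p_correct):
    """Toy error-aware weight: trust only a p_correct fraction of the
    observed occurrences, then apply the usual inverse document frequency.
    Illustrative only; not the paper's probabilistic weighting scheme."""
    expected_tf = tf_observed * p_correct
    idf = math.log(n_docs / df)
    return expected_tf * idf

# a term appearing twice on an OCR'd catalogue card, at 67% word accuracy
print(error_aware_weight(2, df=100, n_docs=10000, p_correct=0.67))
```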

    Feedback Methods

    Relevance Feedback with Too Much Data BIBA 337-343
      James Allan
    Modern text collections often contain large documents that span several subject areas. Such documents are problematic for relevance feedback since inappropriate terms can easily be chosen. This study explores the highly effective approach of feeding back passages of large documents. A less-expensive method that discards long documents is also reviewed and found to be effective if there are enough relevant documents. A hybrid approach that feeds back short documents and passages of long documents may be the best compromise.
    On the Reuse of Past Optimal Queries BIBA 344-350
      Vijay V. Raghavan; Hayri Sever
    Information Retrieval (IR) systems exploit user feedback by generating an optimal query with respect to a particular information need. Since obtaining an optimal query is an expensive process, the need for mechanisms to save and reuse past optimal queries for future queries is obvious. In this article, we propose the use of a query base, a set of persistent past optimal queries, and investigate similarity measures between queries. The query base can be used either to answer user queries or to formulate optimal queries. We justify the former case analytically and the latter case by experiment.
    Optimization of Relevance Feedback Weights BIBA 351-357
      Chris Buckley; Gerard Salton
    Methods for learning term weights using relevance information from a learning set of documents have been studied for decades in information retrieval research. The approach used here, Dynamic Feedback Optimization, starts with a good weighting scheme based on Rocchio feedback, and then improves those weights in a dynamic fashion by testing possible changes of query weights on the learning set documents. The resulting optimized query performs 10-15% better than the original when evaluated on the test set. We discuss the constant tension between describing what a relevant document should contain, and describing what the known relevant documents do contain.


    Panels

    Funding for IR Research BIB 358
      Efthimis Efthimiadis; Maria Zemankova; Milton Corn
    Education for IR BIBA 358
      Kazem Taghva; Edward Fox; Stephen Robertson; Nicholas Belkin; David Lewis; Donna Harman
    The SIGIR Education Committee has recently been formed based on the model of the SIGCHI committee which completed its final report in 1992. The committee is charged with developing curriculum recommendations for IR-related education to serve the computer science, library science, and information science communities. It will be soliciting input from information retrieval educators and the consumers of IR education (students and employers), with a view to determining the current status of IR education, the marketplace, and future direction. The committee is also interested in:
  • clearinghouses for IR courseware and training materials
  • electronic as well as traditional courses
  • demonstrations for online access to state of the art systems
  • other innovative efforts
    The purpose of this panel is to report briefly on the activities of the Education Committee and to stimulate discussion on the state of information retrieval education. The panel will consist of three IR educators from different communities (computer science, library science, information science) who will give brief (about 10 minute) presentations on their views of the purpose and content of IR curricula; a representative from the government sector will report on the role of IR in government agencies; and a participant from the industrial sector will consider the role of IR education in industry.

    Systems Demonstrations: Abstracts

    VUSE for INSPEC; and EPOQUE for Windows BIBA 359
      Steve Pollitt
    CeDAR -- The Centre for Database Access Research, School of Computing and Mathematics, University of Huddersfield, UK, has pioneered the use of view-based techniques to improve the effectiveness of user-interfaces to both bibliographic and corporate databases. Two systems are presented:
       VUSE for INSPEC: This front-ending software searches the 5 million record INSPEC database and is a by-product of a research project launched on 1st Sept. 1991. The project has been funded by the University of Huddersfield in collaboration with the Institution of Electrical Engineers, Marconi Research Laboratories and STN-International (FIZ-Karlsruhe). The VUSE (View-based User Search Engine) system removes the need for the user to appreciate explicit Boolean statements by introducing a search strategy of successive refinement through the use of filtering views. These techniques are described in "Peek-a-Boo revived -- End-user searching of bibliographic databases using filtering views." by A Steven Pollitt, Martin P Smith and Geoffrey P Ellis, Online 94, 18th International Online Information Meeting, London, December 1994, pp. 63-72. This PC-resident software has been used in the investigation of ranking and relevance feedback extensions to VUSE, the subject of PhD research being undertaken by Martin P Smith.
       EPOQUE for Windows: Presented in collaboration with the Directorate of Informatics and Telecommunications at the European Parliament in Luxembourg.
       CeDAR is responsible for specifying the thesaurus interface that provides the new guided search mode to EPOQUE (European Parliament Online QUEry system). EPOQUE for Windows, made available in April 1995, is designed to facilitate querying of the European Parliament's main documentary database through the incorporation of VUSE techniques. EPOQUE documents are indexed by the multilingual EUROVOC thesaurus which provides a significant demonstration of the suitability of view-based techniques for multilingual retrieval. An example of how this approach has been demonstrated on the Apple Macintosh can be found in: "Using the thesaurus to view and filter environmental databases: An example using EUROVOC to search EPOQUE -- the European Parliament Online Query System." by A Steven Pollitt, Geoffrey P Ellis and Martin P Smith, The First European ISKO Conference on Environmental Knowledge Organisation and Information Management, 14-16 Sept. 1994, Bratislava, Slovakia. in Stancikova P and Dahlberg I (Eds) Knowledge Organization in Subject Areas, Vol. 1 (1994) Supplement pp 21-32 Pub: INDEKS VERLAG Frankfurt/Main.
    DOTPLOT BIBA 359-360
      Kenneth W. Church; Jonathan I. Helfman
    An interactive program, "dotplot," has been developed for browsing millions of lines of text and source code, using an approach borrowed from biology for studying homology (self-similarity) in DNA sequences. With conventional browsing tools such as a screen editor, it is difficult to identify structures that are too big to fit on the screen. In contrast, with dotplots we find that many of these structures show up as diagonals, squares, textures and other visually salient features, as will be illustrated in examples selected from biology and two new application domains: text (AP news, Canadian Hansards) and source code (5ESS(R)).
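    The core data structure behind a dotplot can be sketched in a few lines (a minimal version; the actual tool adds interaction, approximate matching, and scaling to millions of lines):

```python
def dotplot(units):
    """Mark cell (i, j) when unit i equals unit j. Repeated regions
    (duplicated code, recurring phrases) show up as off-diagonal
    diagonal runs; the main diagonal is always marked."""
    n = len(units)
    return [[units[i] == units[j] for j in range(n)] for i in range(n)]

def render(matrix):
    """ASCII rendering: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

# the repeated pair ("a", "b") produces a short off-diagonal diagonal
print(render(dotplot(["a", "b", "c", "a", "b"])))
```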
    BRUIN: Browsing and Retrieval of text and mUltimedia resources for Information retrieval educatioN BIBA 360
      Efthimis N. Efthimiadis; Ricardo Parodi
    The BRUIN prototype presents ideas for the implementation of a digital library as a resource for supporting information retrieval (IR) education.
       BRUIN utilizes concept maps, based on the IR literature, such as the Belkin and Croft (ARIST, 1987) classification of retrieval techniques, to provide an overarching structure for the system as well as a visualization mechanism.
       BRUIN uses different technologies, such as Web browsers (Mosaic, Netscape, etc.), graphics viewers, and retrieval engines, and integrates them under a common user interface.
       The resources are accessed either by searching the database, or by using the concept map to enter the system and browse through the resources. Documents have hypertext links to other documents in the database and are linked through document clustering and citation linking. Documents also contain links to HTML documents (in-house and off-site), Powerpoint slides, graphics, other multimedia elements, screen captures of system displays, and URLs to systems that are mentioned in the database and are accessible on the Internet.
      David A. Evans
    The CLARIT system consists of a set of flexible tools for application in a wide range of information management problems. These tools integrate natural-language processing (NLP), automatic knowledge discovery, and traditional information retrieval techniques. An advanced functionality application for free-text database management is demonstrated, incorporating full NLP, a broad range of querying mechanisms, automatic or user controlled query expansion, document collection profiling, document summarization, automatic document classification, and integrated handling of scanned images. The application provides rapid analysis of potentially large queries over large-scale databases in monolithic or client/server processing modes.
    Head-Coupled Stereo Display for Visualization in a Document Retrieval System using Associative Networks BIBA 360
      Richard H. Fowler; Wendy A. L. Fowler; Aruna Kumar; Jorge L. Williams
    The system to be demonstrated provides head-coupled stereoscopic viewing of 3-d visual representations for document collections, associative term thesauri, and individual documents. This style of interface has been called "fish tank VR" and shown to be relatively effective for 3-d viewing and interaction tasks. The system also provides mechanisms to integrate query formulation across the visual representations. Supplying interaction techniques to access the system's several visual representations is one of the system's goals. The demonstration allows users to experiment with 3-d interaction in the highly interactive, iterative process of information retrieval.
    MARIAN: Ranked Retrieval in a Full-Scale Library Catalog System BIBA 360-361
      Robert France
    Ranked retrieval techniques offer library users the ability to find works based on incomplete or partially incorrect descriptions. In addition, they offer a robust and unusual approach to exploratory or subject-based searches. Library data, however, is of a type not usually encountered in ranked retrieval systems: it is highly structured, involves very short text fields, and includes references to other objects such as people and subject categories. In MARIAN we have adapted techniques from vector search systems, from information theory, and from semantic network processing systems to provide effective approximate matching in this domain. Both canned and hands-on demonstrations will be provided on a complete research library collection of c. 1,000,000 records.
      Matthew Freedman; Scott Heyano; Ellen Jensen; William Jordan; Debra Ketchell
    Willow is a general-purpose, extensible information retrieval tool. It uses database drivers to translate user queries and actions into the idiom expected by the remote search system. Through its Z39.50 driver, Willow can communicate with any search system that understands version 2 of the ISO Z39.50-1994 search and retrieval protocol. This demonstration illustrates how Willow isolates users from the idiosyncratic query and command syntaxes of diverse information retrieval systems. It also demonstrates the list browser mechanism that helps a user choose search terms as she moves through data ordered along an arbitrary axis. Finally, it demonstrates the multimedia extensions that permit Willow to deal with complex data such as sound, images, SGML tagged text, or non-Roman character sets.
    WATERS: The Wide Area TEchnical Report Service, Dienst, and NCSTRL: A National Computer Science Technical Report Library BIBA 361
      James C. French; Charles L. Viles; James R. Davis
    The Wide Area TEchnical Report Service (WATERS) and Dienst are distributed databases of computer science technical reports. Contributors are departments of computer science that make their reports available through the World-Wide Web. The reports are stored locally at the contributing sites so that users with a client such as Mosaic can browse, search, obtain abstract and bibliographic information, and retrieve technical reports online.
       NCSTRL, the National Computer Science Technical Report Library, is a joint effort of teams from the NSF sponsored WATERS project and the ARPA CSTR project. This demonstration will be the first public showing of the results of their collaboration.
      David Harper; David Hendry
    The Eclair class library is a set of extensible C++ classes for implementing best-match IR applications. Developers use Eclair to either add IR functionality to an existing application or to develop new applications from scratch. Using a loosely-coupled user interface (written in Tcl/Tk), we demonstrate a variety of IR application features and discuss how the code abstractions offered by Eclair were employed. A traditional best-match model as well as a recent probabilistic inference approach (IJdens, Bruza & Harper, submitted) are used in the demonstration. Also discussed is a MultiMedia application where pictures are represented by complex indexing features, designed for effective retrieval. We explain how the implementation of IR applications is supported by Eclair.
    LyberWorld: A 3D Graphical User Interface for Fulltext Retrieval BIBA 361-362
      Matthias Hemmje
    The LyberWorld system introduces a prototypical application of information visualization components for IR user interfaces. The prototype implements visualizations of an abstract information space -- fulltext. It demonstrates a visual user interface for the probabilistic fulltext retrieval system INQUERY. Visualizations are used to communicate information search and browsing activities in a natural way by applying metaphors of spatial navigation and attraction in abstract information spaces. Visualization tools for exploring textual information spaces and judging relevance of information items are introduced and example sessions are provided. The presence of a spatial model in the user's mind and interaction with a system's corresponding display methods is regarded as an essential contribution towards natural interaction and reduction of cognitive costs during e.g. query construction, orientation within the database content, relevance judgement and orientation within the retrieval context.
    ITMS BIBA 362
      Russell P. Holsclaw
    The technology on which the ITMS is developed is referred to as the Judgment Space (J-SPACE). A Judgment Space is an N-dimensional Euclidean space with a coordinate system in which the reference axes are interpreted as subject matter dimensions. Textual units are assigned point locations in the space and the projections of each point on the reference axes are interpreted as the degree of relevance of that textual unit to that subject matter dimension. The procedure involves: 1) selecting a number of technical expressions and 2) obtaining scaled judgments as to the degree of relevance of each of the technical expressions, terms, etc. to each of the subdomains of the subject matter. The result is a two dimensional matrix reflecting the relevance of each term to each sub-domain which can be interpreted as an N-dimensional Euclidean Space in which is embedded a configuration of K vectors extending from the origin of the space.
    FUN: An NF2 Relational Interface with Aggregation Capability for Document Retrieval, Restructuring and Analysis BIBA 362
      Kalervo Jarvelin; Timo Niemi
    Complex documents are used in many environments, e.g., information retrieval (IR). Such documents contain subdocuments, which may contain further subdocuments, etc. In practice, document database users often want to view selected complex documents in different structures and to obtain aggregation information on their subdocuments. Therefore powerful tools are needed for complex document retrieval, restructuring, and analysis. The FUN system provides powerful filter conditions, full restructuring capability and multi-attribute multi-level data aggregation of structured complex documents represented in the non-first-normal-form (NF2) relational model. In particular, the FUN system provides these capabilities in a truly declarative and powerful interface.
    Automatic Building of Hypertext Links in Digital Libraries BIBA 362
      Robert B. Kellogg; Madhan Subhas; Edward A. Fox
    Our demonstration, Automatic Building of Hypertext Links in Digital Libraries, seeks to reduce the cost of authoring quality hypertext documents by taking advantage of promising information retrieval techniques. A set of tools will be presented that assist document authors in dynamically creating hypertext documents. The ability of the hypertext engine to semi-automatically and automatically create and remove bi-directional links will be demonstrated. The links will be generated based on similarity between documents and document components that reside in the collection. A World-Wide Web browser will be used to demonstrate the results of the hypertext linking tools.
    BIRD: Browsing Interface for the Retrieval of Documents BIBA 363
      Hanhwe Kim
    BIRD (Browsing Interface for the Retrieval of Documents) provides a visual interface for browsing and sifting through document collections. Documents behave like metal filings, and terms like magnets that attract the documents they index. Lists corresponding to any Boolean query can be built by iterative operations which involve separating a collection of documents into subsets according to one or two terms, merging selected subsets, and manually adding/deleting documents from the sets. Users can examine the documents in the lists at any time, and thus keep track of browsing sessions, while sifting through large collections of documents.
    An Interface for Remotely Searching a Newswire Multi-Data-Base System, with Functions for the Automatic Identification of Duplicate Information/Documents by the Use of Text Clustering Techniques BIBA 363
      John Kirriemuir; Peter Willett
    The system being demonstrated illustrates various relationships between newswire articles/documents; these documents are retrieved, in real time, from a multi-database belonging to a telecommunications company. The software is able to identify many near-duplicate records in database search outputs, such as may arise from several sources submitting various rewrites of the same original article to the database, or the abstract and full-text versions of the same article being submitted. The software is activated by additional functions on the end-user database search interface, that allow the user, when searching and retrieving documents, some control over the clustering process.
    VIBE: Visual Information Browsing Environment BIBA 363
      Robert R. Korfhage
    VIBE (Visual Information Browsing Environment) is a visual interface for information systems, focusing on the clustering and organization of documents in a collection, with respect to multiple reference points (e.g., query, user profile, known documents). The highly dynamic, interactive interface can be used in two vector modes, normal (iconic display of hundreds of documents) and ASTRO (display of thousands of documents), and in one Boolean mode, showing documents for all Boolean combinations of the reference points. Multiple tools are available to help the user organize and view the documents. User-defined thresholds can limit the documents shown to only the more relevant ones.
    PIRCS: An Effective Text Retrieval System BIBA 363
      K. L. Kwok
    PIRCS (Probabilistic Indexing and Retrieval -- Components -- System) is a highly effective, probability and network-based IR system designed for large scale heterogeneous collections. Factors contributing to its effectiveness include representation enhancements, sophisticated term weighting, learning network capability, and combining multiple retrieval strategies. PIRCS does not need a full inverted file but maintains a direct file that allows a network for retrieval and learning to be created dynamically. Documents are organized into subcollections served by a common master lexicon with cumulative statistics. Documents are then mixed and ranked for retrieval as if from a single collection. PIRCS currently runs on a SparcStation with 128MB and 7 GB of disk space.
    Creation and Navigation of Virtual Semantic Space BIBA 363-364
      Kok F. Lai; Wei-Jun Wang
    In most text retrieval systems, similarity between words and documents is captured in large similarity matrices which derive their coefficients from various similarity measures. To a typical human observer, the similarity matrix represents abstract mathematical relations where extraction of meaningful relationships is virtually impossible. We will demonstrate an interactive system which maps these relations into virtual semantic spaces whereby Euclidean distances between objects are inversely proportional to their similarity. Furthermore, we provide navigational tools that enable one to travel inside this virtual semantic space as one might explore a physical space. The interactivity allows one to exploit human experience rather than technological prowess to comprehend the semantic relations.
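    The stated layout rule (Euclidean distance inversely proportional to similarity) can be sketched as a target-distance matrix that a layout procedure such as multidimensional scaling would then try to realise in 3-d. The conversion below is illustrative, not necessarily the system's exact transform:

```python
def target_distances(sim):
    """Convert a symmetric similarity matrix into target Euclidean
    distances with d(i, j) proportional to 1 / sim(i, j), as described
    above. eps guards against division by zero for unrelated pairs."""
    n = len(sim)
    eps = 1e-9
    return [
        [0.0 if i == j else 1.0 / max(sim[i][j], eps) for j in range(n)]
        for i in range(n)
    ]

sim = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
d = target_distances(sim)
assert d[0][1] == 2.0  # more similar pairs get smaller target distances
```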
    Cheshire II: Demonstration of a Next-Generation Online Catalog System BIBA 364
      Ray R. Larson; Ralph Moon; Jerome McDonough; Lucy Kuntz; Paul O'Leary
    Cheshire II is a next-generation online catalog and full-text information retrieval system using advanced IR techniques. It is a client/server system that uses SGML as the underlying database format in the server search engine, supports probabilistic and Boolean searching, "nearest neighbor" searching and relevance feedback via the Z39.50 IR protocol. A graphical client interface provides access to the system, and to any other Z39.50 compliant servers. The system is being deployed in a working library environment and its use and performance are being evaluated.
    A Multilingual IR Engine BIBA 364
      Mun-Kew Leong
    This is a demonstration of a multilingual information retrieval engine which operates independently of the language of a source document. Individual documents may contain any mix of alphabetic and character-based languages. The main focus, however, is on Asian (character-based) languages, and we will show the results of applying various linguistic methods to enhance document retrieval in such languages. These will include code-set based stop-word lists, compound nominal identification, extraction, and indexing, and phrase segmentation, indexing, and retrieval.
    Visual Displays of SIGIR Documents BIBA 364
      Xia Lin
    In an earlier SIGIR paper (Lin, et al. 1991), a method was proposed to construct a semantic map for information retrieval by a self-organizing algorithm, Kohonen's feature map. An important feature of the semantic map is to help the user visualize contents of underlying documents. This demonstration shows a prototype that implements such a semantic map as a graphical interface for retrieval systems. Using documents from SIGIR 86-93 as a test base, the prototype shows several IR literature maps based on different indexing methods such as title indexing, title-abstract indexing, and fulltext indexing. Comparing these maps has led to the exploration of relationships between document visualization and document indexing. Some preliminary results of the exploration will be illustrated during interactive demonstrations of the prototype.
    Promenade: An Integrated OODB/IR System for WWW Image Retrieval BIBA 364-365
      Stuart A. McLean; Edie Rasmussen
    Promenade is an image-document retrieval system which integrates free-text and attribute-value queries in an object-oriented database query language (OSQL from Ontos). The query language provides a protocol upon which we were able to build an HTML interface to make the stored collections available to the World Wide Web through standard Web browsers (Mosaic, Netscape, MacWeb). Promenade currently hosts two image databases for the National Agricultural Library on the World Wide Web: a collection of botanical prints from Curtis Botanical Magazine (1797-1827), and a collection of plant pest and disease photographs from the Michigan State University Cooperative Extension Service.
    The MG Retrieval System BIBAPDF 365
      Alistair Moffat; Justin Zobel
    The MG system provides facilities for compressing, indexing, and searching large collections of documents and images, and has been the primary tool used by the Melbourne-based CITRI group during (to date) three years of TREC experiments. One of the key features of the MG system is the extensive use of compression, reducing storage space, index construction time, and retrieval time. For example, the 2 Gb TREC collection is stored -- including compressed text, compressed inverted index, and other auxiliary files -- in about 750 Mb, and is fully built in under 8 hours. Multi-term Boolean and ranked queries are evaluated in seconds, and decompression of answers is also very fast. The MG software is available free of charge by anonymous ftp from munnari.oz.au, directory pub/mg. A tutorial guide appears as an appendix in Managing Gigabytes: Compressing and Indexing Documents and Images, Ian H. Witten, Alistair Moffat, Timothy C. Bell, Van Nostrand Reinhold, New York, 1994.
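    One compression idea underpinning systems like MG is coding the gaps between successive document numbers in each inverted list; MG itself uses more powerful codes (e.g. Golomb coding, described in Managing Gigabytes), but variable-byte coding of d-gaps conveys the flavour:

```python
def vbyte_encode(docids):
    """Encode a sorted posting list as variable-byte coded d-gaps.
    Small gaps, which dominate for common terms, take a single byte."""
    out = bytearray()
    prev = 0
    for d in docids:
        gap = d - prev
        prev = d
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits, continuation byte
            gap >>= 7
        out.append(gap | 0x80)      # high bit flags the last byte of a gap
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode, reconstructing absolute document numbers."""
    docids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        if b & 0x80:
            prev += cur | ((b & 0x7F) << shift)
            docids.append(prev)
            cur, shift = 0, 0
        else:
            cur |= b << shift
            shift += 7
    return docids

postings = [3, 7, 11, 1000, 1001]
assert vbyte_decode(vbyte_encode(postings)) == postings
```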
    The MIRACLE System: Using Abductive Inference and Dynamic Indexing to Retrieve Multimedia SGML Documents BIBA 365
      Adrian Muller; Ulrich Thiel
    The retrieval of complex data such as multimedia items and SGML-structured texts can be facilitated by means of a formal representation of syntactic and semantic knowledge about these data. These information sources must be aggregated dynamically at the time of query processing. MIRACLE (MultImedia concept Retrieval based on logiCaL query Expansion) is an interactive, probabilistic retrieval system, which comprises an extended Bayesian network, a multimedia indexing component and an abductive retrieval engine. The inference process exploits and controls the multiple index structures of the network. The prototype is demonstrated on a collection of SGML structured dictionary articles.
    Envision: Information Visualization in a Digital Library BIBA 365
      Lucy T. Nowell; Edward A. Fox
    Envision is a multimedia digital library of computer science literature, with full-text searching and full-content retrieval capabilities, serving computer science researchers, teachers, and students at all levels of expertise. The most unusual feature of Envision is its Graphic View window, which provides powerful information visualization facilities that enable users to explore patterns in the literature.
       Envision's Graphic View window displays search results as a matrix of icons that represent documents. Users have control over the semantics of six graphical devices: icon position along the x-axis and y-axis, the alphanumeric icon label, icon size, icon color, and icon shape. These graphical devices may represent a number of document attributes: probable relevance to query, publication year, document type (e.g., text, video, hypermedia), document size, number of sources, author names, and index terms. Working in tandem with the Graphic View, the Item Summary window presents bibliographic information for icons selected by the user. Document content is presented on demand using Mosaic and a suite of related viewers. Recent studies show strong user interest and satisfaction, and minor changes suggested by users are being incorporated into newer versions of the interface software. Implementation efforts have led to an X Motif version of the Envision interface, which will be shown with a sample of the overall digital library collection.
    GUIDO: Graphical User Interface for Document Organization BIBA 366
      Assadaporn Nuchprayoon
    GUIDO (Graphical User Interface for Document Organization) provides a visual interface for browsing and retrieving documents from document collections. The visual display allows the user to view the document collection according to chosen reference points. Changing reference points gives users an opportunity to view the collection from different angles. Reference points can also be created dynamically from a document or a cluster of documents. These reference points also play the role of queries in a vector space model, where users can draw a boundary around them to include documents with high relevance in the retrieval set. Users can examine the documents on the display at any time.
    Interactive Filtering with a Gaussian User Model BIBA 366
      Douglas Oard; Nicholas DeClaris; Bonnie Dorr; Christos Faloutsos; Gary Marchionini
    Text filtering systems are designed to sift through large quantities of dynamically generated texts and display only those which may be relevant to a user's interests. We are particularly interested in interactive filtering environments in which relevance judgments become available in real time. We will demonstrate a prototype interactive system for filtering USENET news which is based on a multidimensional Gaussian user interest model. That model allows us to include differing levels of specificity for different concepts in the interest representation, potentially improving performance when compared to the cosine measure.
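    The advantage claimed over the cosine measure, differing specificity per concept, can be sketched with a diagonal-covariance Gaussian (an illustration of the idea; the prototype's actual model and parameters are not given in the abstract):

```python
import math

def gaussian_interest(doc, mean, var):
    """Log-likelihood of a document vector under a diagonal-covariance
    Gaussian user-interest model. A small variance makes that concept
    dimension highly specific; a large variance makes it nearly ignored."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(doc, mean, var)
    )

mean = [0.8, 0.2]   # strong interest in concept 0, mild in concept 1
var = [0.01, 1.0]   # strict about concept 0, permissive about concept 1
on_topic = [0.8, 0.9]
off_topic = [0.1, 0.2]
assert gaussian_interest(on_topic, mean, var) > gaussian_interest(off_topic, mean, var)
```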
    EFRS: Empirical Fact Retrieval System BIBA 366
      Sam Oh
    Empirical Fact Retrieval System (EFRS) provides access to statistical research findings. Using EFRS, variable name(s) can be searched to find all the associated variables investigated by other scholars. The system displays all the associated variables, together with their statistical information and source documents. Searches can be further restricted by indicating significance level and strength of relationship between variables. Different alpha levels, direction of relationships (positive or negative) and strength of relationships can also be specified. To develop this system, an ER diagram of statistical research findings was drawn; this ER schema was then converted into a relational schema. The system is built using the Microsoft Access relational database system.
      Daniel Knaus; Peter Schauble
    The EUROSPIDER system is a full-fledged Information Retrieval (IR) system for searching large and complex data collections for relevant objects. Depending on the configuration, the EUROSPIDER system can be used as a standalone IR system, added to a World-Wide Web server to make a data collection accessible over a network, or added to a commercial database (DB) system to provide access to a possibly very dynamic and structured data collection. In the last case, the integration of the EUROSPIDER system and a DB system provides both IR functionality (relevance ranking, feedback searches, document analysis) and DB functionality (data model, query language, transaction processing, access control). An advanced integration of the EUROSPIDER system with a DB system is achieved by using a probabilistic retrieval model which takes the DB schema into account. The EUROSPIDER system is the commercial version of the IR system SPIDER developed at the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland. Demonstrations are available at [http://www-ir.inf.ethz.ch].
    FISsearch BIBA 367
      Willem Scholten
    FISsearch is an experimental retrieval system consisting of an IR module, a bi-directional HTTP-to-Z39.50 gateway, a Document Object Abstraction facility, and a location-independent content delivery system implementing URNs and URLs. The system is specifically being built to provide full-text search capabilities over large archival collections by indexing OCR output and, at retrieval time, delivering the bitmapped image of the original document. We believe this IR system is unique insofar as it incorporates a term-noise corruption factor into its weighting algorithms. This additional weight is obtained during the OCR process, as output from the recognition neural net, and normalized between 0 and 1000. Initial research has shown that the effectiveness of the total system is good and that a high degree of corruption in the indexes can be dealt with.
    InfoCrystal: A Visual Information Retrieval Interface BIBA 367
      Anselm Spoerri
    The InfoCrystal uses a simple visual metaphor to enable users to deal with some of the complexities inherent in information retrieval. The InfoCrystal can be used both as a visualization tool and as a visual query language to help users search for information. In particular, it allows users to specify Boolean as well as vector-space queries graphically. In this demonstration we provide an overview of the key features of the InfoCrystal and its implementation. We also present the results of two experiments demonstrating that the InfoCrystal can be successfully used by novice users to specify an information need after only a short training tutorial.
    Querying, Navigating and Visualizing an Online Library Catalog BIBA 367
      Aravindan Veerasamy
    We demonstrate a graphical interface to a library catalog information retrieval system. This ranked output IR system interface has combined a novel set of features to help the end-user in a wide range of information gathering situations. The system supports the following:
  • Navigational features such as browsing table of contents and browsing list of
       articles written by a specific author.
  • An integrated online thesaurus from which end-users can pick words and
       phrases to expand their original query.
  • A visualization scheme that helps the user in understanding how the query
       result ranking was computed.
  • Simple drag-and-drop operations of objects into positive and negative areas
       on the screen for providing relevance feedback information.
       These features support the interactive and iterative nature of the information-seeking process.
    New Information Retrieval Capabilities for Russian Texts Based on the Language Processor Russicon BIBA 367-368
      Serge A. Yablonsky
    New retrieval capabilities for Russian texts based on the language processor Russicon are introduced:
  • "word-changing" search: all word-changing forms of a given word will be
       found;
  • "paradigm" search: all words that are members of a given word paradigm (or
       part of a paradigm) will be found;
  • word search with given grammatical characteristics (part of speech,
       changeability, animation, case, number, gender, person, aspect, tense,
       transition, mood, form, reflexive (verb), length of word-building and
       word-changing stem, etc.);
  • word search with a given word-building (root) and word-changing stem,
       prefixes, suffixes and endings -- "linguistic wild cards", etc.;
  • word search enriched by synonyms from a thesaurus;
  • forming a natural query from a Russian sentence.
       These capabilities are achieved by using the Russian language processor RUSSICON inside retrieval systems. Technical specifications: the processor is realised as a C/C++ library of the following functions: morphological analyzer, normalizer, syntactic analyzer and semantic analyzer, allowing quick generation of different C-based retrieval systems on multiple platforms (DOS/WINDOWS).
    LIBRETTO: An Intelligent Information Provider BIBA 368
      E. J. Yannakoudakis
    Advances in software engineering are making it possible to design systems that are totally open and can also be tailored to the specific needs of an information provision centre. This demonstration will show how we have utilised the concepts: USER-DEFINED ENTITIES, FREE-TEXT RETRIEVAL, THESAURUS, SDI, OPEN SYSTEM and USBC, in order to build the integrated package called LIBRETTO for the total control of a modern library or information centre.

    Posters: Abstracts

    On Lexical Cohesion Patterns, Thesaural Information, and Text Abridgement BIBA 369
      Mohamed Benbrahim; Khurshid Ahmad
    The advent of the information superhighway brings with it a deluge of multi-modal information, particularly textual information, stored on computer systems world-wide. If information retrieval from abstracts was an intractable problem, consider how much harder it is to look for items of information in a distributed corpus of texts. It is important, therefore, to think about text abridgement schemes that can (semi-)automatically summarise texts based not merely on the frequency of keywords in context, but on the linguistic devices authors use to convey messages and to fashion literally hundreds of thousands of words into a coherent whole. The literature on text linguistics and the pragmatics of written communication can be of considerable help here.
       Consider, for example, the work of Michael Hoey, who focuses on 'passages of authentic text' and demonstrates that patterns of lexis operate across sentence boundaries and over considerable distances within and between texts. He has argued that lexis and text are an important level of organisation and that they interact constructively to form a regular contiguous unit. We intend to analyse his work critically from a computational standpoint and to explore whether we can simulate 'lexis and text levels of language organisation'. We show how we can analyse lexical repetition and paraphrasing, by making use of an encyclopaedic thesaurus, to abridge texts of specialist domains. We report on a computer system that can extract key sentences from a non-narrative English text, including sentences used to introduce and to close topics, and sentences that elaborate on the principal themes of the text.
    What You Get Is What You Want: Combining Evidence for Effective Information Filtering BIBA 369
      Susan T. Dumais
    Information filtering refers to the task of selecting objects of interest from an incoming stream of information. As part of NIST/ARPA's TREC Workshop, we used Latent Semantic Indexing (LSI) for filtering 336k documents from diverse sources (newswires, patents, newspapers, technical abstracts) for 50 topics of interest. We developed representations of user interests using two sources of information. A Word Filter used just the words in the topic statements. A RelDocs Filter used just relevant training documents and ignored the topic statement (a variant of relevance feedback). The RelDocs filter vector was 30% more effective than the detailed natural language description of interests. Combining these two vectors provided small additional improvements in filtering. On average, 7 of the top 10 documents are relevant using the combined vector method. Performance can further be improved by continually incorporating relevant test documents into the filter vector. Data combination of the Word and RelDocs retrieved sets was not generally successful in improving performance compared to the best individual method, although we believe it might be if additional sources are used. Both query and data combination methods are quite general and applicable to a variety of filtering applications.
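    One way to combine two interest-profile vectors, as the abstract describes, is an equal-weight sum of unit-normalized vectors (a generic sketch; the equal weighting and the sample terms are assumptions for illustration, not the paper's exact method):

```python
import math

def unit(vec):
    """Scale a term-weight vector to unit length."""
    n = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / n for t, x in vec.items()} if n else dict(vec)

def combine(word_vec, reldocs_vec):
    """Sum the unit-normalized Word and RelDocs vectors, then renormalize,
    so neither information source dominates the resulting filter."""
    out = dict(unit(word_vec))
    for t, x in unit(reldocs_vec).items():
        out[t] = out.get(t, 0.0) + x
    return unit(out)

word_vec = {"satellite": 2.0, "launch": 1.0}   # from the topic statement
reldocs_vec = {"launch": 3.0, "orbit": 2.0}    # from relevant training documents
filt = combine(word_vec, reldocs_vec)
assert set(filt) == {"satellite", "launch", "orbit"}
```

    The combined filter covers terms from both sources, which is the mechanism behind the small additional improvement the abstract reports over either vector alone.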
    Language Processing Techniques for the Implementation of a Document Retrieval System for Turkish Text Databases BIBA 369-370
      F. Cuna Ekmekcioglu; Michael F. Lynch; Peter Willett
    Over the last decade, a certain degree of progress has been achieved in the morphological analysis of Turkish. However, this work has not been used, thus far, to improve the effectiveness of information retrieval systems. This poster considers the development and evaluation of conflation techniques necessary for the implementation of a document retrieval system for Turkish text databases. We have evaluated stemming and n-gram matching for searching six dictionaries of Turkish words. Our results indicate that stemming can bring about substantial reductions in the number of word variants that must be processed in a Turkish free-text retrieval system. Thus, the six Turkish corpora result in a mean compression figure of 78.6%, as against a compression figure of just 36.4% when Porter's algorithm is applied to an English text. The n-gram experiments suggest that trigrams perform slightly better than digrams, and the best results, in terms of minimising the retrieval of inappropriate word variants, are obtained by combining both stemming and n-gram analysis.
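    The kind of n-gram matching used in such conflation experiments can be sketched with a Dice coefficient over character n-grams (a generic illustration of the technique, not the authors' exact implementation; the sample words are invented):

```python
def ngrams(word, n):
    """Set of character n-grams of a word, e.g. digrams (n=2) or trigrams (n=3)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a, b, n=3):
    """Dice similarity between two words based on shared n-grams:
    2 * |shared| / (|grams(a)| + |grams(b)|)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Morphological variants share most of their trigrams; unrelated words few.
assert dice("retrieval", "retrieving") > dice("retrieval", "document")
```

    Combining such n-gram scores with a stemming step, as the abstract reports, filters out word pairs that share a stem by accident while keeping genuine variants.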
    Evaluation of Probabilistic Retrieval Methods BIBA 370
      Fredric C. Gey
    A probabilistic information retrieval method returns documents to the user in descending order of estimated probability of relevance. The advantage of a probabilistic method is that, if the estimate truly reflects the probability of relevance, the user has an additional piece of information upon which to decide when to halt the search. If, for example, the probability of relevance for the 75th-ranked document is 0.01, the user knows that, on average, she will have to examine 100 more documents before finding the next relevant document.
       Recall and precision graphs, as well as average precision over all levels of recall, are the usual methods for evaluating information retrieval performance. With probabilistic retrieval models, however, another measure of performance can be introduced: the accuracy of the probability estimate itself. This presentation shows how the accuracy of the probability estimate can be calibrated and tested with a chi-square test. We test the probability accuracy of four different probabilistic methods which performed well (in terms of average precision) on the TREC3 collection of documents and queries. All of the methods fail a significance test on the accuracy of their probability estimates. We explore the importance of the prior probability of relevance as a reason for the inaccuracy.
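    One way to test such calibration is to bin documents by their estimated probability of relevance and compare observed relevant counts with expected counts via a chi-square statistic (a generic sketch with invented counts, not the authors' TREC data or exact test):

```python
def chi_square_calibration(bins):
    """bins: list of (mean_estimated_probability, n_docs, n_relevant).
    Returns the chi-square statistic over the relevant/non-relevant cells;
    a large value means the probability estimates are poorly calibrated."""
    stat = 0.0
    for p, n, relevant in bins:
        expected_rel = n * p
        expected_non = n * (1 - p)
        stat += (relevant - expected_rel) ** 2 / expected_rel
        stat += ((n - relevant) - expected_non) ** 2 / expected_non
    return stat

# Invented example: well-calibrated estimates give a small statistic,
# overconfident estimates a large one.
calibrated = [(0.5, 100, 50), (0.1, 100, 10)]
overconfident = [(0.9, 100, 50), (0.5, 100, 10)]
assert chi_square_calibration(calibrated) < chi_square_calibration(overconfident)
```

    The statistic would then be compared against a chi-square distribution with the appropriate degrees of freedom to decide significance.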
    A Learning Method for Text Categorization: The Category Discrimination Method BIBA 370-371
      Jeffrey Lee Goldberg
    The Category Discrimination Method (CDM) is a new learning algorithm designed for text categorization. The motivation is that natural language text poses statistical problems when used as input to existing machine learning algorithms: too much noise, too many features, and skewed distributions.
       The basis of the CDM is research on the way humans learn categories and concepts vis-a-vis contrasting concepts. The essential formula is cue validity, borrowed from cognitive psychology and used to select, from all possible single-word-based features, the 'best' predictors of a given category.
       Using a precategorized test collection of text documents, for each category:
  • Determine the 'best' predictors (i.e. features) for the category by computing
       the cue validity of all single word-based features from all training
       documents and selecting those exceeding a threshold.
  • Conduct a multi-stage search over a limited search space to learn how these
       features might best be organized into a logical structure suitable for use
       as a text categorizer.
  • Evaluate the performance of the best categorizer by running it on the test
       documents.
       The hypothesis that the CDM's performance will exceed that of two non-domain-specific algorithms, Bayesian classification and decision-tree learners, is empirically tested.
    VIIP: An Iconic-Indexing Approach for Video BIBA 371
      Hassen Haddad
    The aim of this work is to build up an automatic strategy and model for content-based video indexing. We propose a two-step strategy based on image analysis and on an analogy with textual documents vector-based models. Starting with raw video, the approach first identifies the shots by detecting shot cuts, and builds a set of indexing frames representing the document shots. In a second step, a clustering process is applied on the detected shots leading to another set of final indexing frames taken as representatives of each cluster.
       A prototype implementing this approach has been built. It achieves shot cuts detection and clustering by applying similarity measures between the document frames. It also includes a set of other parameters such as an indexing frame selection criterion.
       This approach has led to a two-level indexing model: a physical and a semantic level, where frames represent structural document elements linked together with one of the "composed-of" or "indexed-with" relationships.
       According to the first evaluation tests of the prototype, the approach and the model make it possible to express some of the semantics held in the document, such as a specific camera motion or an object's orientation or presence.
    The Impact of Information Use in the Context of Pharmaceutical Research and Development BIBA 371
      Lauren Harrison
    This study was designed to identify the degree of fit between the knowledge provided via end-user searching of bibliographic information systems and the information-seeking practices of scientists involved in pharmaceutical research and development. The critical incident technique was utilized to extract information regarding the impact of information retrieved from online bibliographic information resources on its users.
       Scientists (n=10) actively involved in some aspect of pharmaceutical research and development were interviewed. The intent was to determine how online bibliographic database systems are used, how well they serve the needs of their users, and the nature of their impact on those users. Content analysis of interview transcripts shows that users in this context are engaged in publication/report writing or research design when motivated to use online bibliographic databases. The impact of the retrieved information includes increased publication rates, increased credibility in report and information generation, increased credibility in the provision of information to the medical community, and time savings. Respondents stated that having the retrieved information positively impacted their research activities. The evidence suggests that end-user searching has an impact on publication productivity as well as other aspects of productivity.
    Children's Browsing and Keyword Searching on the Science Library Catalog: The Effect of Domain Knowledge on Search Behavior BIBA 371-372
      Sandra G. Hirsh; Christine L. Borgman
    Research has shown that adults' subject domain knowledge influences the way they use information retrieval systems. However, the effect of domain knowledge on children's search behavior has not been investigated. This study examines children's search behavior on the Science Library Catalog, a hypertext-based automated library catalog for elementary school children. The Science Library Catalog provides two ways to search for information: a browsing-oriented search method which allows children to navigate through science knowledge hierarchies and a keyword search method which allows children to type in their search queries. We focus on the effect of science domain knowledge on children's search performance, search behavior, and learning as they look for science books on this system. Data were collected through one-on-one interviews, direct observation, and online monitoring of search sessions. We are using a pattern matching program to evaluate sequences of search moves in the monitoring logs and to help us understand how and when children use browsing and keyword search methods. This dissertation will contribute to our understanding of children's search behavior and the factors which influence their behavior. This research also has implications for information retrieval system evaluation and interface design.
    Image Attributes: An Investigation BIBA 372
      Corinne Jorgensen
    With the rapid expansion in imaging technologies, access to collections of digital images is a subject of major interest. Indexing systems and computerized retrieval for images both need data concerning typically described image attributes. To date, there is little research upon which to base choices as to which attributes should be included in these systems. This research is investigating attributes typically described in several types of tasks using pictorial images. Participants performed descriptive, categorizing, and searching tasks, and word and phrase data were subjected to content analysis. Forty-two image attributes and nine higher level attribute classes were described. The data suggest that indexing of literal object is of prime significance, as is indexing of the human form and other human characteristics. "Content/Story" and other abstract attributes are also typically described, suggesting that image indexing may benefit by application of concepts associated with indexing of fiction. Term variability is less than might have been expected, suggesting some constraints may exist on the process of communicating about visually perceived data.
    Relevance Feedback: Usage, Usability, Utility BIBA 372
      Jurgen Koenemann
    I present two experiments that investigate the interactive searching behavior of two groups of people using a best-match, ranked-output retrieval engine (INQUERY) to search a large, full-text document collection.
       The group for the first experiment consisted of ten users experienced in the use of traditional, boolean online retrieval systems who were novices in the use of best-match, ranked output systems. I describe their behavior and retrieval performance for five searches each in the context of the TREC-3 routing task with a special focus on their use of relevance feedback.
       The second experiment has been designed to analyze the contribution of relevance feedback more closely: a baseline system without relevance feedback is contrasted with three versions of relevance feedback systems that systematically vary user knowledge and user control with regard to relevance feedback but otherwise maintain the same interface, the same retrieval engine, the same full-text document collection (75000 Wall Street Journal articles from the TIPSTER collection), and the same search topics. I present an initial analysis of behavioral data and retrieval performance data gathered from 60 end-users with no training in information retrieval who each performed searches on the baseline system and one of the relevance feedback systems.
    An Automatic Method for Document Structuring BIBA 372-373
      Nicolas Masson
    This article outlines a method for the structuring of expository texts which are not explicitly structured -- no sections or subsections are present. We first perform a quantitative segmentation consisting of finding the topic boundaries. This stage uses the tf.idf coherence measure, a standard measure in Information Retrieval. Quantitative segmentation isolates sets of paragraphs which are "topically coherent"; that is, this process divides the text into several distinct developments or parts. We then perform a qualitative segmentation by establishing the nature of the relations, such as cause, illustration, conclusion and explanation, which hold between thematic blocks and between sentences inside each block. To achieve this, we developed a linguistic analysis based on clue-word detection that makes use of the thematic boundaries previously obtained. These lexical clues can be connectors (e.g. thus), variable expressions (e.g. anteposed prepositional phrases), invariable ones (e.g. in conclusion), punctuation (e.g. interrogation), verbal tenses and moods (e.g. conditional), or verbs (e.g. to introduce). This text structuring method is the first component of a system for the automatic generation of abstracts.
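    The quantitative segmentation step can be sketched as computing a tf.idf-weighted cosine between adjacent text blocks and placing topic boundaries where coherence drops below a threshold (a simplified illustration of the idea; the toy blocks and threshold are invented, not the paper's procedure):

```python
import math
from collections import Counter

def tfidf_vectors(blocks):
    """Weight each block's term counts by inverse document frequency
    computed over the blocks themselves."""
    df = Counter()
    for block in blocks:
        df.update(set(block))
    n = len(blocks)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(b).items()}
            for b in blocks]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def boundaries(blocks, threshold=0.1):
    """Indices i such that a topic boundary falls between block i and i+1."""
    vecs = tfidf_vectors(blocks)
    return [i for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]

# Invented toy text: two blocks about parsing, then two about retrieval.
blocks = [["parser", "grammar", "parse"], ["grammar", "parser", "syntax"],
          ["index", "query", "retrieval"], ["query", "index", "ranking"]]
assert boundaries(blocks) == [1]
```

    The qualitative step then labels the relations between the blocks these boundaries delimit.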
    Ambiguity of Negation in Natural Language Queries BIBA 373
      April McQuire; Caroline M. Eastman
    We address the problem posed by the handling of negation in natural language queries to information retrieval systems. Negated constructs tend to be ambiguous and difficult to handle in both vector space and Boolean systems. A major problem is identifying the intended scope of the negation. A survey was conducted using sample requests typical of those that might be posed to an information retrieval system. The responses indicate that subjects generally agreed on appropriate scope for some negated constructs but did not agree on others. In general, constructs more complicated than a conjunction of noun phrases were found to be ambiguous; most of these involved prepositional phrases. These results indicate that it is not possible for a natural language interface to automatically translate all instances of negation and that perhaps a clarification dialog should be used. Future work planned includes the design of a natural language system using such a clarification dialog to handle negation and the examination of potentially ambiguous constructs involving negation in a collection of real queries.
    Navigation-Based Passage Retrieval BIBA 373
      Massimo Melucci
    This work focuses on the navigation of hypertexts for Passage Retrieval (PR). In particular, hypertexts that are automatically constructed from a large and heterogeneous collection of full-text documents have been considered in order to extract node-passages relevant to the user's information requirements. Full-text documents cover different subjects and are therefore containers of ambiguous words. In retrieving passages we have to select those excerpts that match narrow queries, so a PR technique has to disambiguate the sense of words occurring in full-text documents. Most approaches to PR do not consider the user's query, because passages are often defined beforehand and independently of the query. We are studying a navigation-based technique for PR from collections of large documents. The proposed technique is based on a methodology and a prototype for the automatic construction of hypertexts for IR. Users navigate the automatically constructed hypertext to retrieve the passages that are as close to their requirements as possible. During this navigation, useful information for disambiguating the sense of passage terms becomes available, because each passage term belongs to a semantically meaningful context, namely a passage, and is related to the terms previously visited.
    A Probabilistic Approach to Document Classification BIBA 373-374
      Bernard Merialdo
    We propose a probabilistic approach to document classification and test it on an application where a new article is automatically assigned to a Usenet newsgroup. Each newsgroup is represented by a probabilistic language model (based on unigrams). A Maximum A Posteriori rule is used to decide which newsgroup generated the article with the highest probability. We test this approach on newsgroups dealing with various facets of Artificial Intelligence, trying to guess from the body of an article the precise group it was posted to. First, a set of efficient keywords is automatically extracted from training data using a maximum-precision criterion. This keyword-based approach is compared with the probabilistic approach and evaluated on the same test data, for both recall and precision rates, using various vocabulary sizes. Experiments indicate that the probabilistic approach is more efficient than the keyword-based approach. In the keyword case, increasing the number of keywords always increases the number of documents selected, and thus the recall rate. This is not so in the probabilistic case, because considering more words provides more information to the decision rule, so the size of the vocabulary has to be chosen carefully for maximum efficiency.
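    The MAP decision with per-newsgroup unigram models can be sketched as follows (a minimal illustration with add-one smoothing and invented training posts; the smoothing choice and the toy groups are assumptions, not the paper's exact configuration):

```python
import math
from collections import Counter

def train(posts_by_group):
    """Build a smoothed unigram language model and a prior per newsgroup."""
    vocab = {w for posts in posts_by_group.values() for p in posts for w in p}
    models, priors = {}, {}
    total_posts = sum(len(p) for p in posts_by_group.values())
    for group, posts in posts_by_group.items():
        counts = Counter(w for p in posts for w in p)
        total = sum(counts.values())
        # Laplace (add-one) smoothing over the shared vocabulary.
        models[group] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
        priors[group] = len(posts) / total_posts
    return models, priors, vocab

def classify(article, models, priors, vocab):
    """Maximum A Posteriori: argmax over groups of
    log P(group) + sum of log P(word | group)."""
    def log_post(group):
        model = models[group]
        return math.log(priors[group]) + sum(
            math.log(model[w]) for w in article if w in vocab)
    return max(models, key=log_post)

posts = {"comp.ai.nlp": [["parsing", "grammar"], ["grammar", "corpus"]],
         "comp.ai.vision": [["image", "edge"], ["edge", "camera"]]}
models, priors, vocab = train(posts)
assert classify(["grammar", "parsing"], models, priors, vocab) == "comp.ai.nlp"
```

    Unlike a keyword filter, every vocabulary word contributes evidence to the decision, which is why enlarging the vocabulary does not simply inflate recall.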
    Does Probabilistic Datalog Meet the Requirements of Imaging? BIBA 374
      Thomas Rolleke
    Information retrieval may be described as the process of selecting those documents that logically imply the query. The desired ranking of the documents according to their relevance to the query is obtained by computing a related probability. This computation is based on a mechanism called imaging.
       Probabilistic Datalog enables the modelling of information retrieval as uncertain inference. The expressiveness of probabilistic Datalog is especially suitable for hypermedia retrieval, since it allows for the mapping of the complex structure of hypermedia objects. Classical information retrieval models can be implemented using probabilistic Datalog.
       This paper discusses whether probabilistic Datalog meets the requirements of imaging. What are the impacts of the different views on possible worlds? Should we implement imaging on top of probabilistic Datalog, or should we incorporate imaging into the inference mechanism? An implementation, on top of probabilistic Datalog, of the retrieval example given in (Crestani and Rijsbergen, 1995) illustrates the modelling of imaging when possible worlds are regarded as objects.
    A New Approach for Textual Information Retrieval BIBA 374
      Arnon Rungsawang; Martin Rajman
    In textual information retrieval, the vector space retrieval model has proven its robustness in manipulating large collections of unrestricted natural language text. In our approach, we try to improve the retrieval effectiveness of this model by introducing the notion of distributional semantics. The content of retrievable units or text excerpts, as well as user queries, are represented in a unified way as projections in a vector space of pertinent terms. These projections are derived from co-occurrence matrices computed on large reference text corpora collecting the distributional semantic information. Different measures of similarity may be used to characterize the proximity between user queries and related texts. In our first experiments, we use the cosine similarity measure.
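    The projection of texts and queries through a co-occurrence matrix can be sketched as follows (a toy illustration of distributional semantics; the window size, corpus, and summation scheme are assumptions for exposition, not the authors' implementation):

```python
import math
from collections import Counter, defaultdict

def cooccurrence(corpus, window=2):
    """Term-by-term co-occurrence counts from a reference corpus
    (lists of tokenized sentences)."""
    cooc = defaultdict(Counter)
    for sent in corpus:
        for i, w in enumerate(sent):
            for other in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                cooc[w][other] += 1
    return cooc

def project(text, cooc):
    """Represent a text (or a query) as the sum of its terms'
    co-occurrence vectors."""
    vec = Counter()
    for w in text:
        vec.update(cooc.get(w, {}))
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented reference corpus: 'car' and 'automobile' share contexts, so a
# query using one still matches a text using the other.
corpus = [["car", "engine", "road"], ["automobile", "engine", "road"],
          ["tree", "leaf", "forest"]]
cooc = cooccurrence(corpus)
assert cosine(project(["car"], cooc), project(["automobile"], cooc)) > \
       cosine(project(["car"], cooc), project(["tree"], cooc))
```

    The point of the projection is exactly this: synonyms acquire similar vectors through shared contexts, so query-text proximity no longer requires shared surface terms.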
    CONVECTIS Context Vector-Based Indexing System BIBA 374-375
      Robert V. Sasseen; Joel L. Carleton; William R. Caid
    HNC Software Inc. has developed a system called CONVECTIS for automatically indexing free text documents. CONVECTIS uses HNC's context vector representation of text, which encodes similarity of meaning at the word level, and is learned automatically from free text examples. The key new feature of CONVECTIS is its use of supervised learning based on relevance feedback to tune the system's indexing behavior. This approach to indexing is also directly applicable to the document routing task. CONVECTIS has been demonstrated on datasets of gigabyte size and is currently being used by a large newswire company. The learning procedure typically achieves close to 100% precision and recall on training documents. For test documents not trained on, preliminary results indicate that performance in the 80-90% range for both recall and precision is commonly obtained. CONVECTIS is implemented in a client-server configuration and indexes over 10,000 documents per hour on a dual-CPU Sun Sparc20 (based on 3KB/doc with markup, 5000 index term context vectors).
    First Experiences with a Speech Retrieval System BIBA 375
      Peter Schauble; Martin Wechsler
    We present a speech retrieval system aimed at retrieving information from audio recordings containing speech. The current system contains 4.5 hours of radio news and accepts textual queries. The fully automatic indexing was done using speech recognition techniques. Indexing speech documents is challenging because word boundaries are difficult to detect and recognition errors influence retrieval effectiveness. The indexing process proceeds in the following steps. First, a speaker-dependent phone recognizer produces phonetic transcriptions of the audio recordings. Using those transcriptions, phone sequences of various lengths are selected as indexing features. We have developed an efficient algorithm for selecting indexing features; its output is a set of 5000 phone sequences of medium collection frequency, covering most parts of the audio recordings. The final indexing is done by simply locating phone sequences in the phonetic transcriptions. Queries are entered as text and are indexed similarly using a phonetic dictionary. We show that useful information can be found with the system. Some of the selected features are similar to reduced words and have a positive influence on the indexing. The method can be further improved by taking recognition errors into account either in the selection or in the indexing process.
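    The final indexing step, locating the selected phone sequences in the phonetic transcriptions, can be sketched as a simple inverted index (a schematic illustration; the phone labels and recordings are invented, not the system's actual data):

```python
def occurs(phones, feature):
    """True if the phone sequence `feature` appears contiguously in the
    transcription `phones`."""
    k = len(feature)
    return any(phones[i:i + k] == feature for i in range(len(phones) - k + 1))

def index_recordings(transcriptions, features):
    """Invert the collection: map each selected phone sequence to the set
    of recordings whose phonetic transcription contains it."""
    return {tuple(f): {rec for rec, phones in transcriptions.items()
                       if occurs(phones, f)}
            for f in features}

# Invented phonetic transcriptions and selected feature set.
transcriptions = {"news1": ["n", "uw", "z", "r", "ih", "p"],
                  "news2": ["w", "eh", "dh", "er", "n", "uw", "z"]}
features = [["n", "uw", "z"], ["eh", "dh"]]
index = index_recordings(transcriptions, features)
assert index[("n", "uw", "z")] == {"news1", "news2"}
```

    Textual queries, once converted to phone sequences via a phonetic dictionary, can be matched against the same index.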
    Dynamic Allocation of Signature File for Multimedia Document Using Parallel Devices BIBA 375
      Man-Kwan Shan; Suh-Yin Lee
    The signature file access method is efficient for content-based retrieval in multimedia databases. In a large multimedia database server, parallel devices are utilized to achieve concurrent access. Efficient allocation of the signature file on parallel devices minimizes query response time and is important in the design of large multimedia databases.
       In this paper, we propose a new dynamic allocation technique to distribute the signature file over parallel devices. It is an improvement on a previous approach, the Fragmented Signature File. While the Fragmented Signature File distributes the partitioned frame signature file using Quick Filter, the proposed Parallel Signature File distributes it using a disk allocation technique.
       The proposed Parallel Signature File has several advantages. First, the qualified blocks are distributed more uniformly than in the Fragmented Signature File. Second, it can be used in a dynamic environment. Third, the blocks allocated to each processing unit can be clustered to reduce disk random access time.
       Performance analysis shows that the proposed approach outperforms the Fragmented Signature File and is not far from the theoretically optimal response time. Moreover, the Parallel Signature File improves significantly on the Fragmented Signature File, especially in the multimedia application domain.
    Document Expansion Applied to Classification: Weighting of Additional Terms BIBA 375-376
      Jean-David Sta
    In information retrieval, documents can be described by the terms they contain. To improve retrieval effectiveness, the expansion method extends these terms with related terms. It comprises two steps: selecting and weighting the terms to be added. This experiment compares different schemes for weighting the new terms of the expansion in automatic document classification, a problem somewhat similar to information retrieval. Document expansion is performed automatically using a thesaurus.
       Most weighting methods assign each new term a weight equal to that of the expanded term multiplied by a constant. This model has the disadvantage of modifying the similarity between documents even when the information provided by the expansion should not interfere. The model proposed here satisfies this similarity-invariance constraint and substantially improves document classification. When the similarity measure is the cosine, the solution space of the constraint, applied to the expansion of a term t into q terms, is the hypersphere whose radius is the weight w of t. The expansion function is then linear. One solution, tested here, is to assign each of the q terms the weight w divided by the square root of q.
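The weighting rule in the last sentence is concrete enough to check: giving each of the q expansion terms the weight w/√q keeps the expanded terms' total contribution to the document's Euclidean norm equal to w, the radius of the hypersphere mentioned above, so cosine similarities are not distorted by the expansion. A minimal sketch (function name is illustrative):

```python
import math

def expand_weights(w, q):
    """Assign each of the q expansion terms the weight w / sqrt(q), so
    that the expanded terms' contribution to the document's Euclidean
    norm equals the original term weight w (the radius-w hypersphere
    constraint from the abstract)."""
    return [w / math.sqrt(q)] * q

weights = expand_weights(2.0, 4)   # expand one term of weight 2 into 4 terms
norm = math.sqrt(sum(x * x for x in weights))
# norm == 2.0: the expansion preserves the norm contribution, so cosine
# similarity between documents is left unchanged by the expansion itself
```

By contrast, the constant-multiplier scheme criticized in the abstract would contribute a norm of c·w·√q, which grows with q and distorts the cosine.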

    Post-Conference Research Workshops

    VIRI: Visual Information Retrieval Interfaces BIBA 377
      Robert H. Korfhage; Xia Lin; David S. Dubin
    A visual information retrieval interface (VIRI) is defined as one that uses graphic elements in addition to text to aid the solution of a problem related to information storage and retrieval. More than twenty such interfaces already exist, with different retrieval models, graphical metaphors, and user interactions. Furthermore, the interfaces have different strengths, for example, retrieval, browsing, and document classification. The focus of the workshop is to exchange information, and to begin development of a method for comparing these interfaces. Researchers and practitioners who are actively working on VIRI projects are particularly invited to participate. Some effort will be put into developing a classification scheme for VIRIs and identifying major research issues related to visual interfaces. Following this the discussion will center on identifying test collections and developing experimental tasks and measures that will provide a sound basis for comparing and evaluating the interfaces.
    Z39.50 and the IR Research Community BIBA 377-378
      Clifford Lynch; Ray Larson
    The Z39.50 computer-to-computer retrieval protocol is an increasingly mature US national standard (version 3 was in the ballot process as of early 1995); it is widely implemented in the US and increasingly seeing use internationally, particularly in Europe. Z39.50 is potentially of great importance to the IR research community for several reasons:
  • Because Z39.50 provides a means of separating a user interface from a
       retrieval system, it allows research in clients and user interfaces to
       proceed independently from research in back-end retrieval engines, and, of
       particular importance, allows new user interfaces to be tested against very
       large production databases. It also allows new experimental retrieval
       systems to be offered to large user communities through familiar interfaces.
  • Z39.50 can form the linkage between a number of large-scale research projects
       that involve the IR community, such as the various Digital Library efforts.
  • Z39.50 raises, and provides a concrete framework to explore, a number of
       important research issues in its own right about the design of interoperable
       clients and servers for information retrieval, the representation and
       exchange of metadata about information servers, and related matters.
    The workshop has several goals:
  • To introduce the broad IR community to Z39.50, including its history, its
       current status, its function, and implementation progress;
  • To highlight several IR research projects that are exploiting Z39.50 today;
  • To sketch some of the research issues that are raised by Z39.50.
    After an introduction delineating the history of Z39.50 and the current status of implementations, a short tutorial will explain the operation of the protocol. The second part of the workshop will include two panels: one on the use of Z39.50 to support IR research, and another on research issues in information retrieval protocols. Attendees will be invited to contribute to the discussion.
    Information Retrieval and Databases BIBA 378
      David Harper; Peter Schauble
    The integration of database management systems and information retrieval systems is of great practical interest. However, hard research problems remain to be solved. The aim of the workshop is to help the information retrieval community understand the integration problems and to set up a research agenda.
       The workshop will include short presentations on the following topics:
       Architecture: loosely coupled, tightly coupled, total integration; does the DBMS control the IRS or vice versa; support for distributed computing.
       Retrieval Model and Query Language: reconciling classical DB retrieval and classical (weighted) IR retrieval; retrieval models taking advantage of DB schema; treating DB attributes in an IR way, e.g. in a probabilistic way; integration query languages for IR/DB systems; query processing/optimization.
       Concurrency Control and Transaction Management: concurrency control on the IR index; is ACID enough, or is it too much, for IR; new transaction models (nested transactions); long-lived transactions (for indexing).
       Performance: new access structures; new buffering schemes (caches); retrieval performance on dynamic data; insertion, deletion, modification performance; scalability (parallel architectures); identify bottlenecks.
       After the presentations, attendees will participate in round table discussions about each topic. To allow this to proceed in a workshop atmosphere, the workshop is restricted to 30 participants.
    Curriculum Development in Computer Information Science: A Framework for Developing a New Curriculum in IR BIBA 378-379
      Edward A. Fox; Doris K. Lidtke; Michael C. Mulder; Edie M. Rasmussen; Kazem Taghva
    In this one-day workshop, Doris Lidtke and Michael Mulder will report on their extensive experience in the development of new curricula in computer information science, emphasizing preparation of students to deal with large scale information systems AND new paradigms of learning/teaching. Topics to be covered by the workshop leaders include: (1) involvement of the stakeholders -- employers, faculty, and instructional/curriculum designers; (2) determining content -- both depth and breadth; (3) validation by the stakeholders; (4) packaging -- knowledge units vs. courses; (5) special delivery mechanisms, and (6) essential/desired infrastructure to support the new/revised curriculum.
       These topics will provide a framework for discussion of curriculum development in information retrieval. Individuals and groups representing various points of view (library and information science, computing science, MIS, information systems, business, government and academia) will be invited to prepare submissions and act as group leaders. An opportunity will be provided for attendees to participate in working groups developing an IR curriculum in their area of interest.
    IR and Automatic Construction of Hypermedia BIBA 379
      Maristella Agosti; James Allan
    The workshop will address IR methods and tools that can be used in the automatic construction of a hypermedia base to produce an informative hypertext collection of documents that can be searched and browsed by content. Passage retrieval is one method that can be used to segment the flat documents of a collection for hypermedia information retrieval design. This method, as well as other methods for automatic authoring of hypermedia bases, will be presented and discussed in the workshop. Both techniques that construct a hypertext from an unlinked set of data and those that can be applied to an existing hypertext/media base, permitting augmentation of its set of links, are relevant to the workshop. Typing of links in the resulting hypertext needs to be addressed, as does support for both static and dynamic links. The workshop will also address evaluation of the quality of hypertext collections and of their construction.
       After the presentations of a few position papers, the participants will discuss specific methods or other topics of interest. The workshop will conclude with the approval of a short working paper presenting all the methods that the participants deem useful for automatic construction of hypermedia.