| B. C. Brookes: In Memoriam | | BIB | PDF | 1 | |
| Nicholas J. Belkin | |||
| The Significance of the Cranfield Tests on Index Languages | | BIB | PDF | 3-12 | |
| Cyril W. Cleverdon | |||
| Complete Formal Model for Information Retrieval Systems | | BIB | PDF | 14-20 | |
| Jean Tague; Airi Salminen; Charles McClellan | |||
| Automatic Text Structuring and Retrieval -- Experiments in Automatic Encyclopedia Searching | | BIBA | PDF | 21-30 | |
| Gerard Salton; Chris Buckley | |||
| Many conventional approaches to text analysis and information retrieval prove ineffective when large text collections must be processed in heterogeneous subject areas. An alternative text manipulation system is outlined that is useful for the retrieval of large heterogeneous texts and for the recognition of content similarities between text excerpts; it is based on flexible text matching procedures carried out in several contexts of different scope. The methods are illustrated by search experiments performed with the 29-volume Funk and Wagnalls encyclopedia. | |||
| The Use of Phrases and Structured Queries in Information Retrieval | | BIBA | PDF | 32-45 | |
| W. Bruce Croft; Howard R. Turtle; David D. Lewis | |||
| Both phrases and Boolean queries have a long history in information retrieval, particularly in commercial systems. In previous work, Boolean queries have been used as a source of phrases for a statistical retrieval model. This work, like the majority of research on phrases, resulted in little improvement in retrieval effectiveness. In this paper, we describe an approach where phrases identified in natural language queries are used to build structured queries for a probabilistic retrieval model. Our results show that using phrases in this way can improve performance, and that phrases that are automatically extracted from a natural language query perform nearly as well as manually selected phrases. | |||
| Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing | | BIBA | PDF | 46-56 | |
| Norbert Fuhr; Ulrich Pfeifer | |||
| We distinguish model-oriented and description-oriented approaches in probabilistic information retrieval. The former refer to certain representations of documents and queries and use additional independence assumptions, whereas the latter map documents and queries onto feature vectors which form the input to certain classification procedures or regression methods. Description-oriented approaches are more flexible with respect to the underlying representations, but the definition of the feature vector is a heuristic step. In this paper, we combine a probabilistic model for the Darmstadt Indexing Approach with logistic regression. Here the probabilistic model forms a guideline for the definition of the feature vector. Experiments with the purely theoretical approach and with several heuristic variations show that heuristic assumptions may yield significant improvements. | |||
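
As an illustration of the description-oriented half of this combination, here is a minimal logistic-regression sketch using plain gradient descent. The features (normalized tf, idf, a title flag) and the training data are illustrative assumptions, not the Darmstadt Indexing Approach itself; only the general regression step the abstract refers to is shown.

```python
# A minimal sketch (not the authors' implementation) of description-oriented
# probabilistic indexing: each (term, document) pair is mapped to a feature
# vector x, and logistic regression estimates P(term is a correct index term | x).
# The feature definitions below are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=500):
    """Plain gradient-descent logistic regression: X is (n, d), y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Hypothetical features for a (term, document) pair:
# [normalized within-document tf, inverse document frequency, occurs-in-title flag]
X = np.array([[0.80, 2.1, 1.0],
              [0.10, 0.4, 0.0],
              [0.50, 1.7, 0.0],
              [0.05, 0.2, 0.0]])
y = np.array([1, 0, 1, 0])     # indexing decisions used as training data

w, b = train_logistic(X, y)
print(sigmoid(X @ w + b))      # estimated indexing probabilities
```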
| Some Inconsistencies and Misnomers in Probabilistic Information Retrieval | | BIBA | PDF | 57-61 | |
| William S. Cooper | |||
| The probabilistic theory of information retrieval involves the construction of mathematical models based on statistical assumptions of various sorts. One of the hazards inherent in this kind of theory construction is that the assumptions laid down may be inconsistent with the data to which they are applied. Another hazard is that the stated assumptions may not be the real assumptions on which the derived modelling equations or resulting experiments are actually based. Both kinds of error have been made repeatedly in research on probabilistic information retrieval. One consequence of these lapses is that the statistical character of certain probabilistic IR models, including the so-called 'binary independence' model, has been seriously misapprehended. | |||
| Generative Models for Bitmap Sets with Compression Applications | | BIBA | PDF | 63-71 | |
| Abraham Bookstein; Shmuel T. Klein | |||
| In large IR systems, information about word occurrence may be stored as a bit matrix, with rows corresponding to different words and columns to documents. Such a matrix is generally very large and very sparse. New methods for compressing such matrices are presented, which exploit possible correlation between rows and between columns. The methods are based on partitioning the matrix into small blocks and predicting the 1-bit distribution within a block by means of various bit generation models. Each block is then encoded using Huffman or arithmetic coding. Preliminary experimental results indicate improvements over previous methods. | |||
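
A rough sketch of the block-partitioning idea follows. The prediction model (a simple rank-1 estimate built from row and column densities) is an illustrative assumption rather than one of the paper's bit-generation models, and the sketch measures the ideal arithmetic-code length instead of producing an actual code stream.

```python
# Split a sparse term/document bit matrix into small blocks, predict a 1-bit
# probability for each block from its row and column densities, and sum the
# bits an ideal arithmetic coder would need under that prediction.
import numpy as np

def ideal_code_length(block, p):
    """Bits an ideal arithmetic coder needs for `block` under Bernoulli(p)."""
    p = min(max(p, 1e-6), 1 - 1e-6)          # guard against log(0)
    ones = block.sum()
    zeros = block.size - ones
    return -(ones * np.log2(p) + zeros * np.log2(1 - p))

def compressed_size(matrix, b=8):
    rows, cols = matrix.shape
    row_density = matrix.mean(axis=1)
    col_density = matrix.mean(axis=0)
    total_bits = 0.0
    for i in range(0, rows, b):
        for j in range(0, cols, b):
            block = matrix[i:i + b, j:j + b]
            # Rank-1 (independence) estimate of the block's 1-bit probability.
            p = row_density[i:i + b].mean() * col_density[j:j + b].mean() / matrix.mean()
            total_bits += ideal_code_length(block, p)
    return total_bits

rng = np.random.default_rng(0)
m = (rng.random((128, 256)) < 0.02).astype(np.uint8)   # sparse occurrence matrix
print(compressed_size(m), "bits vs", m.size, "raw bits")
```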
| Posting Compression in Dynamic Retrieval Environments | | BIBA | PDF | 72-81 | |
| IJsbrand Jan Aalbersberg | |||
| This paper describes a posting compression technique to be used in dynamic full-text document retrieval environments. The compression technique being presented is applicable in main-memory document retrieval systems, and consists of two parts. First there is the efficient use of auxiliary tables, and second there is the application of the well-known rank-frequency law of Zipf. It is shown that on the basis of this law term weights can be approximated, and thus that their explicit storage can be avoided. | |||
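
The Zipf-based approximation can be sketched as follows; the weighting formula, the normalizer, and the collection statistics are illustrative assumptions, not the paper's exact scheme.

```python
# If term frequencies follow Zipf's law, f(r) ~ C / r, then a term's approximate
# frequency (and hence a frequency-based weight) can be recomputed from its rank
# alone, so per-posting weights need not be stored explicitly.
import math

def zipf_frequency(rank, total_tokens, vocab_size):
    """Approximate collection frequency of the term with the given rank."""
    harmonic = sum(1.0 / r for r in range(1, vocab_size + 1))  # normalizer
    return total_tokens / (rank * harmonic)

def approx_idf(rank, total_tokens, vocab_size, n_docs):
    """Hypothetical idf-style weight reconstructed from the rank alone."""
    est_freq = zipf_frequency(rank, total_tokens, vocab_size)
    est_doc_freq = min(n_docs, est_freq)      # crude upper-bound assumption
    return math.log(n_docs / max(est_doc_freq, 1.0))

# Example: a 1M-token collection, 50k distinct terms, 10k documents.
for rank in (1, 10, 100, 1000):
    print(rank, round(zipf_frequency(rank, 1_000_000, 50_000), 1),
          round(approx_idf(rank, 1_000_000, 50_000, 10_000), 3))
```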
| A Hybrid Bilevel Image Decode Algorithm for Group 4 FAX | | BIBA | PDF | 82-91 | |
| Chengjie Luo; Clement Yu | |||
| The modified READ code is a two-dimensional coding scheme standardized by CCITT to compress black and white pictures. Existing decompression algorithms process the compressed data bit-by-bit. In this paper, we propose a hybrid decompression algorithm which processes most of the compressed data byte-by-byte. The remaining data is processed bit-by-bit. It is known statistically that the former situation, where byte-by-byte processing occurs, happens much more often than the latter situation, where bit-by-bit processing takes place. Thus, decompression will be speeded up by the proposed algorithm. | |||
| The CORE Electronic Chemistry Library | | BIBA | PDF | 93-112 | |
| Michael Lesk | |||
| A major online file of chemical journal literature complete with graphics is being developed to test the usability of fully electronic access to documents. The test file will include ten years of the American Chemical Society's online journals, supplemented with the graphics from the paper publication, and the indexing of the articles from Chemical Abstracts. Our goals are (1) to assess the effectiveness and acceptability of electronic access to primary journals as compared with paper, and (2) to identify the most desirable functions of the user interface to an electronic system of journals, including in particular a comparison of page image display with ASCII display interfaces. This paper describes the chemical journal data, the interfaces for searching and reading it, and the experiments being done. | |||
| Retrieval Algorithm Effectiveness in a Wide Area Network Information Filter | | BIBA | PDF | 114-122 | |
| H. P. Frei; M. F. Wyle | |||
| We present an application of the usefulness performance measure in a WAN-based SDI system. Components of two basic indexing and retrieval algorithms are compared experimentally. The components we investigate include indexing token type (words versus N-grams), the amount of word reduction used in indexing, and the use of an indirect similarity component in retrieval. The theoretical basis and implementation of the basic algorithms and variations are discussed. Results indicate that words perform better than N-grams, that S-stemming is better than full-stemming, and that indirect similarity provides an improvement to the cosine measure. Performance improvements are, however, small. | |||
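
For readers unfamiliar with the two indexing-token options being compared, here is a minimal sketch of word versus character 3-gram tokenization under a plain cosine measure. The paper's weighting, S-stemming, and indirect-similarity component are not reproduced, and the example texts are made up.

```python
# Whole-word tokens versus character N-gram tokens, scored with a standard
# cosine similarity over raw term-frequency vectors.
import math
import re
from collections import Counter

def word_tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def ngram_tokens(text, n=3):
    s = re.sub(r"\s+", " ", text.lower())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = "retrieval algorithm effectiveness in a wide area network filter"
qry = "network information filtering algorithms"
print("words  :", cosine(Counter(word_tokens(doc)), Counter(word_tokens(qry))))
print("3-grams:", cosine(Counter(ngram_tokens(doc)), Counter(ngram_tokens(qry))))
```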
| Distributed Representations in a Text Based Information Retrieval System: A New Way of Using the Vector Space Model | | BIBA | PDF | 123-132 | |
| Richard F. E. Sutcliffe | |||
| In this paper we discuss how the Vector Space model of Information Retrieval can be used in a new way by combining connectionist ideas about distributed representations with the concept of propositional structure (semantic case structure) derived from mainstream Natural Language Understanding research. We show how distributed representations may be used to capture both amorphous concept representations and propositional structures and we discuss a prototype Information Retrieval system, PELICAN, which has been constructed in order to experiment with these ideas. | |||
| To See, or Not to See -- Is That the Query? | | BIBA | PDF | 134-141 | |
| Robert R. Korfhage | |||
| Traditional information retrieval systems, in the guise of presenting the most relevant information to the searcher, really put blinders on him. They present certain information to the searcher, but strongly inhibit him from seeing other information, or even knowing of its existence. In this paper we present an argument for a new retrieval paradigm, one that focuses on the organized display of all documents, rather than on the linear display of just the "best." | |||
| Integrating Query, Thesaurus, and Documents through a Common Visual Representation | | BIBA | PDF | 142-151 | |
| Richard H. Fowler; Wendy A. L. Fowler; Bradley A. Wilson | |||
| Document retrieval is a highly interactive process dealing with large amounts of information. Visual representations can both provide a means for managing the complexity of large information structures and support an interface style well suited to interactive manipulation. The system we have designed utilizes visually displayed graphic structures and a direct manipulation interface style to supply an integrated environment for retrieval. A common visually displayed network structure is used for query, document content, and term relations. A query can be modified through direct manipulation of its visual form by incorporating terms from any other information structure the system displays. An associative thesaurus of terms and an interdocument network provide information about a document collection that can complement other retrieval aids. Visualization of these large data structures makes use of fisheye views and overview diagrams to help overcome some of the difficulties of orientation and navigation in large information structures. | |||
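
The fisheye filtering mentioned above can be illustrated with a Furnas-style degree-of-interest rule: a node is displayed only if its a-priori importance minus its distance from the current focus clears a threshold. The graph, importance scores, and threshold below are invented examples, not data from the system described.

```python
# Furnas-style fisheye view over a small term/document network:
# DOI(node | focus) = importance(node) - distance(node, focus).
from collections import deque

def distances_from(graph, focus):
    """Breadth-first shortest-path distances in an unweighted graph."""
    dist = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def fisheye_view(graph, importance, focus, threshold=-1):
    dist = distances_from(graph, focus)
    return {n for n in graph
            if importance.get(n, 0) - dist.get(n, float("inf")) >= threshold}

graph = {"query": {"retrieval", "thesaurus"},
         "retrieval": {"query", "indexing", "ranking"},
         "thesaurus": {"query", "synonymy"},
         "indexing": {"retrieval"}, "ranking": {"retrieval"}, "synonymy": {"thesaurus"}}
importance = {"query": 3, "retrieval": 2, "thesaurus": 2,
              "indexing": 1, "ranking": 1, "synonymy": 0}
print(fisheye_view(graph, importance, focus="query"))
```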
| A Case-Based Architecture for a Dialogue Manager for Information-Seeking Processes | | BIBAK | PDF | 152-161 | |
| Anne Tissen | |||
| In this paper, we propose a case-based architecture for a dialogue manager. The dialogue manager is one of the main components of the cognitive layer of an interface system for information-seeking processes. Information-seeking is a highly exploratory and navigational process and therefore needs elaborate interaction functionality. In our approach, this functionality will be provided by the dialogue manager operating on a set of case-based dialogue plans. In a case-based planning system a new plan will be generated by retrieving the plan which is most appropriate to the user's goals and adapting it dynamically during the ongoing dialogue. We propose a case-based architecture for two reasons. First, operating on old solutions provides a coherent framework which prevents the user from being 'lost in hyperspace'. Second, it allows flexible adaptations: domain-dependent ones, using perspectives on domain objects, and domain-independent ones, that change the sequence of dialogue steps. Keywords: Case-based reasoning, Human-computer interaction, Information-seeking | |||
| Addressing the Requirements of a Dynamic Corporate Textual Information Base | | BIBA | PDF | 163-172 | |
| Peter G. Anick; Rex A. Flynn; David R. Hanssen | |||
| AI-STARS is a lexicon-assisted full-text Information Retrieval system, designed for use in a dynamic corporate environment. In this paper, we explore how the requirements of such an environment have influenced many key aspects of the design and implementation of the AI-STARS system. We promote the use of "views" to create logical partitions in large, heterogeneous databases, and argue that storing not only article instances, but also class definitions, stored queries, display templates and linguistic data in a single object repository has consequences that can be exploited for schema and lexicon evolution, security and subject filtering, information navigation, and data distribution. | |||
| Data Conversion, Aggregation and Deduction for Advanced Retrieval from Heterogeneous Fact Databases | | BIBA | PDF | 173-182 | |
| Kalervo Jarvelin; Timo Niemi | |||
| Modern distributed fact databases are heterogeneous and autonomous. Their heterogeneity is due to many reasons, including varying data models, data structures, attribute naming conventions, units of measurement or naming of data values, composition of data as attributes, technical representation of data, abstraction levels of data, etc. Database autonomy means that the database users have hardly any means for reducing such heterogeneity. Present information retrieval (IR) systems either provide no support for overcoming such heterogeneity or their support is insufficient and difficult to utilize. In this paper we offer integrated and powerful data conversion, aggregation, and deduction techniques for advanced IR in such environments. These techniques allow the users to overcome data inconsistency due to units of measurement or naming of data values, composition of data as attributes, abstraction levels of data, and difficulties related to deductive use of hierarchically classified data. In complex situations, all these inconsistencies appear together. Therefore we also show how these techniques are integrated into a powerful query language which has been implemented in Prolog in a workstation environment. | |||
| Querying Office Systems about Document Roles | | BIBA | PDF | 183-190 | |
| A. Celentano; M. G. Fugini; S. Pozzi | |||
| This paper describes the architecture of a document retrieval system integrating classical IR features with knowledge about the procedural and application context where documents are used. The paper focuses on the query language, which allows the user to pose queries involving the analysis of both the semantic network, where procedures, office agents, and events of the office context are represented as elements accessing, modifying, filing, and manipulating documents, and the document contents, i.e. their text. The coupling of the query system with a browser tool is also discussed. The system relies on a knowledge representation model for documents and document roles developed in previous phases of the research. | |||
| Query Modification and Expansion in a Network with Adaptive Architecture | | BIBA | PDF | 192-201 | |
| K. L. Kwok | |||
| This paper shows how a network view of probabilistic information indexing and retrieval with components may implement query expansion and modification (based on user relevance feedback) by growing new edges and adapting weights between queries and terms of relevant documents. Experimental results with two collections and partial feedback confirm that the process can lead to much improved performance. Learning from irrelevant documents however was not effective. | |||
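
A generic sketch of feedback-driven expansion in the spirit of the abstract: terms from documents judged relevant either get new query-term "edges" or have their existing weights strengthened. The update rule and constants are assumptions for illustration, not Kwok's probabilistic network model.

```python
# Grow new query-term edges and adapt existing weights from relevance feedback.
from collections import Counter

def expand_query(query_weights, relevant_docs, boost=0.5, max_new_terms=5):
    """query_weights: {term: weight}; relevant_docs: list of token lists."""
    feedback = Counter()
    for doc in relevant_docs:
        feedback.update(set(doc))                 # count relevant docs containing term
    expanded = dict(query_weights)
    for term, df in feedback.items():
        if term in expanded:
            expanded[term] += boost * df          # strengthen an existing edge
    new_terms = [(t, df) for t, df in feedback.most_common() if t not in expanded]
    for term, df in new_terms[:max_new_terms]:
        expanded[term] = boost * df               # grow a new query-term edge
    return expanded

query = {"network": 1.0, "retrieval": 1.0}
relevant = [["adaptive", "network", "feedback", "weights"],
            ["query", "expansion", "feedback", "network"]]
print(expand_query(query, relevant))
```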
| Using the Cosine Measure in a Neural Network for Document Retrieval | | BIBA | PDF | 202-210 | |
| Ross Wilkinson; Philip Hingston | |||
| The task of document retrieval systems is to match one natural language query against a large number of natural language documents. Neural networks are known to be good pattern matchers. This paper reports our investigations in implementing a document retrieval system based on a neural network model. It shows that many of the standard strategies of information retrieval are applicable in a neural network model. | |||
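
The core observation behind such a network can be shown in a few lines: if the term-to-document connection weights are the L2-normalized document term weights, then spreading a normalized query activation through a single layer yields exactly the cosine scores. The toy data below is made up and the sketch is not the paper's network.

```python
# Cosine ranking expressed as a one-layer network: term units feed document
# units, connection weights are L2-normalized document term weights.
import numpy as np

def l2_normalize(m, axis):
    norms = np.linalg.norm(m, axis=axis, keepdims=True)
    return np.divide(m, norms, out=np.zeros_like(m), where=norms > 0)

# Rows = documents, columns = terms (e.g. tf weights).
doc_term = np.array([[2.0, 0.0, 1.0, 0.0],
                     [0.0, 1.0, 1.0, 1.0],
                     [1.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 1.0, 0.0])

weights = l2_normalize(doc_term, axis=1)                     # term -> document links
activation = weights @ l2_normalize(query[None, :], axis=1).ravel()
print(activation)                                            # cosine(query, doc_i)
```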
| Preference Structure, Inference and Set-Oriented Retrieval | | BIBA | PDF | 211-218 | |
| Y. Y. Yao; S. K. M. Wong | |||
| In this paper, a framework for modeling information retrieval is introduced by combining the salient features of many inference-based and set-oriented retrieval models. The degrees of relevance of different subsets of documents are inferred from the user preference judgments on subsets of index terms. In order to demonstrate the usefulness of the proposed framework, the Boolean and the binary vector space models are analyzed. This analysis reveals the structures implicitly used in these models. | |||
| Distributed Indexing: A Scalable Mechanism for Distributed Information Retrieval | | BIBAK | PDF | 220-229 | |
| Peter B. Danzig; Jongsuk Ahn; John Noll; Katia Obraczka | |||
| Despite blossoming computer network bandwidths and the emergence of hypertext and CD-ROM databases, little progress has been made towards uniting the world's library-style bibliographic databases. While a few advanced distributed retrieval systems can broadcast a query to hundreds of participating databases, experience shows that local users almost always clog library retrieval systems. Hence broadcast remote queries will clog nearly every system. The premise of this work is that broadcast-based systems do not scale to world-wide systems. This project describes an indexing scheme that will permit thorough yet efficient searches of millions of retrieval systems. Our architecture will work with an arbitrary number of indexing companies and information providers, and, in the marketplace, could provide economic incentive for cooperation between database and indexing services. We call our scheme distributed indexing, and believe it will help researchers disseminate and locate both published and pre-publication material. We are building and plan to distribute a research prototype for the Internet that demonstrates these ideas. Our prototype will index technical reports and public domain software from dozens of computer science departments around the country. Keywords: Information retrieval, Heterogeneous databases, Resource location, Bibliographic databases | |||
| On the Allocation of Documents in Multiprocessor Information Retrieval Systems | | BIBA | PDF | 230-239 | |
| Ophir Frieder; Hava Tova Siegelmann | |||
| Information retrieval is the selection of documents that are potentially relevant to a user's information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP Complete are presented. A document allocation algorithm for MDAP based on Genetic Algorithms is developed. This algorithm assumes that the documents are clustered using any one of the many clustering algorithms. We define a cost function for the derived allocation and evaluate the performance of our algorithm using this function. As part of the experimental analysis, the effects of varying the number of documents and their distribution across the clusters, as well as the exploitation of various architectural interconnection topologies, are studied. We also experiment with several parameters common to Genetic Algorithms, e.g., the probability of mutation and the population size. | |||
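
An illustrative genetic-algorithm sketch for an allocation problem of this kind: a chromosome assigns each document cluster to a processor, and the fitness here simply penalizes load imbalance. The cost function, operators, and parameters are assumptions, not the authors' MDAP formulation, which also accounts for the interconnection topology.

```python
# Toy GA for assigning document clusters to processors.
import random

def fitness(assign, cluster_sizes, n_procs):
    load = [0] * n_procs
    for cluster, proc in enumerate(assign):
        load[proc] += cluster_sizes[cluster]
    return -(max(load) - min(load))              # higher is better (more balanced)

def crossover(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(assign, n_procs, p_mut):
    return [random.randrange(n_procs) if random.random() < p_mut else g
            for g in assign]

def allocate(cluster_sizes, n_procs, pop_size=40, generations=200, p_mut=0.05):
    n = len(cluster_sizes)
    pop = [[random.randrange(n_procs) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, cluster_sizes, n_procs), reverse=True)
        survivors = pop[:pop_size // 2]
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)),
                           n_procs, p_mut)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=lambda a: fitness(a, cluster_sizes, n_procs))

random.seed(0)
sizes = [120, 75, 300, 50, 220, 90, 180, 60]     # documents per cluster (made up)
best = allocate(sizes, n_procs=4)
print(best, fitness(best, sizes, 4))
```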
| An Object-Oriented Modeling of the History of Optimal Retrievals | | BIBA | PDF | 241-250 | |
| Yong Zhang; Vijay V. Raghavan; Jitender S. Deogun | |||
| Learning techniques are used in IR to exploit user feedback in order that the system can improve its performance with respect to particular queries. This process involves the construction of an optimal query that best separates the documents known to be relevant from those that are not. Since obtaining relevance judgments and constructing an optimal query involve a great deal of effort, in this paper we develop a framework for organizing the history of optimal retrievals. The framework involves the identification of a hierarchy of document classes such that the concepts corresponding to higher level classes are more general than those of the lower level classes. The ways in which such a hierarchy may be used to retrieve answers to new queries are outlined. This approach has the advantage that the query specification is concept-based, whereas the retrieval mechanism is numerically oriented, involving optimal query vectors. It is shown that the construction of a hierarchy of optimal queries can correspond to an object-oriented modeling of IR objects. Furthermore, the resulting model can be easily implemented using a relational DBMS. | |||
| Retrieving Software Objects in an Example-Based Programming Environment | | BIBAK | PDF | 251-260 | |
| Scott Henninger | |||
| Example-based programming is a form of software reuse in which existing code examples are modified to meet current task needs. Example-based programming systems that have enough examples to be useful present the problem of finding relevant examples. A prototype system named CodeFinder, which explores issues of retrieving software objects relevant to the design task, is presented. CodeFinder supports human-computer dialogue by providing the means to incrementally construct a query and by providing associative cues that are compatible with human memory retrieval principles. Keywords: Human-computer interaction, Retrieval, Software reuse, Connectionism, Cooperative problem solving, Information access, Retrieval by reformulation, Associative spreading activation | |||
| A Self-Organizing Semantic Map for Information Retrieval | | BIBA | PDF | 262-269 | |
| Xia Lin; Dagobert Soergel; Gary Marchionini | |||
| A neural network's unsupervised learning algorithm, Kohonen's feature map, is applied to constructing a self-organizing semantic map for information retrieval. The semantic map visualizes semantic relationships between input documents, and has properties of economic representation of data with their interrelationships. The potentials of the semantic map include using the map as a retrieval interface for an online bibliographic system. A prototype system that demonstrates this potential is described. | |||
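
A small Kohonen feature-map sketch in the spirit of the abstract: toy document vectors are mapped onto a 2-D grid so that similar documents land on nearby cells. Grid size, learning schedule, and data are assumptions, not the prototype described in the paper.

```python
# Minimal self-organizing map over toy document vectors.
import numpy as np

def train_som(data, grid=(6, 6), epochs=200, lr0=0.5, radius0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((grid[0], grid[1], data.shape[1]))
    coords = np.array([[i, j] for i in range(grid[0]) for j in range(grid[1])])
    coords = coords.reshape(grid[0], grid[1], 2)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        radius = max(radius0 * (1 - epoch / epochs), 0.5)
        for x in data[rng.permutation(len(data))]:
            dists = np.linalg.norm(weights - x, axis=2)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
            weights += lr * influence * (x - weights)   # pull neighborhood toward x
    return weights

def map_documents(data, weights):
    return [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                             weights.shape[:2]) for x in data]

rng = np.random.default_rng(1)
docs = rng.random((20, 10))                 # 20 toy document vectors, 10 "terms"
weights = train_som(docs)
print(map_documents(docs, weights))         # grid cell assigned to each document
```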
| Incorporating a Semantic Analysis into a Document Retrieval Strategy | | BIBA | PDF | 270-279 | |
| Edgar B. Wendlandt; James R. Driscoll | |||
| Current information retrieval systems focus on the use of keywords to respond to user queries. We propose the additional use of surface level knowledge in order to improve the accuracy of information retrieval. Our approach is based on the database concept of semantic modeling (particularly entities and relationships among entities). We extend the concept of query-document similarity by recognizing basic entity properties (attributes) which appear in text. We also extend query-document similarity using the linguistic concept of thematic roles. Thematic roles allow us to recognize relationship properties which appear in text. We include several examples to illustrate our approach. Test results which support our approach are reported. The test results concern searching documents and using their contents to perform the intelligent task of answering a question. | |||
| Complementary Structures in Disjoint Science Literatures | | BIB | PDF | 280-289 | |
| Don R. Swanson | |||
| An Efficient Directory System for Document Retrieval | | BIBAK | PDF | 291-304 | |
| D. Motzkin | |||
| This paper introduces a file directory structure which provides an efficient access path for document retrieval. The directory structure is based on the multi-B-tree structure. This directory structure is compatible with current automatic retrieval and query processing techniques. Weights that are assigned to index terms can be included in the directory with the terms at no additional cost. In addition, it provides for indexing a secondary attribute within a primary attribute with no additional cost. Updates are achieved with a high degree of efficiency as well. It is shown that this structure achieves a better overall performance than inverted files, standard B-trees, and other directory structures. Keywords: Access methods, Indices, Directories, M-B-T directory, B-trees, Multi-B-trees, Information retrieval, Document retrieval, Database management systems, Non-dense attributes | |||
| Image Query Processing Based on Multi-Level Signatures | | BIBA | PDF | 305-314 | |
| F. Rabitti; P. Savino | |||
| This paper describes the processing of queries, expressing conditions on the content of images, in large image databases. The query language assumes that a semantic interpretation of the image content (i.e. an image symbolic interpretation) is available as a result of an image analysis process. The image query language addresses important aspects of the image interpretations resulting from image analysis, by defining partial conditions on the composition of the complex objects, requirements on their degree of recognition, and requirements on their position in the image interpretation. Particular emphasis is given to the definition of suitable content-based access structures to make query processing more efficient. An approach based on multi-level signatures is adopted. The query is pre-processed on the signatures to filter out most of the images not satisfying the query. Finally, an evaluation of the efficiency and precision of the signature technique is given. | |||
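
The filtering step of a signature scheme can be sketched at a single level: each recognized object is hashed to a few bits of a fixed-width signature, and a query signature is tested with a bitwise AND, which allows false positives but no false negatives. The paper's approach layers several such levels; the hash choice, signature width, and example images below are assumptions.

```python
# Single-level superimposed-coding signature filter for image descriptions.
import hashlib

SIG_BITS = 64
BITS_PER_OBJECT = 3

def object_bits(name):
    digest = hashlib.sha256(name.encode()).digest()
    return {digest[i] % SIG_BITS for i in range(BITS_PER_OBJECT)}

def make_signature(objects):
    sig = 0
    for obj in objects:
        for bit in object_bits(obj):
            sig |= 1 << bit
    return sig

def may_match(image_sig, query_sig):
    """True if the image could satisfy the query (false positives possible)."""
    return image_sig & query_sig == query_sig

images = {"img1": {"house", "tree", "car"},
          "img2": {"boat", "sea"},
          "img3": {"house", "car", "person"}}
signatures = {name: make_signature(objs) for name, objs in images.items()}
query = make_signature({"house", "car"})
print([name for name, sig in signatures.items() if may_match(sig, query)])
```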
| A Two-Level Hypertext Retrieval Model for Legal Data | | BIBA | PDF | 316-325 | |
| Maristella Agosti; Roberto Colotti; Girolamo Gradenigo | |||
| This paper introduces an associative information retrieval model based on the two-level architecture proposed in [Agosti et al, 1989a] and [Agosti et al, 1990], and an experimental prototype developed in order to validate the model in a personal computing environment. In the first part of the paper, related work and motivations are presented. In the second part, the model, entitled EXPLICIT, is introduced. EXPLICIT is based on a two-level architecture which holds the two main parts of the informative resource managed by an information retrieval tool: the collection of documents and the indexing term structure. The term structure is managed as a schema of concepts which can be used by the final user as a frame of reference in the query formulation process. The model supports the concurrent use of different schemas of concepts to satisfy information needs of different categories of users. In the third part of the paper, the main characteristics of the experimental prototype, named HyperLaw, are presented. | |||
| Automatic Generation of "Hyper-Paths" in Information Retrieval Systems: A Stochastic and an Incremental Algorithm | | BIBA | PDF | 326-335 | |
| Alain Lelu | |||
| A hypertext procedure for browsing through documentary databases is proposed, based upon a global synthetic mapping in addition to a set of local scanning axes. A method is developed for automatic generation of these relevant axes: local component analysis. It consists in tracking the local maxima of a "partial inertia" landscape. First, a "neural" algorithm converging after several passes on the data is presented. Then a deterministic one-pass algorithm is deduced, allowing dynamic data-flow analysis. | |||
| Creating Segmented Databases from Free Text for Text Retrieval | | BIBA | PDF | 337-346 | |
| Lisa F. Rau; Paul S. Jacobs | |||
| Indexing text for accurate retrieval is a difficult and important problem. On-line information services generally depend on "keyword" indices rather than other methods of retrieval, because of the practical features of keywords for storage, dissemination, and browsing as well as for retrieval. However, these methods of indexing have two major drawbacks: First, they must be laboriously assigned by human indexers. Second, they are inaccurate, because of mistakes made by these indexers as well as the difficulties users have in choosing keywords for their queries, and the ambiguity a keyword may have. Current natural language text processing (NLP) methods help to overcome these problems. Such methods can provide automatic indexing and keyword assignment capabilities that are at least as accurate as human indexers in many applications. In addition, NLP systems can increase the information contained in keyword fields by separating keywords into segments, or distinct fields that capture certain discriminating content or relations among keywords. This paper reports on a system that uses natural language text processing to derive keywords from free text news stories, separate these keywords into segments, and automatically build a segmented database. The system is used as part of a commercial news "clipping" and retrieval product. Preliminary results show improved accuracy, as well as reduced cost, resulting from these automated techniques. | |||
| Retrieval Performance in FERRET: A Conceptual Information Retrieval System | | BIBA | PDF | 347-355 | |
| Michael L. Mauldin | |||
| FERRET is a full text, conceptual information retrieval system that uses a partial understanding of its texts to provide greater precision and recall performance than keyword search techniques. It uses a machine-readable dictionary to augment its lexical knowledge and a variant of genetic learning to extend its script database. Comparison of FERRET's retrieval performance on a collection of 1065 astronomy texts using 22 sample user queries with a standard Boolean keyword query system showed that precision increased from 35 to 48 percent, and recall more than doubled, from 19.4 to 52.4 percent. This paper describes the FERRET system's architecture, parsing and matching abilities, and focuses on the use of the Webster's Seventh dictionary to increase the system's lexical coverage. | |||
| The Smart Project in Automatic Document Retrieval | | BIBA | PDF | 356-358 | |
| Gerard Salton; Michael E. Lesk; Donna Harman; Robert E. Williamson; Edward A. Fox; Chris Buckley | |||
| The Smart project in automatic text retrieval was started in 1961. It is the oldest continuously running research project in information retrieval. The panel members are all major contributors to the Smart system work. The discussion covers aspects of the Smart system design and examines the past and future significance of some of the research conducted in the Smart environment. | |||