
Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname:Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors:W. Bruce Croft; Alistair Moffat; C. J. van Rijsbergen; Ross Wilkinson; Justin Zobel
Location:Melbourne, Australia
Dates:1998-Aug-24 to 1998-Aug-28
Publisher:ACM
Standard No:ISBN 1-58113-015-5; ACM Order Number 816098SP
Papers:75
Pages:393
  1. Keynote Address
  2. Opening Session
  3. Event Detection and Clustering
  4. Cross-Lingual Retrieval
  5. Categorisation
  6. Distributed Retrieval
  7. Using Structure
  8. Interactive Retrieval
  9. Combination of Evidence
  10. Query and Profile Modification
  11. Information Retrieval Experiments
  12. Retrieval Models
  13. Efficiency
  14. Information Retrieval Experiments
  15. Panel
  16. Posters
  17. Demonstrations

Keynote Address

The Future of Internet Search BIBAPDF 1
  Steve Kirsch
Existing techniques for searching the Internet are becoming less and less practical as the Internet continues to grow in size. To continue to be useful, we need to have some new approaches to the problem of searching extremely large collections. Using Java and a patented distributed searching technique, Infoseek is taking a unique approach to the problem while at the same time enhancing speed, reliability, and accuracy.

Opening Session

Advantages of Query Biased Summaries in Information Retrieval BIBAPDF 2-10
  Anastasios Tombros; Mark Sanderson
This paper presents an investigation into the utility of document summarisation in the context of information retrieval, more specifically in the application of so-called query biased (or user directed) summaries: summaries customised to reflect the information need expressed in a query. Presented in the list of documents returned after a retrieval, the summaries were evaluated in a task-based environment by measuring users' speed and accuracy in identifying relevant documents. This was compared to the performance achieved when users were presented with the more typical output of an IR system: a static predefined summary composed of the title and first few sentences of retrieved documents. The results from the evaluation indicate that the use of query biased summaries significantly improves both the accuracy and speed of user relevance judgements.
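
   For illustration only, a minimal sketch of query-biased sentence extraction (the tokenisation, the sentence splitter and the three-sentence cutoff are assumptions, not the authors' design):

      import re

      def query_biased_summary(document: str, query: str, n_sentences: int = 3) -> str:
          """Select the sentences sharing the most terms with the query."""
          query_terms = set(query.lower().split())
          # Naive sentence splitting; a real system would use a proper tokeniser.
          sentences = re.split(r"(?<=[.!?])\s+", document)
          def overlap(sentence: str) -> int:
              return len(query_terms & set(sentence.lower().split()))
          top = sorted(sentences, key=overlap, reverse=True)[:n_sentences]
          # Keep original document order so the summary reads coherently.
          return " ".join(s for s in sentences if s in top)
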
A Theory of Term Weighting Based on Exploratory Data Analysis BIBAPDF 11-19
  Warren R. Greiff
Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between the document frequency and the weight of evidence is investigated. A correlation between document frequency normalized by collection size and the mutual information between relevance and term occurrence is uncovered. This correlation is found to be robust across a variety of query sets and document collections. Based on this relationship, a theoretical explanation of the efficacy of inverse document frequency for term weighting is developed which differs in both style and content from theories previously put forth. The theory predicts that a "flattening" of idf at both low and high frequency should result in improved retrieval performance. This altered idf formulation is tested on all TREC query sets. Retrieval results corroborate the prediction of improved retrieval performance. In conclusion, we argue that exploratory data analysis can be a valuable tool for research whose goal is the development of an explanatory theory of information retrieval.
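
   The predicted "flattening" can be illustrated with a clamped idf (a sketch; the floor and ceiling values here are placeholders, not the thresholds fitted in the paper):

      import math

      def flattened_idf(df: int, n_docs: int,
                        floor: float = 1.0, ceiling: float = 6.0) -> float:
          """Inverse document frequency clamped at both the low- and
          high-frequency ends, per the paper's qualitative prediction."""
          assert 1 <= df <= n_docs
          idf = math.log(n_docs / df)   # rare term -> large idf
          return min(max(idf, floor), ceiling)
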
New Techniques for Open-Vocabulary Spoken Document Retrieval BIBAPDF 20-27
  Martin Wechsler; Eugen Munteanu; Peter Schauble
This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.

Event Detection and Clustering

A Study on Retrospective and On-Line Event Detection BIBAPDF 28-36
  Yiming Yang; Tom Pierce; Jaime Carbonell
This paper investigates the use and extension of text retrieval and clustering techniques for event detection. The task is to automatically detect novel events from a temporally-ordered stream of news stories, either retrospectively or as the stories arrive. We applied hierarchical and non-hierarchical document clustering algorithms to a corpus of 15,836 stories, focusing on the exploitation of both content and temporal information. We found the resulting cluster hierarchies highly informative for retrospective detection of previously unidentified events, effectively supporting both query-free and query-driven retrieval. We also found that temporal distribution patterns of document clusters provide useful information for improvement in both retrospective detection and on-line detection of novel events. In an evaluation using manually labelled events to judge the system-detected events, we obtained a result of 82% in the F1 measure for retrospective detection, and an F1 value of 42% for on-line detection.
On-Line New Event Detection and Tracking BIBAPDF 37-45
  James Allan; Ron Papka; Victor Lavrenko
We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting -- i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our approach to tracking is similar to typical information filtering methods. We discuss the value of "surprising" features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news.
   New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.
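
   A minimal sketch of single-pass detection (the fixed cosine threshold is a stand-in for the paper's richer, time-sensitive thresholding model):

      import math
      from collections import Counter

      def cosine(a: Counter, b: Counter) -> float:
          dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
          na = math.sqrt(sum(v * v for v in a.values()))
          nb = math.sqrt(sum(v * v for v in b.values()))
          return dot / (na * nb) if na and nb else 0.0

      def online_new_event_flags(stories, threshold=0.2):
          """Flag each incoming story as a new event if no existing cluster
          is similar enough; otherwise fold it into the closest cluster.
          Decisions are strictly on-line: one story at a time."""
          clusters, flags = [], []
          for story in stories:
              vec = Counter(story.lower().split())
              sims = [cosine(vec, c) for c in clusters]
              best = max(range(len(sims)), key=sims.__getitem__, default=None)
              if best is not None and sims[best] >= threshold:
                  clusters[best].update(vec)
                  flags.append(False)
              else:
                  clusters.append(vec)
                  flags.append(True)   # nothing close: declare a new event
          return flags
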
Web Document Clustering: A Feasibility Demonstration BIBAPDF 46-54
  Oren Zamir; Oren Etzioni
Users of Web search engines are often forced to sift through the long ordered list of documents returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines.
   The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents.
   To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
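
   A simplified rendering of STC's first phase (base clusters from shared phrases), using explicit n-gram enumeration in place of the suffix tree, so the linear-time property is not preserved:

      from collections import defaultdict

      def base_clusters(snippets, max_phrase_len=4):
          """Group snippets by the phrases they share; score each group by
          its size weighted by phrase length, as STC does."""
          by_phrase = defaultdict(set)   # phrase -> ids of snippets containing it
          for doc_id, text in enumerate(snippets):
              words = text.lower().split()
              for n in range(1, max_phrase_len + 1):
                  for i in range(len(words) - n + 1):
                      by_phrase[" ".join(words[i:i + n])].add(doc_id)
          scored = [(len(ids) * len(phrase.split()), phrase, sorted(ids))
                    for phrase, ids in by_phrase.items() if len(ids) > 1]
          return sorted(scored, reverse=True)

   STC then merges base clusters whose document sets overlap heavily; that step is omitted here.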

Cross-Lingual Retrieval

The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval BIBAPDF 55-63
  Ari Pirkola
In this study, the effects of query structure and various setups of translation dictionaries on the performance of cross-language information retrieval (CLIR) were tested. The document collection was a subset of the TREC collection, and as test requests the study used TREC's health-related topics. The test system was the INQUERY retrieval system. The performance of translated Finnish queries against English documents was compared to the performance of original English queries against English documents. Four natural language query types and three query translation methods, using a general dictionary and a domain-specific (medical) dictionary, were studied. There was only a slight difference in performance between the original English queries and the best cross-language queries, i.e., structured queries with medical dictionary and general dictionary translation. The structuring of queries was based on the output of the dictionaries.
Resolving Ambiguity for Cross-Language Retrieval BIBAPDF 64-71
  Lisa Ballesteros; W. Bruce Croft
One of the main hurdles to improved CLIR effectiveness is resolving ambiguity associated with translation. Availability of resources is also a problem. First we present a technique based on co-occurrence statistics from unlinked corpora which can be used to reduce the ambiguity associated with phrasal and term translation. We then combine this method with other techniques for reducing ambiguity and achieve more than 90% monolingual effectiveness. Finally, we compare the co-occurrence method with parallel corpus and machine translation techniques and show that good retrieval effectiveness can be achieved without complex resources.
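
   The co-occurrence idea can be sketched as follows (the cooccur function stands for an unlinked-corpus association score such as EMIM; it and the candidate dictionary are assumptions, not the authors' code):

      def pick_translations(query_terms, candidates, cooccur):
          """For each query term, keep the candidate translation best
          supported by co-occurrence with the other terms' candidates."""
          chosen = {}
          for term in query_terms:
              others = [t for t in query_terms if t != term]
              def support(cand):
                  return sum(max((cooccur(cand, c) for c in candidates[o]),
                                 default=0.0)
                             for o in others)
              chosen[term] = max(candidates[term], key=support)
          return chosen
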
Cross-Language Information Retrieval with the UMLS Metathesaurus BIBAPDF 72-80
  David Eichmann; Miguel E. Ruiz; Padmini Srinivasan
We investigate an automatic method for Cross Language Information Retrieval (CLIR) that utilizes the multilingual UMLS Metathesaurus to translate Spanish and French natural language queries into English. Two experiments are presented using OHSUMED, a subset of MEDLINE. Both experiments examine retrieval effectiveness of the translated queries. However, in the second experiment, the query translation procedure is augmented with digram based vocabulary normalization procedures. In this comparative study of retrieval effectiveness the measures used are: 11-point-average precision score (11-AvgP); average interpolated precision at recall of 0.1; and noninterpolated (i.e., exact) precision after 10 retrieved documents. Our results indicate that for Spanish the UMLS Metathesaurus based CLIR method appears equivalent to multilingual dictionary based approaches investigated in the current literature. French yields less favorable results and our analysis suggests that linguistic differences may have caused the performance differences.

Categorisation

Using a Generalized Instance Set for Automatic Text Categorization BIBAPDF 81-89
  Wai Lam; Chao Yang Ho
We investigate several recent approaches for text categorization under the framework of similarity-based learning. They include two families of text categorization techniques, namely the k-nearest neighbor (k-NN) algorithm and linear classifiers. After identifying the weakness and strength of each technique, we propose a new technique known as the generalized instance set (GIS) algorithm by unifying the strengths of k-NN and linear classifiers and adapting to characteristics of text categorization problems. We also explore some variants of our GIS approach. We have implemented our GIS algorithm, the ExpNet algorithm, and some linear classifiers. Extensive experiments have been conducted on two common document corpora, namely the OHSUMED collection and the Reuters-21578 collection. The results show that our new approach outperforms the latest k-NN approach and linear classifiers in all experiments.
Automatic Essay Grading using Text Categorization Techniques BIBAPDF 90-95
  Leah S. Larkey
Several standard text-categorization techniques were applied to the problem of automated essay grading. Bayesian independence classifiers and k-nearest-neighbor classifiers were trained to assign scores to manually-graded essays. These scores were combined with several other summary text measures using linear regression. The classifiers and regression equations were then applied to a new set of essays. The classifiers worked very well. The agreement between the automated grader and the final manual grade was as good as the agreement between human graders.
Distributional Clustering of Words for Text Classification BIBAPDF 96-103
  L. Douglas Baker; Andrew Kachites McCallum
This paper describes the application of Distributional Clustering to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy.
   Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy -- significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.
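
   One agglomerative step of the approach might look like this (a sketch using symmetric KL divergence to the pair average; the paper's criterion additionally weights words by frequency):

      import math

      def kl(p, q):
          """KL divergence between two class distributions P(class | word)."""
          return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

      def closest_pair(word_dists):
          """Return the two words whose class-label distributions are most
          alike; merging them compresses the feature space by one."""
          words = list(word_dists)
          best = None
          for i in range(len(words)):
              for j in range(i + 1, len(words)):
                  p, q = word_dists[words[i]], word_dists[words[j]]
                  avg = [(a + b) / 2 for a, b in zip(p, q)]
                  d = kl(p, avg) + kl(q, avg)
                  if best is None or d < best[0]:
                      best = (d, words[i], words[j])
          return best[1], best[2]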

Distributed Retrieval

Improved Algorithms for Topic Distillation in a Hyperlinked Environment BIBAPDF 104-111
  Krishna Bharat; Monika R. Henzinger
This paper addresses the problem of topic distillation on the World Wide Web: given a typical user query, find quality documents related to the query topic. Connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. The essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. We identify three problems with the existing approach and devise algorithms to tackle them. The results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis.
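
   In the spirit of the approach (connectivity analysis augmented by content), a hub/authority iteration with relevance-weighted edges; the paper's actual weighting and pruning schemes differ:

      def weighted_hits(pages, out_links, relevance, iterations=20):
          """Kleinberg-style HITS where each page's contribution is scaled
          by a query-relevance score in [0, 1]. out_links maps a page to
          the pages it points to."""
          hub = {p: 1.0 for p in pages}
          auth = {p: 1.0 for p in pages}
          for _ in range(iterations):
              for p in pages:   # authority: hub mass flowing in
                  auth[p] = sum(hub[q] * relevance[q]
                                for q in pages if p in out_links.get(q, ()))
              for p in pages:   # hub: authority mass flowing out
                  hub[p] = sum(auth[q] * relevance[q]
                               for q in out_links.get(p, ()))
              for scores in (hub, auth):   # normalize to keep scores bounded
                  norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                  for p in scores:
                      scores[p] /= norm
          return auth, hub
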
Effective Retrieval with Distributed Collections BIBAPDF 112-120
  Jinxi Xu; Jamie Callan
This paper evaluates the retrieval effectiveness of distributed information retrieval systems in realistic environments. We find that when a large number of collections are available, the retrieval effectiveness is significantly worse than that of centralized systems, mainly because typical queries are not adequate for the purpose of choosing the right collections. We propose two techniques to address the problem. One is to use phrase information in the collection selection index and the other is query expansion. Both techniques enhance the discriminatory power of typical queries for choosing the right collections and hence significantly improve retrieval results. Query expansion, in particular, brings the effectiveness of searching a large set of distributed collections close to that of searching a centralized collection.
Evaluating Database Selection Techniques: A Testbed and Experiment BIBAPDF 121-129
  James C. French; Allison L. Powell; Charles L. Viles; Travis Emmitt; Kevin J. Prey
We describe a testbed for database selection techniques and an experiment conducted using this testbed. The testbed is a decomposition of the TREC/TIPSTER data that allows analysis of the data along multiple dimensions, including collection-based and temporal-based analysis. We characterize the subcollections in this testbed in terms of number of documents, queries against which the documents have been evaluated for relevance, and distribution of relevant documents. We then present initial results from a study conducted using this testbed that examines the effectiveness of the gGlOSS approach to database selection. The databases from our testbed were ranked using the gGlOSS techniques and compared to the gGlOSS Ideal(l) baseline and a baseline derived from TREC relevance judgements. We have examined the degree to which several gGlOSS estimate functions approximate these baselines. Our initial results suggest that the gGlOSS estimators are excellent predictors of the Ideal(l) ranks but that the Ideal(l) ranks do not estimate relevance-based ranks well.

Using Structure

The Impact of Query Structure and Query Expansion on Retrieval Performance BIBAPDF 130-137
  Jaana Kekäläinen; Kalervo Järvelin
The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (INQUERY). Query structure means the use of operators to express the relations between search keys. Eight different structures were tested, representing weak structures (averages and weighted averages of the weights of the keys) and strong structures (e.g., queries with more elaborated search key relations). QE was based on concepts, which were first selected from a conceptual model, and then expanded by semantic relationships given in the model. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with one of the strong structures at the largest expansion level.
A Flexible Model for Retrieval of SGML Documents BIBAPDF 138-145
  Sung Hyon Myaeng; Dong-Hyun Jang; Mun-Seok Kim; Zong-Cheol Zhoo
In traditional information retrieval (IR) systems, a document as a whole is the target for a query. With increasing interest in structured documents like SGML documents, there is a growing need to build an IR system that can retrieve parts of documents, which satisfy not only content-based but also structure-based requirements. In this paper, we describe an inference-net-based approach to this problem. The model is capable of retrieving elements at any level in a principled way, satisfying certain containment constraints in a query. Moreover, while the model is general enough to reproduce the ranking strategy adopted by conventional document retrieval systems by making use of document and collection level statistics such as TF and IDF, its flexibility allows for incorporation of a variety of pragmatic and semantic information associated with document structures. We implemented the model and ran a series of experiments to show that, in addition to the added functionality, the use of the structural information embedded in SGML documents can improve the effectiveness of document retrieval, compared to the case where no such information is used. We also show that giving a pragmatic preference to a certain element type of the SGML documents can enhance retrieval effectiveness.
Discovering Typical Structures of Documents: A Road Map Approach BIBAPDF 146-154
  Ke Wang; Huiqing Liu
The structure of a document refers to the role and hierarchy of subdocument references. Many on-line documents are similarly structured, though not identically structured. We study the problem of discovering "typical" structures of a collection of such documents, where the user specifies the minimum frequency of a typical structure. We consider structural features of subdocument references such as labeling, nesting, ordering, cyclicity, and wild-card references, like those found on the Web and in digital libraries. Typical structures can serve the following purposes: (a) a table of contents for gaining general information about a source; (b) a road map for browsing and querying a source; (c) a basis for clustering documents; (d) partial schemas for building structured layers that provide standard database access methods; (e) a picture of users' and customers' interests and browsing patterns. We present a solution to the discovery problem.

Interactive Retrieval

A Cognitive Model for Searching for Ill-Defined Targets on the Web: The Relationship between Search Strategies and User Satisfaction BIBAPDF 155-163
  Mari Saito; Kazunori Ohmura
We performed an experiment in which subjects were asked to find a gift for a friend using the Web. The subjects were divided into two groups, only one of which was instructed to form a mental image prior to the search. We found that subjects who formed a prior mental image took longer to complete their search and also reported higher degrees of satisfaction with their search result. We present a cognitive model in which the search takes place as an incremental refinement of search target candidates and their attributes.
Comparing Interactive Information Retrieval Systems Across Sites: The TREC-6 Interactive Track Matrix Experiment BIBAPDF 164-172
  Eric Lagergren; Paul Over
This is a case study in the design and analysis of a 9-site TREC-6 experiment aimed at comparing the performance of 12 interactive information retrieval (IR) systems on a shared problem: a question-answering task, 6 statements of information need, and a collection of 210,158 articles from the Financial Times of London 1991-1994.
   The study discusses the application of experimental design principles and the use of a shared control IR system in addressing the problems of comparing experimental interactive IR systems across sites: isolating the effects of topics, human searchers, and other site-specific factors within an affordable design.
   The results confirm the dominance of the topic effect, show the searcher effect is almost as often absent as present, and indicate that for several sites the 2-factor interactions are negligible. An analysis of variance found the system effect to be significant, but a multiple comparisons test found no significant pairwise differences.
Aspect Windows, 3-D Visualizations, and Indirect Comparisons of Information Retrieval Systems BIBAPDF 173-181
  Russell C. Swan; James Allan
We built two Information Retrieval systems that were targeted for the TREC-6 "aspect oriented" retrieval track. The systems were built to test the usefulness of different visualizations in an interactive IR setting -- in particular, an "aspect window" for the chosen task, and a 3-D visualization of document inter-relationships. We studied 24 users of the system in order to investigate: whether the systems were more effective than a control system, whether experienced users outperformed novices, whether spatial reasoning ability was a good predictor of effective use of 3-D, and whether the systems could be compared indirectly via a control system. Our results show that substantial differences in user performance are related to spatial reasoning ability and, to a lesser degree, to other traits. We also obtained markedly different results from the direct and indirect comparisons.

Combination of Evidence

Modeling and Combining Evidence Provided by Document Relationships Using Probabilistic Argumentation Systems BIBAPDF 182-189
  Justin Picard
Previous research has shown that hypertext links may be a useful source of evidence for document contents and relevance, but this evidence is rather difficult to represent and combine. This paper introduces a new and sound model for representing and handling hypertext links in IR. This model is based on "Probabilistic Argumentation Systems", a technique which combines logic and probability to handle uncertainty. Some experiments show the potential of this approach, which requires no ad-hoc parameters.
Predicting the Performance of Linearly Combined IR Systems BIBAPDF 190-196
  Christopher C. Vogt; Garrison W. Cottrell
We introduce a new technique for analyzing combination models. The technique allows us to make qualitative conclusions about which IR systems should be combined. We achieve this by using a linear regression to accurately (r²=0.98) predict the performance of the combined system based on quantitative measurements of individual component systems taken from TREC5. When applied to a linear model (weighted sum of relevance scores), the technique supports several previously suggested hypotheses: one should maximize both the individual systems' performances and the overlap of relevant documents between systems, while minimizing the overlap of nonrelevant documents. It also suggests new conclusions: both systems should distribute scores similarly, but not rank relevant documents similarly. It furthermore suggests that the linear model is only able to exploit a fraction of the benefit possible from combination. The technique is general in nature and capable of pointing out the strengths and weaknesses of any given combination approach.
Experiments in Japanese Text Retrieval and Routing using the NEAT System BIBAPDF 197-205
  Gareth J. F. Jones; Tetsuya Sakai; Masahiro Kajiura; Kazuo Sumita
This paper describes a structured investigation into the retrieval of Japanese text. The study includes a comparison of different indexing strategies for documents and queries, investigation of term weighting strategies principally derived for use with English texts, and the application of relevance feedback for query expansion. Results on the standard BMIR-J1 and BMIR-J2 Japanese retrieval collections indicate that term weighting transfers well to Japanese text. Indexing using dictionary based morphological analysis and character strings are both shown to be individually effective, but marginally better in combination. We also demonstrate that relevance feedback can be used effectively for query expansion in Japanese routing applications.

Query and Profile Modification

Improving Automatic Query Expansion BIBAPDF 206-214
  Mandar Mitra; Amit Singhal; Chris Buckley
Most casual users of IR systems type short queries. Recent research has shown that adding new words to these queries via adhoc feedback improves the retrieval effectiveness of such queries. We investigate ways to improve this query expansion process by refining the set of documents used in feedback. We start by using manually formulated Boolean filters along with proximity constraints. Our approach is similar to the one proposed by Hearst [HEARST96]. Next, we investigate a completely automatic method that makes use of term cooccurrence information to estimate word correlation. Experimental results show that refining the set of documents used in query expansion often prevents the query drift caused by blind expansion and yields substantial improvements in retrieval effectiveness, both in terms of average precision and precision in the top twenty documents. More importantly, the fully automatic approach developed in this study performs competitively with the best manual approach and requires little computational overhead.
Boosting and Rocchio Applied to Text Filtering BIBAPDF 215-223
  Robert E. Schapire; Yoram Singer; Amit Singhal
We discuss two learning algorithms for text filtering: modified Rocchio and a boosting algorithm called AdaBoost. We show how both algorithms can be adapted to maximize any general utility matrix that associates cost (or gain) for each pair of machine prediction and correct label. We first show that AdaBoost significantly outperforms another highly effective text filtering algorithm. We then compare AdaBoost and Rocchio over three large text filtering tasks. Overall both algorithms are comparable and are quite effective. AdaBoost produces better classifiers than Rocchio when the training collection contains a very large number of relevant documents. However, on these tasks, Rocchio runs much faster than AdaBoost.
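
   A minimal Rocchio filter for reference (conventional beta/gamma defaults, not the paper's tuned values; the adaptation to a general utility matrix is omitted here):

      from collections import Counter

      def rocchio_profile(relevant, nonrelevant, beta=0.75, gamma=0.25):
          """Profile = weighted centroid of relevant docs minus weighted
          centroid of non-relevant docs (docs are whitespace-tokenized)."""
          profile = Counter()
          for doc in relevant:
              for term, freq in Counter(doc.split()).items():
                  profile[term] += beta * freq / len(relevant)
          for doc in nonrelevant:
              for term, freq in Counter(doc.split()).items():
                  profile[term] -= gamma * freq / len(nonrelevant)
          return profile

      def score(profile, doc):
          tf = Counter(doc.split())
          return sum(profile[t] * tf[t] for t in tf)
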
Learning While Filtering Documents BIBAPDF 224-231
  Jamie Callan
This paper examines the problems of learning queries and dissemination thresholds from relevance feedback in a dynamic information filtering environment. It revisits the EG algorithm for learning queries, identifying several problems in using it reliably for information filtering, and providing solutions. It also presents a new algorithm for learning dissemination thresholds automatically, from the same relevance feedback information used to learn queries.

Information Retrieval Experiments

Spatial Querying for Image Retrieval: A User-Oriented Evaluation BIBAPDF 232-240
  Joemon M. Jose; Jonathan Furner; David J. Harper
Epic is an image retrieval system that implements a novel spatial-querying mechanism. A user-centred, task-oriented, comparative evaluation of Epic was undertaken in which two versions of the system -- one set up to enable spatial queries only, the other allowing textual queries only -- were compared. Use was made of the two systems by design professionals in simulated work task situations, and quantitative and qualitative data collected as indicators of the levels of users' satisfaction. Results demonstrated that users often had a 'mental image' of a potentially satisfying picture in mind, that they were happy to express this need in visual terms, and that in doing so they preferred to have access to Epic's spatial-querying facility. Success in obtaining statistically significant results appears to support validation of the novel methodological framework adopted.
Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach BIBAPDF 241-249
  Shian-Hua Lin; Chi-Sheng Shih; Meng Chang Chen; Jan-Ming Ho; Ming-Tat Ko; Yueh-Ming Huang
In this paper, we present a system that extracts and generalizes terms from Internet documents to represent classification knowledge of a given class hierarchy. We propose a measurement to evaluate the importance of a term with respect to a class in the class hierarchy, and denote it as support. With a given threshold, terms with high support are selected as keywords of a class, and terms with low support are filtered out. To further enhance the recall of this approach, the Mining Association Rules technique is applied to mine the associations between terms. An inference model is composed of these association relations and the previously computed supports of the terms in the class. To increase the recall rate of the keyword selection process, we then present a polynomial-time inference algorithm that promotes a term strongly associated with a known keyword to a keyword. Our experimental results on Internet documents collected from the Yam search engine show that the proposed methods help refine the classification knowledge and increase the recall of keyword selection.
Improving Two-Stage Ad-Hoc Retrieval for Short Queries BIBAPDF 250-256
  K. L. Kwok; M. Chan
Short queries in an ad-hoc retrieval environment are difficult but unavoidable. We present several methods to try to improve our current strategy of 2-stage pseudo-relevance feedback retrieval in such a situation. They are: 1) avtf query term weighting, 2) variable high frequency Zipfian threshold, 3) collection enrichment, 4) enhancing term variety in raw queries, and 5) using retrieved document local term statistics. Avtf employs collection statistics to weight terms in short queries. Variable high frequency threshold defines and ignores statistical stopwords based on query length. Collection enrichment adds other collections to the one under investigation so as to improve the chance of ranking more relevant documents in the top n for the pseudo-feedback process. Enhancing term variety in raw queries tries to find highly associated terms in a set of documents that is domain-related to the query; making the query longer may improve 1st stage retrieval. Finally, retrieved-document local statistics re-weight terms in the 2nd stage using the set of domain-related documents rather than the whole collection as used during the initial stage. Experiments were performed using the TREC 5 and 6 environment. It is found that together these methods perform well for the difficult TREC-5 topics, and also work for the TREC-6 very short topics.

Retrieval Models

DOLORES: A System for Logic-Based Retrieval of Multimedia Objects BIBAPDF 257-265
  Norbert Fuhr; Norbert Gövert; Thomas Rölleke
We describe the design and implementation of a system for logic-based multimedia retrieval. As high-level logic for retrieval of hypermedia documents, we have developed a probabilistic object-oriented logic (POOL) which supports aggregated objects, different kinds of propositions (terms, classifications and attributes) and even rules as being contained in objects. Based on a probabilistic four-valued logic, POOL uses an implicit open world assumption, allows for closed world assumptions and is able to deal with inconsistent knowledge. POOL programs and queries are translated into probabilistic Datalog programs which can be interpreted by the HySpirit inference engine. For storing the multimedia data, we have developed a new basic IR engine which yields physical data abstraction. The overall architecture and the flexibility of each layer supports logic-based methods for multimedia information retrieval.
RELIEF: Combining Expressiveness and Rapidity into a Single System BIBAPDF 266-274
  Iadh Ounis; Marius Pasca
This paper constitutes a proposal for an efficient and effective logical information retrieval system. Following a relational indexing approach, which is in our opinion a necessity to cope with emerging applications such as those based on multimedia, we use the conceptual graphs formalism as our indexing language. This choice allows for relational indexing support and captures all the useful properties of the logical information retrieval model, in a workable system. First order logic and standard information retrieval techniques are combined to the same effect: obtaining an expressive system, able to accurately handle complex documents, improve retrieval effectiveness, and achieve good time performance. Experiments on an image test collection, within a system available on the Web, provide an illustration of the role that logic may have in the future development of information retrieval systems.
A Language Modeling Approach to Information Retrieval BIBAPDF 275-281
  Jay M. Ponte; W. Bruce Croft
Models of document indexing and document retrieval have been extensively studied. The integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. We argue that much of the reason for this is the lack of an adequate indexing model. This suggests that perhaps a better indexing model would help solve the problem. However, we feel that making unwarranted parametric assumptions will not lead to better retrieval performance. Furthermore, making prior assumptions about the similarity of documents is not warranted either. Instead, we propose an approach to retrieval based on probabilistic language modeling. We estimate models for each document individually. Our approach to modeling is non-parametric and integrates document indexing and document retrieval into a single model. One advantage of our approach is that collection statistics which are used heuristically in many other retrieval models are an integral part of our model. We have implemented our model and tested it empirically. Our approach significantly outperforms standard tf*idf weighting on two different collections and query sets.
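
   The now-familiar shape of the approach, as a sketch (shown with Jelinek-Mercer smoothing for brevity; the paper's estimator is non-parametric and smooths differently):

      import math
      from collections import Counter

      def query_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
          """Rank documents by log P(query | document language model),
          mixing the document MLE with the collection model."""
          doc_tf = Counter(doc.split())
          doc_len = sum(doc_tf.values())
          score = 0.0
          for term in query.split():
              p_doc = doc_tf[term] / doc_len if doc_len else 0.0
              p_coll = coll_tf.get(term, 0) / coll_len
              p = lam * p_doc + (1 - lam) * p_coll
              if p > 0:
                  score += math.log(p)
          return score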

Efficiency

Efficient Construction of Large Test Collections BIBAPDF 282-289
  Gordon V. Cormack; Christopher R. Palmer; Charles L. A. Clarke
Test collections with a million or more documents are needed for the evaluation of modern information retrieval systems. Yet their construction requires a great deal of effort. Judgements must be rendered as to whether or not documents are relevant to each of a set of queries. Exhaustive judging, in which every document is examined and a judgement rendered, is infeasible for collections of this size. Current practice is represented by the "pooling method", as used in the TREC conference series, in which only the first k documents from each of a number of sources are judged. We propose two methods, Interactive Searching and Judging and Move-to-Front Pooling, that yield effective test collections while requiring many fewer judgements. Interactive Searching and Judging selects documents to be judged using an interactive search system, and may be used by a small research team to develop an effective test collection using minimal resources. Move-to-Front Pooling directly improves on the standard pooling method by using a variable number of documents from each source depending on its retrieval performance. Move-to-Front Pooling would be an appropriate replacement for the standard pooling method in future collection development efforts involving many independent groups.
Compressed Inverted Files with Reduced Decoding Overheads BIBAPDF 290-297
  Anh Ngoc Vo; Alistair Moffat
Compressed inverted files are the most compact way of indexing large text databases, typically occupying around 10% of the space of the collection they index. The drawback of compression is the need to decompress the index lists during query processing. Here we describe an improved implementation of compressed inverted lists that eliminates almost all redundant decoding and allows extremely fast processing of conjunctive Boolean queries and ranked queries. We also describe a pruning method to reduce the number of candidate documents considered during the evaluation of ranked queries. Experimental results with a database of 510 Mb show that the new mechanism can reduce the CPU and elapsed time for Boolean queries of 4-10 terms to one tenth and one fifth respectively of the standard technique. For ranked queries, the new mechanism reduces both CPU and elapsed time to one third and memory usage to less than one tenth of the standard algorithm, with no degradation in retrieval effectiveness.
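
   For concreteness, one standard way to store an inverted list compactly is as byte-aligned coded d-gaps (illustrative only; the paper's codes and skipping structures are more elaborate):

      def encode_postings(doc_ids):
          """Variable-byte code the gaps between ascending document ids."""
          out, prev = bytearray(), 0
          for doc_id in doc_ids:
              gap, prev = doc_id - prev, doc_id
              while gap >= 128:
                  out.append(gap & 0x7F)
                  gap >>= 7
              out.append(gap | 0x80)   # high bit marks the last byte of a gap
          return bytes(out)

      def decode_postings(data):
          ids, gap, shift, prev = [], 0, 0, 0
          for byte in data:
              gap |= (byte & 0x7F) << shift
              if byte & 0x80:          # last byte: emit a document id
                  prev += gap
                  ids.append(prev)
                  gap, shift = 0, 0
              else:
                  shift += 7
          return ids
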
Fast Searching on Compressed Text Allowing Errors BIBAPDF 298-306
  Edleno Silva de Moura; Gonzalo Navarro; Nivio Ziviani; Ricardo Baeza-Yates
We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding and the coding alphabet is byte-oriented rather than bit-oriented. We achieve about 30% compression ratio for typical English texts, against 40% and 35% for Compress and Gzip, respectively. Compression times are close to the times of Compress and approximately half the times of Gzip, and decompression times are lower than those of Gzip and one third of those of Compress.
   The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boyer-Moore-type filter. It concomitantly uses the vocabulary of the text available as part of the Huffman coding data. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is twice as fast as running agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than agrep. We also mention the impact of our technique on inverted files pointing to documents or logical blocks, as in Glimpse.

Information Retrieval Experiments

How Reliable are the Results of Large-Scale Information Retrieval Experiments? BIBAPDF 307-314
  Justin Zobel
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, to determine whether they lead to measurements that are trustworthy and fair. Our detailed empirical investigation of the TREC results shows that the measured relative performance of systems appears to be reliable, but that recall is overestimated: it is likely that many relevant documents have not been found. We propose a new pooling strategy that can significantly increase the number of relevant documents found for given effort, without compromising fairness.
Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness BIBAPDF 315-323
  Ellen M. Voorhees
Test collections have traditionally been used by information retrieval researchers to improve their retrieval strategies. To be viable as a laboratory tool, a collection must reliably rank different retrieval variants according to their true effectiveness. In particular, the relative effectiveness of two retrieval strategies should be insensitive to modest changes in the relevant document set since individual relevance assessments are known to vary widely.
   The test collections developed in the TREC workshops have become the collections of choice in the retrieval research community. To verify their reliability, NIST investigated the effect changes in the relevance assessments have on the evaluation of retrieval results. Very high correlations were found among the rankings of systems produced using different relevance judgment sets. The high correlations indicate that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
Measures of Relative Relevance and Ranked Half-Life: Performance Indicators for Interactive IR BIBAPDF 324-331
  Pia Borlund; Peter Ingwersen
This paper introduces the concepts of the relative relevance (RR) measure and a new performance indicator of the positional strength of the retrieved and ranked documents. The former is seen as a measure of associative performance computed by the application of the Jaccard formula. The latter is named the Ranked Half-Life (RHL) indicator and denotes the degree to which relevant documents are located on the top of a ranked retrieval result. The measures are proposed to be applied in addition to the traditional performance parameters such as precision and/or recall in connection with evaluation of interactive IR systems. The RR measure describes the degree of agreement between the types of relevance applied in evaluation of information retrieval (IR) systems in a non-binary assessment context. It is shown that the measure has potential to bridge the gap between subjective and objective relevance, as it makes it possible to understand and interpret the relation between these two main classes of relevance used in interactive IR experiments. The relevance concepts are defined, and the application of the measures is demonstrated by interrelating three types of relevance assessments: algorithmic, intellectual topicality, and situational assessments. Further, the paper shows that for a given set of queries at given precision levels the RHL indicator adds to the understanding of comparisons of IR performance.
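
   A compact reading of the two measures (binary relevance is assumed here; the paper also treats graded, non-binary assessments):

      def relative_relevance(judged_a, judged_b):
          """RR: Jaccard agreement between two sets of documents judged
          relevant under two different relevance types."""
          a, b = set(judged_a), set(judged_b)
          return len(a & b) / len(a | b) if a | b else 0.0

      def ranked_half_life(ranking, relevant):
          """RHL read as the median position of relevance: the rank by
          which half of the relevant documents have appeared."""
          total = sum(1 for d in ranking if d in relevant)
          if total == 0:
              return None
          seen = 0
          for rank, d in enumerate(ranking, start=1):
              if d in relevant:
                  seen += 1
                  if 2 * seen >= total:
                      return rank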

Panel

Tools for Searching the Web BIBAPDF 332
  Donna Harman; Paul Over
Each day significant numbers of people worldwide have their first experience of interactive information retrieval using the technologies of the World Wide Web on the Internet, and many others return to try again. Surfing the web can be great entertainment, and often even a source of useful information. Certainly the ability to connect to known sites, such as company home pages, weather reports, etc., can be extremely helpful. But attempting to actually search the web for a specific piece of information can often be very frustrating.
   Some of the problems encountered in searching the web are the usual bane of search engines dependent on matching document and query terms. Others are usability problems common to many interactive applications. But still others are unique to searching on the Internet or are magnified by its scope and heterogeneity. Such problems include dependence on cross-cultural knowledge (language, naming conventions, social structure/institutions), heavy use of data not directly accessible to conventional webcrawlers (multimedia, maps, tables), and simply the enormous scale of the operational net (huge result sets, "worldwide wait").
   The panel will attempt to provide insight into some of these web search problems and examine what types of additional tools might be useful in a web environment to aid users in searching. It will do this by examining the results of a "web treasure hunt".

Posters

Modern Classical Document Indexing: A Linguistic Contribution to Knowledge-Based IR BIBAPDF 333-334
  Bas van Bakel
This poster presents Condorcet, a domain-specific prototype indexing system for tens of thousands of documents covering two scientific domains: engineering ceramics and epilepsy. The development corpus consists of 300 documents taken from machine-readable one year volumes of two scientific journals: the 1988 volume of Excerpta Medica from Elsevier Science Publishers, and the 1990 volume of Engineered Materials Abstracts from Materials Information. The Condorcet research project is funded by the Dutch Technology Foundation.
   Condorcet takes a controlled-term approach to indexing: a document is indexed by mapping its title and abstract to concepts and relations from a given domain, defined in modern versions of classical thesauri, i.e. structured ontologies. Concepts rather than natural language terms are used as indexes. The claim is that using structured concepts as index terms will lead to higher precision, as they are language-independent and unambiguous. For example, simple concepts like aspirin and headache point to all documents in which these two concepts are mentioned. However, by using structured concepts we can distinguish documents discussing aspirin as a cause of headache (causes[aspirin,headache]) from documents on aspirin as a cure for headache (cures[aspirin,headache]). Searching for the former group will exclude the latter, which is considerably more difficult -- if not impossible -- when documents are indexed with simple, unstructured concepts.
   The index process makes intensive use of linguistic knowledge, i.e. Chomsky's Government & Binding theory. The poster will focus on the linguistic principles that form the conceptual basis of the index process.
The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries BIBAPDF 335-336
  Jaime Carbonell; Jade Goldstein
This paper presents a method for combining query-relevance with information-novelty in the context of text retrieval and summarization. The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy while maintaining query relevance in re-ranking retrieved documents and in selecting appropriate passages for text summarization. Preliminary results indicate some benefits for MMR diversity ranking in document retrieval and in single document summarization. The latter are borne out by the recent results of the SUMMAC conference in the evaluation of summarization systems. However, the clearest advantage is demonstrated in constructing non-redundant multi-document summaries, where MMR results are clearly superior to non-MMR passage selection.
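
   The MMR criterion itself is compact enough to state directly; a sketch, with the two similarity functions and lambda left as inputs:

      def mmr_rerank(query, docs, sim1, sim2, lam=0.7, k=10):
          """Iteratively select the document maximising
          lam * sim1(d, query) - (1 - lam) * max over selected s of sim2(d, s),
          i.e. relevant to the query but novel w.r.t. what is already chosen."""
          selected, remaining = [], list(docs)
          while remaining and len(selected) < k:
              def marginal(d):
                  redundancy = max((sim2(d, s) for s in selected), default=0.0)
                  return lam * sim1(d, query) - (1 - lam) * redundancy
              best = max(remaining, key=marginal)
              selected.append(best)
              remaining.remove(best)
          return selected
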
A Method for Scoring Correlated Features in Query Expansion BIBAPDF 337-338
  Martin Franz; Salim Roukos
In this poster we describe experiments in information retrieval using a new method for scoring correlated features. This method uses information about word co-occurrences in the documents ranking high after the initial scoring to reduce combined scores of correlated words. We have experimented with this technique in conjunction with both simple Okapi scoring and a query expansion method using a probabilistic model, improving system performance in the context of TREC standardized tasks.
Using Maps as a User Interface to a Digital Library BIBAPDF 339-340
  Mountaz Hascoët; Xavier Soinard
The Nemo project aims to develop a graphical interface contributing to fast navigation and exploration in the Digital Library of EDF. The digital library of EDF contains collections of documents in SGML-HyTime format, containing information on people, activities, projects, scientific work, etc. These documents are currently accessed through a traditional Boolean query interface, and the aim of the Nemo project is to provide customizable graphical maps of the documents resulting from a query. The originality of the maps lies in the fact that (1) they are computed dynamically from the characteristics of the data retrieved by a query, (2) the mapping of graphical attributes to data characteristics is highly and easily customizable, and (3) the map layout accounts for a perspective (or classification system) that is also customizable.
Comparison between Proximity Operation and Dependency Operation in Japanese Full-Text Retrieval BIBAPDF 341-342
  Yasuaki Hyoudo; Kazuhiko Niimi; Takashi Ikeda
In this paper we propose a full-text retrieval method for Japanese documents that uses dependency relations between words, and evaluate it against proximity operations. Proximity relations have been used as an approximation to syntactic or semantic relations because accurate syntactic or semantic analysis remains a high-cost task for a computer. We have developed a method of skeletal syntactic analysis for Japanese and apply it to full-text retrieval. The index size, response time and retrieval accuracy figures from our experiment show the effectiveness of using dependency relations.
Term-Ordered Query Evaluation versus Document-Ordered Query Evaluation for Large Document Databases BIBAPDF 343-344
  Marcin Kaszkiel; Justin Zobel
There are two main families of technique for efficient processing of ranked queries on large text collections: document-ordered processing and term-ordered processing. In this note we compare these techniques experimentally. We show that they have similar costs for short queries, but that for long queries document-ordered processing is much more costly. Overall, we conclude that term-ordered processing, with the refinements of limited accumulators and hierarchical index structuring, is the more efficient mechanism.
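
   The term-ordered side of the comparison, with the limited-accumulators refinement sketched (the admission policy here is a simple cap; the note's variants are more refined):

      def term_ordered_rank(query_terms, index, max_accumulators=None):
          """Term-at-a-time evaluation: process one inverted list at a time,
          summing weights into per-document accumulators. `index` maps a
          term to a list of (doc_id, weight) postings."""
          acc = {}
          for term in query_terms:
              for doc_id, weight in index.get(term, ()):
                  if doc_id in acc:
                      acc[doc_id] += weight
                  elif max_accumulators is None or len(acc) < max_accumulators:
                      acc[doc_id] = weight
                  # else: table full; this document never gets an accumulator
          return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)
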
Lessons from BMIR-J2: A Test Collection for Japanese IR Systems BIBAPDF 345-346
  Tsuyoshi Kitani; Yasushi Ogawa; Tetsuya Ishikawa; Haruo Kimoto; Ikuo Keshi; Jun Toyoura; Toshikazu Fukushima; Kunio Matsui; Yoshihiro Ueda; Tetsuya Sakai; Takenobu Tokunaga; Hiroshi Tsuruoka; Hidekazu Nakawatase; Teru Agata
BMIR-J2 is the first complete Japanese test collection available for use in evaluating information retrieval systems. It contains sixty queries and the IDs of 5080 newspaper articles in the fields of economics and engineering. The queries are classified into five categories, based on the functions the system is likely to use to interpret them correctly and retrieve relevant texts. This collection has two levels of relevance, topically relevant and partially relevant. Also discussed are design issues such as collection types and size. This collection and the principles derived in designing it should be helpful in the future development of new test collections.
Automatically Locating, Extracting and Analyzing Tabular Data BIBAPDF 347-348
  William Kornfeld; John Wattecamps
Information retrieval of ASCII documents generally refers to retrieval based on linear patterns (1-dimensional array of symbols) found in the source documents. This paper describes a 2-dimensional IR application, that of recognizing and extracting tabular data, in this case financial tables. Tables are located and extracted using a version of the LR(k) parsing algorithm adapted for this purpose. Because of sloppiness in the construction of tables, somewhat less than 100% of the tables can be retrieved completely automatically; a method has been found to integrate the parsing algorithm into a user interface analogous to a programming language debugger that allows operators to quickly correct defects in the source document, allowing the parsing to complete successfully. This paper describes a deployed application currently in commercial use.
Using Global Colour Features for General Photographic Image Indexing and Retrieval BIBAPDF 349-350
  Ting-Sheng Lai; John Tait
At the present time there is a growing demand for the development of content-based image retrieval. This paper attempts to address the problems of indexing and retrieval of general photographic images. An indexing and retrieval scheme is introduced which models human colour perception. An initial evaluation of an implementation of the scheme reveals reasonable classification effectiveness, and a refinement method is proposed to overcome the problem of intrinsic colour appearance encountered with the implementation.
Automatic Acquisition of Phrasal Knowledge for English-Chinese Bilingual Information Retrieval BIBAPDF 351-352
  Ming-Jer Lee; Lee-Feng Chien
Extraction of phrasal knowledge, such as proper names, domain-specific keyphrases and lexical templates, from a domain-specific text collection is significant for developing effective information retrieval systems for the Internet. In this paper, we introduce our ongoing research on automatic phrasal knowledge acquisition for English-Chinese bilingual texts. The underlying techniques consist of adaptive keyphrase extraction, lexical template extraction, phrase translation extraction and high-order Markov language model construction. In addition to increasing retrieval effectiveness, IR systems based on these techniques are expected to perform much better in many respects, such as automatic term suggestion, information filtering, text classification and cross-language information retrieval.
Visual Interactions with a Multidimensional Ranked List BIBAPDF 353-354
  Anton Leouski; James Allan
Performance analysis of an interactive visualization system generally requires an extensive user study, a method that is very expensive and often yields inconclusive results. To do a successful user study, the researcher has to be well aware of the system's possibilities. In this paper we present a different kind of analysis. We show how system behavior and performance can be investigated off-line, without direct user intervention. We introduce an evaluation measure to assess the quality of a multidimensional visualization. Next, we suggest two methods for dealing with user feedback. We also discuss the effect the dimensionality of the visualization has on actual performance.
Predicting Query Times BIBAPDF 355-356
  Rodger McNab; Yong Wang; Ian H. Witten; Carl Gutwin
We outline the need for search engines to provide user feedback on the expected time for a query, describe a scheme for learning a model of query time by observing sample queries, and discuss the results obtained for a set of actual user queries on a document collection using the MG search engine.
The WebCluster Project: Using Clustering for Mediating Access to the World Wide Web BIBAPDF 357-358
  Mourad Mechkour; David J. Harper; Gheorghe Muresan
We present in this poster the WebCluster project, in which we propose and implement an innovative approach to improving the effectiveness of information retrieval on the World Wide Web. Our approach is based on combining mediated access to the WWW with document clustering. In this approach we use a source document collection, well structured and specific to the user's domain of interest, as a filter on the WWW. This filter limits the user's scope of view to the documents considered relevant with respect to this document collection. The basic techniques used in this project are document clustering, which allows us to extract the inherent structure of the source collection, and different search strategies (cluster-based search, best-match retrieval, browsing). The software developed is a two-part tool: a clustering framework (CF) that allows users to choose the best clustering method for their document collection, and an end-user interface for accessing the services of this CF (clustering and the different retrieval strategies). WebCluster's final goal is to allow a user or group of users to have their own organization of the WWW and to combine different retrieval strategies (for example cluster-based search or browsing) using the same interface.
Automatic Abstracting of Magazine Articles: The Creation of 'Highlight' Abstracts BIBAPDF 359-360
  Marie-Francine Moens; Jos Dumortier
Our research contributes to the design and development of text grammars for a text-abstracting task. We developed a system that parses texts based upon the text grammar of a specific text type and extracts the sentences and statements that are relevant for inclusion in the abstracts. The system is applied to create "highlight" abstracts of magazine articles. It employs knowledge of discourse patterns that are typical of news stories. Browsing a database of article abstracts is one way to select and buy relevant magazine articles on-line.
Optimizing Recall/Precision Scores in IR over the WWW BIBAPDF 361-362
  Matthew Montebello
The rapid growth of the World Wide Web (WWW) and the massive size of the information corpus available for access reflect the wealth and benefits of this medium. At the same time, this immense pool of information has created an information overload that requires users to turn to techniques and tools in order to take advantage of such a resource and enhance the effectiveness of online information access. Search engines were created to assist users in finding information by employing indexing techniques and suggesting appropriate alternatives to browse. These search engines have inefficiencies and are not focused enough on the needs of individual users, and little has been done to ensure that the information presented is of a high recall and precision standard. 'Recall' measures how effective the system is at retrieving the relevant documents from the WWW, while 'precision' measures the relevance of the retrieved set of documents to the users' requirements.
   We present our experiences with a system we developed to optimize the recall/precision scores. We attempt to achieve this objective by employing a number of search engines and user profiling in tandem.
   Namely, we attempt to optimize:
  • recall, by aggregating hits from major search engines and other previously developed retrieval systems;
  • precision, by generating user profiles and predicting appropriate, focused information for specific users.
   Our system can easily and inexpensively accommodate future generations of web-based retrieval systems and technologies. Our contribution to the IR field is the incorporation of several desirable characteristics from different techniques to optimize and personalize WWW searching.
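   For concreteness, both measures reduce to simple set arithmetic over the retrieved and relevant document sets; a minimal sketch:

     def recall_precision(retrieved, relevant):
         """recall = |retrieved & relevant| / |relevant|;
         precision = |retrieved & relevant| / |retrieved|."""
         hits = len(set(retrieved) & set(relevant))
         recall = hits / len(relevant) if relevant else 0.0
         precision = hits / len(retrieved) if retrieved else 0.0
         return recall, precision

     # e.g. 8 of 10 retrieved pages relevant, 20 relevant pages overall:
     # recall = 8/20 = 0.4, precision = 8/10 = 0.8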
Interactive Multidimensional Document Visualization BIBAPDF 363-364
  Josiane Mothe; Taoufiq Dkaki
Complementary approaches can be used to improve information retrieval efficiency. One of these approaches focuses on the system's inner processes (e.g. document indexing, query-document similarity computation). Another approach improves the user-system interaction, which requires powerful interface design. In this poster we propose an interactive multidimensional document visualization. The selected documents (selected by either a search or a filtering process) are studied according to several of their features. Those features are extracted either using an indexing-type process (the result is a set of phrases) or using information extraction mechanisms (to extract, for example, author names, publication dates, etc.). These elements are crossed in order to discover the strong links that exist between them. Discovered relationships are then displayed in a 4-dimensional graphical view. The interface allows the user to graphically select the elements he is most interested in and to automatically construct new filters that interactively change the set of displayed documents.
Speech Retrieval using Phonemes with Error Correction BIBAPDF 365-366
  Corinna Ng; Justin Zobel
In speech retrieval, text queries are matched to spoken-word recordings. We undertook a series of retrieval experiments on such speech documents, using both phonemes and words for matching text to speech. We found that, given a reliable automatic word-based process for speech recognition, phoneme-based retrieval can be as effective as word-based retrieval. However, retrieval is worse with phoneme-based speech recognition. Error correction techniques, such as manual error correction of phoneme sequences and use of a standard string edit distance, did not improve the effectiveness of phoneme-based retrieval.
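   A minimal sketch of the standard string edit distance referred to above, applied to phoneme sequences represented as lists of phoneme symbols; the unit costs are the textbook choice, not necessarily those used in the paper:

     def edit_distance(a, b):
         """Levenshtein distance with unit insert/delete/substitute costs."""
         prev = list(range(len(b) + 1))
         for i, x in enumerate(a, 1):
             cur = [i]
             for j, y in enumerate(b, 1):
                 cur.append(min(prev[j] + 1,               # delete x
                                cur[j - 1] + 1,            # insert y
                                prev[j - 1] + (x != y)))   # substitute
             prev = cur
         return prev[-1]

     # edit_distance(["f","ow","n","iy","m"], ["f","ow","n","em"]) == 2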
Optimizing Query Evaluation in n-Gram Indexing BIBAPDF 367-368
  Yasushi Ogawa; Toru Matsuda
In the n-gram indexing widely used for Asian languages, a long query word is divided into more than one n-gram, so proximity constraints among the n-grams must be checked; this checking degrades retrieval speed. This paper proposes two optimizations of query evaluation: one selects the combination of n-grams whose processing cost is the minimum among all possible combinations, and the other uses additional n-grams when finding potential documents whose proximity constraints need to be checked.
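   A sketch of the first optimization under an assumed cost model (the cost of an n-gram is its postings-list length), for bigram indexing: dynamic programming picks a set of bigrams that covers every character of the query word at minimum total cost. The paper's actual cost function and index layout may differ.

     def min_cost_bigrams(word, postings_len):
         """postings_len(gram) -> int. Returns (total cost, chosen bigrams)."""
         n = len(word)
         INF = float("inf")
         best = [(INF, [])] * (n + 1)   # best[k]: cover characters 0..k-1
         best[0] = (0, [])
         for i in range(n - 1):         # bigram at i covers positions i, i+1
             gram = word[i:i + 2]
             cost = postings_len(gram)
             for reach in (i, i + 1):   # coverage may already extend this far
                 c, picks = best[reach]
                 if c + cost < best[i + 2][0]:
                     best[i + 2] = (c + cost, picks + [gram])
         return best[n]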
Four Text Classification Algorithms Compared on a Dutch Corpus BIBAPDF 369-370
  Hein Ragas; Cornelis H. A. Koster
In the context of the document routing project DORO we performed an experiment in applying text classification algorithms to Dutch texts. Four well-known learning algorithms were implemented: Rocchio's algorithm, the Simple Bayesian Classifier (SBC), the Sleeping Experts (SE) and Winnow. They were tested on a corpus of articles from the Dutch newspaper NRC, pre-classified into four categories. Using keywords as terms, the algorithms were compared on learning speed and error rate. We also investigated the effect of discarding terms, using either a dynamic stoplist or the Winnow heuristic.
   The Winnow heuristic with 2 steps greatly improved the performance of SBC. Under the circumstances of our experiment, Rocchio and SBC performed better than Winnow, and much better than the Sleeping Experts. Rocchio learns faster than SBC, but it rapidly reaches a performance plateau.
   A combination of Rocchio and SBC should give the best overall performance.
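   A minimal sketch of the Rocchio classifier among those compared, in its usual centroid form; the beta/gamma weights are conventional defaults, not values reported by the authors:

     def mean_vector(docs):
         m = {}
         for d in docs:
             for t, w in d.items():
                 m[t] = m.get(t, 0.0) + w / len(docs)
         return m

     def rocchio_train(docs_by_class, beta=16.0, gamma=4.0):
         """docs_by_class: class -> list of term-weight dicts."""
         profiles = {}
         for c, docs in docs_by_class.items():
             others = [d for c2, ds in docs_by_class.items()
                       if c2 != c for d in ds]
             pos, neg = mean_vector(docs), mean_vector(others)
             profiles[c] = {t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
                            for t in set(pos) | set(neg)}
         return profiles

     def rocchio_classify(profiles, doc):
         return max(profiles, key=lambda c: sum(profiles[c].get(t, 0.0) * w
                                                for t, w in doc.items()))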
Automatic Acquisition of Terminological Relations from a Corpus for Query Expansion BIBAPDF 371-372
  Jean-David Sta
One of the means used for query expansion consists in adding related terms from a thesaurus to the terms of the query. How effective this technique can be depends upon the presence of terminological relations in the thesaurus. The experiment described here assesses the capacity of various statistical measures of association between terms to highlight terminological relations. From a corpus of over 6,000 documents, the pairwise associations between more than 5,000 terms were compared to the 70,000 terminological relations of a thesaurus. The measure based on context likeness proved the most effective.
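   A sketch of two association measures of the kind compared, computed from simple corpus counts: a co-occurrence measure (Dice) and a context-vector similarity in the spirit of the context-likeness measure; the paper's exact formulations may differ.

     import math

     def dice(cooc, freq, t1, t2):
         """cooc: (t1, t2) pairs (stored in a fixed order) -> co-occurrence count."""
         return 2.0 * cooc.get((t1, t2), 0) / (freq[t1] + freq[t2])

     def context_similarity(contexts, t1, t2):
         """Cosine between the two terms' context (co-occurring term) vectors."""
         a, b = contexts[t1], contexts[t2]
         num = sum(w * b.get(t, 0.0) for t, w in a.items())
         den = math.sqrt(sum(w * w for w in a.values())) * \
               math.sqrt(sum(w * w for w in b.values()))
         return num / den if den else 0.0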
Keyword Extraction of Radio News using Term Weighting with an Encyclopedia and Newspaper Articles BIBAPDF 373-374
  Yoshimi Suzuki; Fumiyo Fukumoto; Yoshihiro Sekiguchi
In this paper, we propose a method for keyword extraction from radio news. With our method, the data-sparseness and false-alarm problems are alleviated even for short discourses or documents. Our method is also robust to partial errors in phoneme recognition. The method has two procedures: term weighting and keyword extraction.
   In the term-weighting procedure, a feature vector for each domain is calculated using an encyclopedia and newspaper articles. In the keyword-extraction procedure, keywords are extracted using the feature vectors and the result of domain identification. The results of experiments demonstrate the applicability of the method.
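   A minimal sketch of the two procedures as we read them; the weighting actually derived from the encyclopedia and newspaper articles is not reproduced here, so generic per-domain term weights stand in for it:

     def identify_domain(doc_terms, domain_vectors):
         """domain_vectors: domain -> term-weight dict built per domain."""
         return max(domain_vectors,
                    key=lambda d: sum(domain_vectors[d].get(t, 0.0)
                                      for t in doc_terms))

     def extract_keywords(doc_terms, domain_vectors, k=5):
         vec = domain_vectors[identify_domain(doc_terms, domain_vectors)]
         return sorted(doc_terms, key=lambda t: vec.get(t, 0.0), reverse=True)[:k]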
Efficient Search Server Assignment in a Disproportionate System Environment BIBAPDF 375-376
  Toru Takaki; Tsuyoshi Kitani
An efficient method is proposed for assigning jobs to search servers in cluster-type system environments when search server performance levels differ. The assignment is done by estimating the search time for each query, based on factors such as the number of databases to be retrieved by the query and the performance of the servers. Search processing efficiency is improved by assigning time-consuming jobs to high-performance servers, regardless of the order of request arrival. Experimental results showed a 10 to 50% shortening of the overall response time.
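   A sketch of the assignment policy as described, with an assumed cost estimator: place the most expensive pending job on whichever server would complete it earliest, ignoring arrival order.

     def assign_jobs(jobs, server_speeds):
         """jobs: list of (job_id, estimated_cost);
         server_speeds[i]: relative speed of server i."""
         free_at = [0.0] * len(server_speeds)
         assignment = {}
         for job_id, cost in sorted(jobs, key=lambda j: -j[1]):
             s = min(range(len(server_speeds)),
                     key=lambda i: free_at[i] + cost / server_speeds[i])
             assignment[job_id] = s
             free_at[s] += cost / server_speeds[s]
         return assignment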
Multilingual Keyword Extraction for Term Suggestion BIBAPDF 377-378
  Yuen-Hsien Tseng
An efficient keyword extraction algorithm applicable to documents in any language is presented. Because documents concentrating on a topic tend to mention a set of words in a specific sequence repeatedly, the approach assumes that keywords are repeated string patterns. The proposed algorithm has some distinct features: it requires no extra resources such as lexicons, corpora, or NLP parsers; its time and space complexity are linear in the average case; the threshold, the only parameter in the algorithm, is easily tuned; keywords of any length can be identified; and when used at the character level, single words or word stems can be identified as well as multiple-word phrases. The extracted keywords are suitable for perusal and selection in term suggestion applications. Experimental results for some English and Chinese documents are shown. Applications to bibliographic data of over 320,000 records and the full text of 13,000 news abstracts for term suggestion are presented.
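   A minimal character-level sketch of the repeated-string-pattern idea (our reading, not the paper's exact algorithm): keep substrings repeated at least `threshold` times, dropping those subsumed by an equally frequent longer pattern.

     from collections import Counter

     def repeated_patterns(text, max_len=8, threshold=2):
         counts = Counter(text[i:i + n]
                          for n in range(2, max_len + 1)
                          for i in range(len(text) - n + 1))
         frequent = {s: c for s, c in counts.items() if c >= threshold}
         return [s for s, c in frequent.items()
                 if not any(s in t and c == frequent[t]
                            for t in frequent if len(t) > len(s))]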
Experiments of Collecting WWW Information using Distributed WWW Robots BIBAPDF 379-380
  Hayato Yamana; Kent Tamura; Hiroyuki Kawano; Satoshi Kamei; Masanori Harada; Hideki Nishimura; Isao Asai; Hiroyuki Kusumoto; Yoichi Shinoda; Yoichi Muraoka
The world-wide web (the web, for short) is a large distributed digital information space. It is the most popular internet service and is now indispensable. Since the web itself has no protocols for searching web documents, we need to collect the documents from web servers to build a searchable database.
   In this paper, we propose distributed WWW robots to collect web documents quickly. Our final goal is to collect all of the documents on the web in Japan within one day. Currently, eight distributed WWW robots, whose system code is mostly written in Java with some C, are running in Japan. We have already found 13,320 domains under the jp domain.
   The experimental results show a speedup of 5.8 to 9.7 times when four distributed WWW robots are placed at different locations, in comparison with a single WWW robot. We also expect a speedup of at most about (2.8 x n) times when n WWW robots are used to collect the web documents.

Demonstrations

Presenting Web Site Search Results in Context: A Demonstration BIBAPDF 381
  Michael Chen; Marti A. Hearst
We address search over large, heterogeneous web sites such as those found at universities and within corporate intranets. The goal is to make use of structure implicit within the site to provide context for the retrieved documents, even for those sites for which there is no centralized organization. Most web search engines simply list titles, urls, and abstracts, and thus do not place the results in context.
   We will demonstrate our alternative: a simple but novel approach to organizing and presenting the results of search over the pages of a large, heterogeneous web site. The main idea is to show, for each page matching the query, the path of web links that a user would follow from a root page to the search hit. The resulting interface, which we call Cha-Cha, is a hierarchical characterization of the search results that both shows the context in which the hits appear and educates the user about the structure of the web site.
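   A minimal sketch of the path computation underlying such an interface: breadth-first search from the root page yields, for every reachable page, the shortest chain of links a user could follow to it. Cha-Cha's actual path selection may apply further heuristics.

     from collections import deque

     def paths_from_root(links, root):
         """links: url -> list of outgoing urls. Returns url -> path from root."""
         paths = {root: [root]}
         queue = deque([root])
         while queue:
             page = queue.popleft()
             for nxt in links.get(page, []):
                 if nxt not in paths:
                     paths[nxt] = paths[page] + [nxt]
                     queue.append(nxt)
         return paths

     # For each hit, display paths.get(hit_url, [hit_url]) as its context.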
Personal Browser BIBAPDF 382
  Yi-Shiou Chen; Schy Chiou; Yuan-Kai Wang; Wen-Lian Hsu
Browsers are now among the most popular software for people to surf web pages and retrieve information of interest from the Internet. However, people usually find that the information they search for is scattered across several web pages, and they cannot specify which parts of the content are interesting or how those parts should be presented. In other words, today's browsers are weak in personalization. To enhance the capability of personalization, browsers must have more power in managing the contents of web pages. In this demo, we apply natural language processing (NLP) techniques to create a browser that can be personalized.
Towards a Fast Precision-Oriented Image Retrieval System BIBAPDF 383
  Yves Chiaramella; Philippe Mulhem; Mourad Mechkour; Iadh Ounis; Marius Pasca
The RELIEF image retrieval system is proposed as an integration of object-oriented modeling and Web technology, in an implementation of a logical information retrieval model on the O2 database management system. Following a relational indexing approach, it uses the conceptual graph formalism as the indexing language, and logic and standard information retrieval techniques to speed up the retrieval process. The latter is performed by a polynomial matching function. Moreover, it includes relation-based inference, which refines and extends the original matching function used in conceptual graphs and improves the system's performance in terms of recall/precision. RELIEF is thus a first answer to the challenge of using expressive relational formalisms for accurate indexing in a precision-oriented system, while keeping retrieval time within reasonable bounds.
Teraphim: An Engine for Distributed Information Retrieval BIBAPDF 384
  Owen de Kretser; Alistair Moffat; Justin Zobel
The MG software system has been used for several years as a testbed for research into the implementation of large-scale information retrieval systems. Initially designed as a platform for experiments with document compression techniques, index compression mechanisms, and index construction algorithms, MG has also been used as a tool in the study of query evaluation strategies and similarity heuristics. The MG system has also been used since the inception of TREC as the basis of much of the information retrieval experimentation done in Melbourne at RMIT and the University of Melbourne, as well as several other research centres. In this demonstration we will display the latest version of MG, called Teraphim, which supports distributed information retrieval across local and wide area networks.
Cheshire II: Combining Probabilistic and Boolean Retrieval BIBAPDF 385
  Ray R. Larson
The Cheshire II system was originally designed to apply probabilistic retrieval methods to online library catalog searches in order to help overcome the pervasive twin problems of topical searching of Boolean online catalogs: search failure and information overload [larson96b]. It was intended to be a next-generation online catalog and full-text information retrieval system that would apply probabilistic retrieval methods to simple MARC records and clustered record surrogates. Over time the system has been expanded to include support for full-text SGML documents (ranging from simple document types as used in the TREC database [larson97a] to complex full-text documents encoded using the TEI and EAD DTDs) and support for full-text OCR from scanned page image files linked to SGML bibliographic records. The system is the primary text search engine for the UC Berkeley Environmental Digital Library project sponsored by NSF, NASA, and ARPA. It also provides access to a number of diverse databases via the WWW using an HTTP to Z39.50 gateway. It has also been adopted for use as a search engine in working library environments including the Physical Sciences Libraries at UC Berkeley, the Data Archives at the University of Essex, and the special collections department of the University of Liverpool Library.
A Research Prototype Image Retrieval System BIBAPDF 386
  S. Nepal; M. V. Ramakrishna; J. A. Thom
Many content-based image retrieval systems have been developed in recent years. Some of these systems capture feature information (such as color and texture) and leave semantics to the users; while posing queries, the users can express semantics in terms of feature information. In other systems the semantic information is captured in the system's design; these are suitable for, and designed for, domain-specific applications such as collections of images of human faces. Neither of these approaches allows modeling of semantic information for a collection of unconstrained images. In addition, these systems do not support semantic queries. Capturing and storing semantic information, such as sunset and mountain, is trivial for humans but is a challenge for content-based image retrieval systems. To address this problem, we designed a fuzzy system for content-based image retrieval.
The Structured Information Manager (SIM) BIBAPDF 387
  Ron Sacks-Davis; Alan Kent
SIM, the Structured Information Manager, is an information retrieval system designed to manage multi-gigabyte collections of documents containing text, images, and other forms of data, storing them natively in SGML, MARC, RTF, and ASCII formats. It is a fully-fledged system that provides integrated support for efficient ranked full-text, boolean, and structural querying via a robust database server capable of practically managing multi-gigabyte document collections.
   SIM can be, and has been, used for a variety of applications, including bibliographic and full-text retrieval, legal statute management, and intranet document delivery. SIM represents the concrete implementation of numerous advanced information retrieval research ideas, including text indexing and querying techniques, storage and querying of structured documents, automatic document markup, and the management and versioning of legal documents.
PWA: An Extended Probabilistic Web Algebra BIBAPDF 388
  Dan Smith; Rattasit Sukhahuta
PWA is an extended relational algebra that provides functionality to fetch and query information directly from intranet/internet data sources. In PWA, information extraction is combined with an extended relational algebra to deal with imprecise queries over semistructured information sources.
   The information we access is either not organised as structured records, or is likely to be structured in an inconvenient or unhelpful form. Our information extraction approach is based on a document abstract model in which the document is hierarchically partitioned into regions that contain concepts of interest. Each region corresponds to a limited structural and semantic domain, which allows us to simplify the concept identification process. This lets us adopt an effective structured approach and avoid complex natural language understanding techniques during concept recognition. In PWA, concepts are viewed as complex values in a relational framework. Each concept has an associated set of values and weights, forming a probabilistic relation, which allows a PWA user to optionally specify a maximum and minimum probability range for any concept.
   The implementation of PWA demonstrates how information extraction techniques, combined with an extended relational algebra, allow users to select and manipulate relevant subsets of information from a wide variety of semistructured data sources.
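   A minimal sketch of the probabilistic selection that the optional probability range enables; the actual PWA operators are richer than this:

     def select_concept(tuples, concept, pmin=0.0, pmax=1.0):
         """tuples: list of dicts mapping concept -> (value, weight).
         Keep tuples whose weight for `concept` lies within [pmin, pmax]."""
         return [t for t in tuples
                 if concept in t and pmin <= t[concept][1] <= pmax]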
Cafe: An Indexed Approach to Searching Genomic Databases BIBAPDF 389
  Hugh E. Williams
Genomic databases assist molecular biologists in understanding the biochemical function, chemical structure, and evolutionary history of organisms. Popular systems for searching genomic databases match queries to answers by comparing a query to each of the sequences in the database. Efficiency in such exhaustive systems is crucial, since some servers process over 40,000 queries per day, and resolution of each query can require comparison to over one gigabyte of genomic sequence data. While exhaustive systems are practical at present, they may become prohibitively expensive as genomic databases continue to double in size yearly, and as user numbers and query rates grow.
   We demonstrate our successful, novel indexing and retrieval techniques for querying genomic databases, which are embodied in a full-scale prototype retrieval system, Cafe. The principal features of Cafe are the incorporation of novel and efficient data structures for query resolution and the demonstration that, despite earlier negative results, indexing can be successfully applied to genomic databases. We demonstrate the Cafe search capabilities and show that it has the requisite properties of scalability and efficiency in space and speed, as well as comparable accuracy to existing search systems.
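   As a sketch of why indexing helps here, a plain k-mer inverted index (our stand-in, not Cafe's actual data structures): lookups touch only the postings for the query's substrings rather than every base in the database.

     from collections import defaultdict, Counter

     def build_index(sequences, k=9):
         """sequences: id -> genomic string. Index: k-mer -> set of ids."""
         index = defaultdict(set)
         for sid, seq in sequences.items():
             for i in range(len(seq) - k + 1):
                 index[seq[i:i + k]].add(sid)
         return index

     def search(index, query, k=9, top=10):
         """Rank sequences by the number of query k-mers they share."""
         hits = Counter()
         for i in range(len(query) - k + 1):
             for sid in index.get(query[i:i + k], ()):
                 hits[sid] += 1
         return hits.most_common(top)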
Fast Speculative Search Engine on the Highly Parallel Computer EM-X BIBAPDF 390
  Hayato Yamana; Hanpei Koike; Yuetsu Kodama; Hirofumi Sakane; Yoshinori Yamaguchi
This demonstration presents a new World Wide Web search engine, the "Fast Speculative Search Engine", which uses speculative execution on multiprocessor systems to shorten the total time needed to retrieve information from the WWW.
   The proposed search engine predicts users' next queries and initiates searches with the predicted queries before receiving them, to accelerate narrowing of the search space.
   We have implemented the fast speculative search engine using data speculation on the EM-X, which consists of 80 processors. On the EM-X, idle processors are used to predict the next queries, and no predictions are made when all processors are busy. Thus, we can provide a minimum search service at busy times, when there are many search requests, and a maximum search service at free times, when there are few. We call this control scheme Unlimited Speculative Execution.
   The experimental results, using a data set of WWW pages from our organization, show that 42% of users' queries hit the speculatively searched results.
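   A minimal sketch of the control scheme as we read it; the prediction function and result cache are assumptions standing in for the EM-X implementation:

     def handle_query(query, cache, idle_workers, predict_next, run_search):
         """Serve `query`, reusing a speculative result when prediction hit;
         then speculate on predicted next queries while workers are idle."""
         result = cache.pop(query, None)
         if result is None:
             result = run_search(query)
         for predicted in predict_next(query)[:idle_workers]:
             if predicted not in cache:
                 cache[predicted] = run_search(predicted)  # run on an idle worker
         return result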