Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname:Proceedings of the 24th International ACM SIGIR Conference on Research and Development in Information Retrieval
Location:New Orleans, Louisiana, USA
Dates:2001-Sep-09 to 2001-Sep-12
Standard No:ISBN 0-89791-331-6; ACM Order Number 606010; hcibib: IR01
Applying summarization techniques for term selection in relevance feedback BIBAFull-Text 1-9
  Adenike M. Lam-Adesina; Gareth J. F. Jones
Query-expansion is an effective Relevance Feedback technique for improving performance in Information Retrieval. In general query-expansion methods select terms from the complete contents of relevant documents. One problem with this approach is that expansion terms unrelated to document relevance can be introduced into the modified query due to their presence in the relevant documents and distribution in the document collection. Motivated by the hypothesis that query-expansion terms should only be sought from the most relevant areas of a document, this investigation explores the use of document summaries in query-expansion. The investigation explores the use of both context-independent standard summaries and query-biased summaries. Experimental results using the Okapi BM25 probabilistic retrieval model with the TREC-8 ad hoc retrieval task show that query-expansion using document summaries can be considerably more effective than using full-document expansion. The paper also presents a novel approach to term-selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms. Again, this technique is shown to be more effective than standard methods.
Temporal summaries of news topics BIBAFull-Text 10-18
  James Allan; Rahul Gupta; Vikas Khandelwal
We discuss technology to help a person monitor changes in news coverage over time. We define temporal summaries of news stories as extracting a single sentence from each event within a news topic, where the stories are presented one at a time and sentences from a story must be ranked before the next story can be considered. We explain a method for evaluation, and describe an evaluation corpus that we have built. We also propose several methods for constructing temporal summaries and evaluate their effectiveness in comparison to degenerate cases. We show that simple approaches are effective, but that the problem is far from solved.
Generic text summarization using relevance measure and latent semantic analysis BIBAFull-Text 19-25
  Yihong Gong; Xin Liu
In this paper, we propose two generic text summarization methods that create text summaries by ranking and extracting sentences from the original documents. The first method uses standard IR methods to rank sentence relevances, while the second method uses the latent semantic analysis technique to identify semantically important sentences, for summary creations. Both methods strive to select sentences that are highly ranked and different from each other. This is an attempt to create a summary with a wider coverage of the document's main content and less redundancy. Performance evaluations on the two summarization methods are conducted by comparing their summarization outputs with the manual summaries generated by three independent human evaluators. The evaluations also study the influence of different VSM weighting schemes on the text summarization performances. Finally, the causes of the large disparities in the evaluators' manual summarization results are investigated, and discussions on human text summarization patterns are presented.
A new approach to unsupervised text summarization BIBAFull-Text 26-34
  Tadashi Nomoto; Yuji Matsumoto
The paper presents a novel approach to unsupervised text summarization. The novelty lies in exploiting the diversity of concepts in text for summarization, which has not received much attention in the summarization literature. A diversity-based approach here is a principled generalization of Maximal Marginal Relevance criterion by Carbonell and Goldstein.
   We propose, in addition, an information-centric approach to evaluation, where the quality of summaries is judged not in terms of how well they match human-created summaries but in terms of how well they represent their source documents in IR tasks such as document retrieval and text categorization.
   To find the effectiveness of our approach under the proposed evaluation scheme, we set out to examine how a system with the diversity functionality performs against one without, using the BMIR-J2 corpus, a test data developed by a Japanese research consortium. The results demonstrate a clear superiority of a diversity based approach to a non-diversity based approach.
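For orientation, the Maximal Marginal Relevance criterion that this abstract says it generalizes can be sketched as a greedy selection loop. This is a minimal illustration of plain MMR, not the authors' diversity-based method: the function name `mmr_select`, the trade-off parameter `lam`, and the precomputed similarity inputs are all assumptions made for this example.

```python
def mmr_select(query_sim, pairwise_sim, k, lam=0.7):
    """Greedily pick k item indices by Maximal Marginal Relevance.

    query_sim[i]       -- similarity of item i to the query (assumed given)
    pairwise_sim[i][j] -- similarity between items i and j (assumed given)
    lam                -- hypothetical trade-off; 1.0 means relevance only
    """
    selected = []
    candidates = set(range(len(query_sim)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near duplicates; item 2 is less relevant but different.
query_sim = [0.9, 0.85, 0.5]
pairwise_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
diverse = mmr_select(query_sim, pairwise_sim, k=2, lam=0.5)
relevance_only = mmr_select(query_sim, pairwise_sim, k=2, lam=1.0)
```

With `lam=0.5` the near-duplicate item is skipped in favour of the dissimilar one, whereas `lam=1.0` reduces to ranking by relevance alone.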
Vector-space ranking with effective early termination BIBAFull-Text 35-42
  Vo Ngoc Anh; Owen de Kretser; Alistair Moffat
Considerable research effort has been invested in improving the effectiveness of information retrieval systems. Techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. But such enhancements can add to the cost of evaluating queries. In this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. We describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. That is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations.
Static index pruning for information retrieval systems BIBAFull-Text 43-50
  David Carmel; Doron Cohen; Ronald Fagin; Eitan Farchi; Michael Herscovici; Yoelle S. Maarek; Aya Soffer
We introduce static index pruning methods that significantly reduce the index size in information retrieval systems.
   We investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. In uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. In term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. We give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. We present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index.
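The difference between the two pruning styles described above can be sketched on a toy in-memory index. This is an illustration only: the index layout (a dict of term to `{doc_id: score}`), the function names, and in particular the rule used to derive the per-term cutoff (a fraction `epsilon` of the term's k-th highest score) are assumptions of this sketch; the abstract states only that the cutoff varies from term to term.

```python
def uniform_prune(index, threshold):
    """Remove every posting whose score contribution is at most `threshold`."""
    return {
        term: {doc: s for doc, s in postings.items() if s > threshold}
        for term, postings in index.items()
    }


def term_based_prune(index, k, epsilon):
    """Prune with a cutoff computed separately for each term.

    Assumed rule (hypothetical): cutoff = epsilon * (k-th highest score of
    the term); terms with fewer than k postings get a cutoff of 0.
    """
    pruned = {}
    for term, postings in index.items():
        scores = sorted(postings.values(), reverse=True)
        kth = scores[k - 1] if len(scores) >= k else 0.0
        cutoff = epsilon * kth
        pruned[term] = {doc: s for doc, s in postings.items() if s > cutoff}
    return pruned


# Toy index: term "b" only ever makes small score contributions.
index = {
    "a": {1: 0.9, 2: 0.2, 3: 0.05},
    "b": {1: 0.04, 2: 0.03},
}
uniform = uniform_prune(index, threshold=0.1)
per_term = term_based_prune(index, k=1, epsilon=0.5)
```

In this toy case the fixed threshold wipes out every posting for the low-scoring term, while the per-term cutoff keeps that term searchable and prunes the high-scoring term more aggressively.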
Rank-preserving two-level caching for scalable search engines BIBFull-Text 51-58
  Patricia Correia Saraiva; Edleno Silva de Moura; Nivio Ziviani; Wagner Meira; Rodrigo Fonseca; Berthier Ribeiro-Neto
Using event segmentation to improve indexing of consumer photographs BIBAFull-Text 59-65
  Amanda Stent; Alexander Loui
Automatic albuming -- the automatic organization of photographs, either as an end in itself or for use in other applications -- is an application that promises to be of great assistance to photographers. Relatively sophisticated image content analysis techniques have been used for image indexing, organization and retrieval. In this paper, we describe a method of organizing photographs into events using spoken photograph captions. The results of this process can be used to improve image indexing and retrieval.
Ranking retrieval systems without relevance judgments BIBAFull-Text 66-73
  Ian Soboroff; Charles Nicholas; Patrick Cahan
The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments.
   Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.
Evaluation by highly relevant documents BIBAFull-Text 74-82
  Ellen M. Voorhees
Given the size of the web, the search engine industry has argued that engines should be evaluated by their ability to retrieve highly relevant pages rather than all possible relevant pages. To explore the role highly relevant documents play in retrieval system evaluation, assessors for the TREC-9 web track used a three-point relevance scale and also selected best pages for each topic. The relative effectiveness of runs evaluated by different relevant document sets differed, confirming the hypothesis that different retrieval techniques work better for retrieving highly relevant documents. Yet evaluating by highly relevant documents can be unstable since there are relatively few highly relevant documents. TREC assessors frequently disagreed in their selection of the best page, and subsequent evaluation by best page across different assessors varied widely. The discounted cumulative gain measure introduced by Jarvelin and Kekalainen increases evaluation stability by incorporating all relevance judgments while still giving precedence to highly relevant documents.
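A minimal sketch of a discounted cumulative gain computation may help make the last point concrete. It assumes graded gains listed in ranked order and a logarithmic discount with base `base`, with ranks below the base left undiscounted; the function name and the default base of 2 are choices made for this example rather than details taken from the abstract.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulative gain of a ranked list of graded relevance gains.

    gains[0] is the gain of the top-ranked document. The discount for rank r
    is max(1, log_base(r)), so early ranks count in full.
    """
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        discount = max(1.0, math.log(rank, base))
        total += gain / discount
    return total


# The same judgments, with the highly relevant document ranked first or last.
good_ranking = dcg([2, 1, 0])
poor_ranking = dcg([0, 1, 2])
```

Every judged document contributes to the score, but a highly relevant document contributes more the earlier it appears, which is why `good_ranking` exceeds `poor_ranking`.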
Meta-scoring: automatically evaluating term weighting schemes in IR without precision-recall BIBAFull-Text 83-89
  Rong Jin; Christos Faloutsos; Alex G. Hauptmann
In this paper, we present a method that can automatically evaluate the performance of different term weighting schemes in information retrieval without resorting to precision-recall based on human relevance judgments. Specifically, the problem is: given two document-term matrices generated from two different term weighting schemes, can we tell which term weighting scheme will perform better than the other? We propose a meta-scoring function, which takes as input the document-term matrix generated by some term weighting scheme and computes a goodness score from the document-term matrix. In our experiments, we found that this score is highly correlated with the precision-recall measurement for all the collections and term weighting schemes we tried. Thus, we conclude that our meta-scoring function can be a substitute for the precision-recall measurement that needs relevance judgments from human subjects. Furthermore, this meta-scoring function is not limited to text information retrieval and can be applied to fields such as image and DNA retrieval.
Improving cross language retrieval with triangulated translation BIBAFull-Text 90-95
  Tim Gollins; Mark Sanderson
Most approaches to cross language information retrieval assume that resources providing a direct translation between the query and document languages exist. This paper presents research examining the situation where such an assumption is false. Here, an intermediate (or pivot) language provides a means of transitive translation of the query language to that of the document via the pivot, at the cost, however, of introducing much error. The paper reports the novel approach of translating in parallel across multiple intermediate languages and fusing the results. Such a technique removes the error, raising the effectiveness of the tested retrieval system, up to and possibly above the level expected, had a direct translation route existed. Across a number of retrieval situations and combinations of languages, the approach proves to be highly effective.
Improving query translation for cross-language information retrieval using statistical models BIBAFull-Text 96-104
  Jianfeng Gao; Jian-Yun Nie; Endong Xun; Jian Zhang; Ming Zhou; Changning Huang
Dictionaries have often been used for query translation in cross-language information retrieval (CLIR). However, we are faced with the problem of translation ambiguity, i.e. multiple translations are stored in a dictionary for a word. In addition, a word-by-word query translation is not precise enough. In this paper, we explore several methods to improve the previous dictionary-based query translation. First, as many noun phrases as possible are recognized and translated as a whole by using statistical models and phrase translation patterns. Second, the best word translations are selected based on the cohesion of the translation words. Our experimental results on the TREC English-Chinese CLIR collection show that these techniques result in significant improvements over the simple dictionary approaches, and achieve even better performance than a high-quality machine translation system.
Evaluating a probabilistic model for cross-lingual information retrieval BIBAFull-Text 105-110
  Jinxi Xu; Ralph Weischedel; Chanh Nguyen
This work proposes and evaluates a probabilistic cross-lingual retrieval system. The system uses a generative model to estimate the probability that a document in one language is relevant, given a query in another language. An important component of the model is translation probabilities from terms in documents to terms in a query. Our approach is evaluated when 1) the only resource is a manually generated bilingual word list, 2) the only resource is a parallel corpus, and 3) both resources are combined in a mixture model. The combined resources produce about 90% of monolingual performance in retrieving Chinese documents. For Spanish the system achieves 85% of monolingual performance using only a pseudo-parallel Spanish-English corpus. Retrieval results are comparable with those of the structural query translation technique (Pirkola, 1998) when bilingual lexicons are used for query translation. When parallel texts in addition to conventional lexicons are used, it achieves better retrieval results but requires more computation than the structural query translation technique. It also produces slightly better results than using a machine translation system for CLIR, but the improvement over the MT system is not significant.
Document language models, query models, and risk minimization for information retrieval BIBAFull-Text 111-119
  John Lafferty; Chengxiang Zhai
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
Relevance based language models BIBAFull-Text 120-127
  Victor Lavrenko; W. Bruce Croft
We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel technique for estimating these probabilities using the query alone. We demonstrate that our technique can produce highly accurate relevance models, addressing important notions of synonymy and polysemy. Our experiments show relevance models outperforming baseline language modeling systems on TREC retrieval and TDT tracking tasks. The main contribution of this work is an effective formal method for estimating a relevance model with no training data.
A statistical learning model of text classification for support vector machines BIBAFull-Text 128-136
  Thorsten Joachims
This paper develops a theoretical learning model of text classification for Support Vector Machines (SVMs). It connects the statistical properties of text-classification tasks with the generalization performance of a SVM in a quantitative way. Unlike conventional approaches to learning text classifiers, which rely primarily on empirical evidence, this model explains why and when SVMs perform well for text classification. In particular, it addresses the following questions: Why can support vector machines handle the large feature spaces in text classification effectively? How is this related to the statistical properties of text? What are sufficient conditions for applying SVMs to text-classification problems successfully?
A study of thresholding strategies for text categorization BIBAFull-Text 137-145
  Yiming Yang
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the test bed, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
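The three common strategies named in this abstract can be sketched as follows. This is a hedged illustration based only on the short descriptions given (rank-based, proportion-based, score-based): the data layout `{doc: {category: score}}`, the function names, and the parameters `t`, `quota`, and `thresholds` are assumptions of this example, and the tuning of per-category thresholds on held-out data is not shown.

```python
def rcut(scores, t=1):
    """Rank-based: give each document its t highest-scoring categories."""
    return {
        doc: set(sorted(cats, key=cats.get, reverse=True)[:t])
        for doc, cats in scores.items()
    }


def pcut(scores, quota):
    """Proportion-based: give category c to its quota[c] best-scoring docs.

    Needs the whole batch of documents at once, so it cannot decide
    document by document.
    """
    assigned = {doc: set() for doc in scores}
    for category, n in quota.items():
        ranked = sorted(
            scores, key=lambda d: scores[d].get(category, 0.0), reverse=True
        )
        for doc in ranked[:n]:
            assigned[doc].add(category)
    return assigned


def scut(scores, thresholds):
    """Score-based: assign category c when the score reaches thresholds[c]."""
    return {
        doc: {c for c, s in cats.items() if s >= thresholds[c]}
        for doc, cats in scores.items()
    }


# Hypothetical classifier scores for two documents and two categories.
scores = {
    "d1": {"sports": 0.8, "politics": 0.3},
    "d2": {"sports": 0.4, "politics": 0.35},
}
by_rank = rcut(scores, t=1)
by_quota = pcut(scores, quota={"sports": 1, "politics": 1})
by_score = scut(scores, thresholds={"sports": 0.5, "politics": 0.3})
```

On this toy input the three strategies already disagree: `rcut` labels both documents "sports", `pcut` splits the categories between them, and `scut` gives the first document both labels.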
On feature distributional clustering for text categorization BIBAFull-Text 146-153
  Ron Bekkerman; Ran El-Yaniv; Naftali Tishby; Yoad Winter
We describe a text categorization approach that is based on a combination of feature distributional clusters with a support vector machine (SVM) classifier. Our feature selection approach employs distributional clustering of words via the recently introduced information bottleneck method, which generates a more efficient word-cluster representation of documents. Combined with the classification power of an SVM, this method yields high performance text categorization that can outperform other recent methods in terms of categorization accuracy and representation efficiency. Comparing the accuracy of our method with other techniques, we observe significant dependency of the results on the data set. We discuss the potential reasons for this dependency.
Iterative residual rescaling BIBAFull-Text 154-162
  Rie Kubota Ando; Lillian Lee
We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI), showing a precise relationship between its performance and the uniformity of the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando's (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.
Expressive retrieval from XML documents BIBAFull-Text 163-171
  Taurai Tapiwa Chinenyanga; Nicholas Kushmerick
The emergence of XML as a standard interchange format for structured documents/data has given rise to many XML query language proposals. However, some of these languages do not support information retrieval-style ranked queries based on textual similarity. There have been several extensions to these query languages to support keyword search, but the resulting query languages cannot express queries such as "find books and CDs with similar titles". Either these extensions use keywords as mere boolean filters, or similarities can be calculated only between data values and constants rather than two data values. We propose ELIXIR, an expressive and efficient language for XML information retrieval that extends the query language XML-QL [6],[7] with a textual similarity operator. ELIXIR is a general-purpose XML information retrieval language, sufficiently expressive to handle the above query. Our algorithm for answering ELIXIR queries rewrites the original ELIXIR query into a series of XML-QL queries that generate intermediate relational data, and uses relational database techniques to efficiently evaluate the similarity operators on this intermediate data, yielding an XML document with nodes ranked by similarity. Our experiments demonstrate that our prototype scales well with the size of the XML data and complexity of the query.
XIRQL: a query language for information retrieval in XML documents BIBAFull-Text 172-180
  Norbert Fuhr; Kai Gross
Based on the document-centric view of XML, we present the query language XIRQL. Current proposals for XML query languages lack most IR-related features, which are weighting and ranking, relevance-oriented search, datatypes with vague predicates, and semantic relativism. XIRQL integrates these features by using ideas from logic-based probabilistic IR models, in combination with concepts from the database area. For processing XIRQL queries, a path algebra is presented, that also serves as a starting point for query optimization.
Empirical investigations on query modification using abductive explanations BIBAFull-Text 181-189
  Ian Ruthven; Mounia Lalmas; Keith van Rijsbergen
In this paper we report on a series of experiments designed to investigate query modification techniques motivated by the area of abductive reasoning. In particular we use the notion of abductive explanation, explanations being a description of data that highlight important features of the data. We describe several methods of creating abductive explanations, exploring term reweighting and query reformulation techniques and demonstrate their suitability for relevance feedback.
Generic summaries for indexing in information retrieval BIBAFull-Text 190-198
  Tetsuya Sakai; Karen Sparck-Jones
This paper examines the use of generic summaries for indexing in information retrieval. Our main observations are that: (1) With or without pseudo-relevance feedback, a summary index may be as effective as the corresponding fulltext index for precision-oriented search of highly relevant documents. But a reasonably sophisticated summarizer, using a compression ratio of 10-30%, is desirable for this purpose. (2) In pseudo-relevance feedback, using a summary index at initial search and a fulltext index at final search is possibly effective for precision-oriented search, regardless of relevance levels. This strategy is significantly more effective than the one using the summary index only and probably more effective than using summaries as mere term selection filters. For this strategy, the summary quality is probably not a critical factor, and a compression ratio of 5-10% appears best.
Automatic generation of concise summaries of spoken dialogues in unrestricted domains BIBAFull-Text 199-207
  Klaus Zechner
Automatic summarization of open domain spoken dialogues is a new research area. This paper introduces the task, the challenges involved, and presents an approach to obtain automatic extract summaries for multi-party dialogues of four different genres, without any restriction on domain. We address the following issues which are intrinsic to spoken dialogue summarization and typically can be ignored when summarizing written text such as newswire data: (i) detection and removal of speech disfluencies; (ii) detection and insertion of sentence boundaries; (iii) detection and linking of cross-speaker information units (question-answer pairs). A global system evaluation using a corpus of 23 relevance annotated dialogues containing 80 topical segments shows that for the two more informal genres, our summarization system using dialogue specific components significantly outperforms a baseline using TFIDF term weighting with maximum marginal relevance ranking (MMR).
Enhanced topic distillation using text, markup tags, and hyperlinks BIBAFull-Text 208-216
  Soumen Chakrabarti; Mukul Joshi; Vivek Tawde
Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
Transparent Queries: investigating users' mental models of search engines BIBAFull-Text 217-224
  Jack Muramatsu; Wanda Pratt
Typically, commercial Web search engines provide very little feedback to the user concerning how a particular query is processed and interpreted. Specifically, they apply key query transformations without the user's knowledge. Although these transformations have a pronounced effect on query results, users have very few resources for recognizing their existence and understanding their practical importance. We conducted a user study to gain a better understanding of users' knowledge of and reactions to the operation of several query transformations that web search engines automatically employ. Additionally, we developed and evaluated Transparent Queries, a software system designed to provide users with lightweight feedback about opaque query transformations. The results of the study suggest that users do indeed have difficulties understanding the operation of query transformations without additional assistance. Finally, although transparency is helpful and valuable, interfaces that allow direct control of query transformations might ultimately be more helpful for end-users.
Why batch and user evaluations do not give the same results BIBAFull-Text 225-231
  Andrew H. Turpin; William Hersh
Much system-oriented evaluation of information retrieval systems has used the Cranfield approach based upon queries run against test collections in a batch mode. Some researchers have questioned whether this approach can be applied to the real world, but little data exists for or against that assertion. We have studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance as measured by relevance-based metrics in batch studies did not correspond with the results of outcomes based on real user searching tasks. The experiments in this paper analyzed those results to determine why this occurred. Our assessment showed that while the queries entered by real users into systems yielding better results in batch studies gave comparable gains in ranking of relevant documents for those users, they did not translate into better performance on specific tasks. This was most likely due to users being able to adequately find and utilize relevant documents ranked further down the output list.
Evaluating a content based image retrieval system BIBAFull-Text 232-240
  Sharon McDonald; Ting-Sheng Lai; John Tait
Content Based Image Retrieval (CBIR) presents special challenges in terms of how image data is indexed, accessed, and how end systems are evaluated. This paper discusses the design of a CBIR system that uses global colour as the primary indexing key, and a user-centered evaluation of the system's visual search tools. The results indicate that users are able to make use of a range of visual search tools, and that different tools are used at different points in the search process. The results also show that the provision of a structured navigation and browsing tool can support image retrieval, particularly in situations in which the user does not have a target image in mind. The results are discussed in terms of their implications for the design of visual search tools, and their implications for the use of user-centered evaluation for CBIR systems.
Evaluating topic-driven web crawlers BIBAFull-Text 241-249
  Filippo Menczer; Gautam Pant; Padmini Srinivasan; Miguel E. Ruiz
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
Effective site finding using link anchor information BIBAFull-Text 250-257
  Nick Craswell; David Hawking; Stephen Robertson
Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main entry point of a specific Web site. In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula. We obtained these results using two sets of 100 queries on an 18.5 million document set and another set of 100 on a 0.4 million document set. This site finding effectiveness begins to explain why many search engines have adopted link methods. It also opens a rich new area for effectiveness improvement, where traditional methods fail.
Stable algorithms for link analysis BIBAFull-Text 258-266
  Andrew Y. Ng; Alice X. Zheng; Michael I. Jordan
The Kleinberg HITS and the Google PageRank algorithms are eigenvector methods for identifying "authoritative" or "influential" articles, given hyperlink or citation information. That such algorithms should give reliable or consistent answers is surely a desideratum, and in [10], we analyzed when they can be expected to give stable rankings under small perturbations to the linkage patterns. In this paper, we extend the analysis and show how it gives insight into ways of designing stable link analysis methods. This in turn motivates two new algorithms, whose performance we study empirically using citation data and web hyperlink data.
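To make the phrase "eigenvector methods" concrete, here is a minimal sketch of the basic HITS iteration in Python. It is not one of the stable algorithms the paper proposes; the function name, the dictionary-based graph format, and the fixed iteration count are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """Basic HITS power iteration (illustrative sketch).

    links maps each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries with unit Euclidean norm.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {n: sum(auth[dst] for dst in links.get(n, ())) for n in nodes}
        # Normalise both vectors so the scores do not grow without bound.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
    return hub, auth
```

Repeating the two update steps converges towards the principal eigenvectors of the co-citation and coupling matrices, which is why small changes to the link structure can affect the resulting ranking.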
Modeling score distributions for combining the outputs of search engines BIBAFull-Text 267-275
  R. Manmatha; T. Rath; F. Feng
In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data for not only probabilistic search engines like INQUERY but also vector space search engines like SMART for English. We have also used this model to fit the output of other search engines like LSI search engines and search engines indexing other languages like Chinese.
   It is then shown that given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics.
   This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
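The mapping from scores to probabilities described in this abstract can be sketched with Bayes' rule over the two components. The Python below is only an illustration: it assumes the mixture parameters (the prior `prior_rel`, the exponential `rate`, and the normal `mean` and `std`) have already been fitted, and all function and parameter names are hypothetical rather than taken from the paper.

```python
import math

def exponential_pdf(score, rate):
    """Exponential density, used here for the scores of non-relevant documents."""
    return rate * math.exp(-rate * score) if score >= 0 else 0.0

def normal_pdf(score, mean, std):
    """Normal density, used here for the scores of relevant documents."""
    z = (score - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def prob_relevant(score, prior_rel, rate, mean, std):
    """Posterior P(relevant | score) under the two-component mixture."""
    rel = prior_rel * normal_pdf(score, mean, std)
    non = (1.0 - prior_rel) * exponential_pdf(score, rate)
    total = rel + non
    return rel / total if total > 0 else 0.0

def average_probabilities(per_engine_probs):
    """Combine engines by averaging one document's probabilities across engines."""
    return sum(per_engine_probs) / len(per_engine_probs)
```

Because every engine's scores are mapped onto the same probability scale, the averaged values are comparable across engines even when the raw scores are not.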
Models for metasearch BIBAFull-Text 276-284
  Javed A. Aslam; Mark Montague
Given the ranked lists of documents returned by multiple search engines in response to a given query, the problem of metasearch is to combine these lists in a way which optimizes the performance of the combination. This paper makes three contributions to the problem of metasearch: (1) We describe and investigate a metasearch model based on an optimal democratic voting procedure, the Borda Count; (2) we describe and investigate a metasearch model based on Bayesian inference; and (3) we describe and investigate a model for obtaining upper bounds on the performance of metasearch algorithms. Our experimental results show that metasearch algorithms based on the Borda and Bayesian models usually outperform the best input system and are competitive with, and often outperform, existing metasearch strategies. Finally, our initial upper bounds demonstrate that there is much to learn about the limits of the performance of metasearch.
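As a rough illustration of the voting idea behind the Borda Count model, the following Python sketch fuses several ranked lists. The point scheme for documents that a list does not rank (an even split of that list's leftover points) is an assumption of this sketch, as are the function name and the alphabetical tie-break; each input list is assumed to contain no duplicates.

```python
from collections import defaultdict

def borda_fuse(ranked_lists):
    """Fuse ranked lists of document ids with a Borda count (illustrative sketch)."""
    candidates = {doc for ranking in ranked_lists for doc in ranking}
    n = len(candidates)
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # The top document receives n points, the next n - 1, and so on.
        for position, doc in enumerate(ranking):
            scores[doc] += n - position
        unranked = candidates - set(ranking)
        if unranked:
            # Points 1 .. n - len(ranking) were not awarded; share them evenly.
            leftover = sum(range(1, n - len(ranking) + 1))
            share = leftover / len(unranked)
            for doc in unranked:
                scores[doc] += share
    # Highest total first; ties are broken by document id for determinism.
    return sorted(candidates, key=lambda d: (-scores[d], d))
```

For example, `borda_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])` places `"b"` first, since two of the three input lists rank it at the top.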
The score-distributional threshold optimization for adaptive binary classification tasks BIBAFull-Text 285-293
  Avi Arampatzis; André van Hameren
The thresholding of document scores has proved critical for the effectiveness of classification tasks. We review the most important approaches to thresholding, and introduce the score-distributional (S-D) threshold optimization method. The method is based on score distributions and is capable of optimizing any effectiveness measure defined in terms of the traditional contingency table.
   As a byproduct, we provide a model for score distributions, and demonstrate its high accuracy in describing empirical data. The estimation method can be performed incrementally, a highly desirable feature for adaptive environments. Our work in modeling score distributions is useful beyond threshold optimization problems. It directly applies to other retrieval environments that make use of score distributions, e.g., distributed retrieval, or topic detection and tracking.
   The most accurate version of S-D thresholding -- although incremental -- can be computationally heavy. Therefore, we also investigate more practical solutions. We suggest practical approximations and discuss adaptivity, threshold initialization, and incrementality issues. The practical version of S-D thresholding has been tested in the context of the TREC-9 Filtering Track and found to be very effective [2].
Maximum likelihood estimation for filtering thresholds BIBAFull-Text 294-302
  Yi Zhang; Jamie Callan
Information filtering systems based on statistical retrieval models usually compute a numeric score indicating how well each document matches each profile. Documents with scores above profile-specific dissemination thresholds are delivered.
   An optimal dissemination threshold is one that maximizes a given utility function based on the distributions of the scores of relevant and non-relevant documents. The parameters of the distribution can be estimated using relevance information, but relevance information obtained while filtering is biased. This paper presents a new method of adjusting dissemination thresholds that explicitly models and compensates for this bias. The new algorithm, which is based on the Maximum Likelihood principle, jointly estimates the parameters of the density distributions for relevant and non-relevant documents and the ratio of relevant documents in the corpus. Experiments with TREC-8 and TREC-9 Filtering Track data demonstrate the effectiveness of the algorithm.
A meta-learning approach for text categorization BIBAFull-Text 303-309
  Wai Lam; Kwok-Yin Lai
We investigate a meta-model approach, called Meta-learning Using Document Feature characteristics (MUDOF), for the task of automatic textual document categorization. It employs a meta-learning phase using document feature characteristics. Document feature characteristics, derived from the training document set, capture some inherent category-specific properties of a particular category. Different from existing categorization methods, MUDOF can automatically recommend a suitable algorithm for each category based on the category-specific statistical characteristics. Hence, different algorithms may be employed for different categories. Experiments have been conducted on a real-world document collection demonstrating the effectiveness of our approach. The results confirm that our meta-model approach can exploit the advantage of its component algorithms, and demonstrate a better performance than existing algorithms.
Unsupervised and supervised clustering for topic tracking BIBAFull-Text 310-317
  Martin Franz; Todd Ward; J. Scott McCarley; Wei-Jing Zhu
We investigate important differences between two styles of document clustering in the context of Topic Detection and Tracking. Converting a Topic Detection system into a Topic Tracking system exposes fundamental differences between these two tasks that are important to consider in both the design and the evaluation of TDT systems. We also identify features that can be used in systems for both tasks.
Intelligent information triage BIBAFull-Text 318-326
  Sofus A. Macskassy; Foster Provost
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is sufficient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many information-retrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled, training corpora automatically. Such corpora can be used to train text classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether a news story will be followed by many, very similar news stories, and also whether the stock price of one or more companies associated with a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers can be critical to success.
Discovering information flow using high dimensional conceptual space BIBAFull-Text 327-333
  Dawei Song; Peter Bruza
This paper presents an informational inference mechanism realized via the use of a high dimensional conceptual space. More specifically, we claim to have operationalized important aspects of Gärdenfors's recent three-level cognitive model. The connectionist level is primed with the Hyperspace Analogue to Language (HAL) algorithm which produces vector representations for use at the conceptual level. We show how inference at the symbolic level can be implemented by employing Barwise and Seligman's theory of information flow. This article also features heuristics for enhancing HAL-based representations via the use of quality properties, determining concept inclusion and computing concept composition. The worth of these heuristics in underpinning informational inference is demonstrated via a series of experiments. These experiments, though small in scale, show that informational inference proposed in this article has a very different character to the semantic associations produced by the Minkowski distance metric and concept similarity computed via the cosine coefficient. In short, informational inference generally uncovers concepts that are carried, or, in some cases, implied by another concept (or combination of concepts).
A study of smoothing methods for language models applied to Ad Hoc information retrieval BIBAFull-Text 334-342
  Chengxiang Zhai; John Lafferty
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
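The ranking idea in this abstract, scoring each document by the likelihood of the query under a smoothed document language model, can be sketched as follows. The abstract does not name the smoothing methods compared, so this Python example uses linear interpolation with a collection model (commonly called Jelinek-Mercer smoothing) purely as an assumed illustration; the function name and the default `lam=0.5` are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_length, lam=0.5):
    """Log-likelihood of the query under an interpolated document language model.

    lam is the weight on the collection model; lam=0 gives the unsmoothed
    maximum likelihood estimate, which assigns zero probability to unseen terms.
    """
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_length if doc_length else 0.0
        p_coll = collection_counts.get(term, 0) / collection_length
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in both document and collection
        score += math.log(p)
    return score
```

Without smoothing, a document missing a single query term receives a likelihood of zero no matter how well it matches the rest of the query, which is the data sparseness problem that smoothing addresses.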
Topic segmentation with an aspect hidden Markov model BIBAFull-Text 343-348
  David M. Blei; Pedro J. Moreno
We present a novel probabilistic method for topic segmentation on unstructured text. One previous approach to this problem utilizes the hidden Markov model (HMM) method for probabilistically modeling sequence data [7]. The HMM treats a document as mutually independent sets of words generated by a latent topic variable in a time series. We extend this idea by embedding Hofmann's aspect model for text [5] into the segmenting HMM to form an aspect HMM (AHMM). In doing so, we provide an intuitive topical dependency between words and a cohesive segmentation model. We apply this method to segment unbroken streams of New York Times articles as well as noisy transcripts of radio programs on SpeechBot, an online audio archive indexed by an automatic speech recognition engine. We provide experimental comparisons which show that the AHMM outperforms the HMM for this task.
Finding topic words for hierarchical summarization BIBAFull-Text 349-357
  Dawn Lawrie; W. Bruce Croft; Arnold Rosenberg
Hierarchies have long been used for organization, summarization, and access to information. In this paper we define summarization in terms of a probabilistic language model and use the definition to explore a new technique for automatically generating topic hierarchies by applying a graph-theoretic algorithm, which is an approximation of the Dominating Set Problem. The algorithm efficiently chooses terms according to a language model. We compare the new technique to previous methods proposed for constructing topic hierarchies including subsumption and lexical hierarchies, as well as the top TF.IDF terms. Our results show that the new technique consistently performs as well as or better than these other techniques. They also show the usefulness of hierarchies compared with a list of terms.
Exploiting redundancy in question answering BIBAFull-Text 358-365
  Charles L. A. Clarke; Gordon V. Cormack; Thomas R. Lynam
Our goal is to automatically answer brief factual questions of the form "When was the Battle of Hastings?" or "Who wrote The Wind in the Willows?". Since the answer to nearly any such question can now be found somewhere on the Web, the problem reduces to finding potential answers in large volumes of data and validating their accuracy. We apply a method for arbitrary passage retrieval to the first half of the problem and demonstrate that answer redundancy can be used to address the second half. The success of our approach depends on the idea that the volume of available Web data is large enough to supply the answer to most factual questions multiple times and in multiple contexts. A query is generated from a question and this query is used to select short passages that may contain the answer from a large collection of Web data. These passages are analyzed to identify candidate answers. The frequency of these candidates within the passages is used to "vote" for the most likely answer. The approach is experimentally tested on questions taken from the TREC-9 question-answering test collection. As an additional demonstration, the approach is extended to answer multiple choice trivia questions of the form typically asked in trivia quizzes and television game shows.
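The voting step described in this abstract can be illustrated with a short Python sketch. It assumes that candidate answers have already been extracted from each retrieved passage; counting a candidate at most once per passage, lowercasing candidates, and breaking ties alphabetically are simplifications of this sketch rather than details from the paper.

```python
from collections import Counter

def vote_for_answer(candidates_per_passage):
    """Return the candidate that appears in the most passages, or None."""
    votes = Counter()
    for candidates in candidates_per_passage:
        # One vote per passage per candidate, so a single repetitive
        # passage cannot dominate the result (an assumption of this sketch).
        votes.update({c.lower() for c in candidates})
    if not votes:
        return None
    # Most votes wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda c: (-votes[c], c))
```

The more often independent passages repeat the same candidate, the more votes it collects, which is how redundancy in the collection stands in for explicit answer validation.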
High performance question/answering BIBAFull-Text 366-374
  Marius A. Pasca; Sandra M. Harabagiu
In this paper we present the features of a Question/Answering (Q/A) system that had unparalleled performance in the TREC-9 evaluations. We explain the accuracy of our system through the unique characteristics of its architecture: (1) usage of a wide-coverage answer type taxonomy; (2) repeated passage retrieval; (3) lexico-semantic feedback loops; (4) extraction of the answers based on machine learning techniques; and (5) answer caching. Experimental results show the effects of each feature on the overall performance of the Q/A system and lead to general conclusions about Q/A from large text collections.
Searcher performance in question answering BIBAFull-Text 375-381
  Mingfang Wu; Michael Fuller; Ross Wilkinson
There are many tasks that require information finding. Some can be largely automated, and others greatly benefit from successful interaction between system and searcher. We are interested in the task of answering questions where some synthesis of information is required: the answer would not generally be given from a single passage of a single document. We investigate whether variation in the way a list of documents is delivered affects searcher performance in the question answering task. We will show that there is a significant difference in performance using a list customized to the task type, compared with a standard web-engine list. This indicates that paying attention to the task and the searcher interaction may provide substantial improvement in task performance.
Toward an improved concept-based information retrieval system BIBAFull-Text 384-385
  Peter V. Henstock; Daniel J. Pack; Young-Suk Lee; Clifford J. Weinstein
This paper presents a novel information retrieval system that includes 1) the addition of concepts to facilitate the identification of the correct word sense, 2) a natural language query interface, 3) the inclusion of weights and penalties for proper nouns that build upon the Okapi weighting scheme, and 4) a term clustering technique that exploits the spatial proximity of search terms in a document to further improve the performance. The effectiveness of the system is validated by experimental results.
Metasearch consistency BIBAFull-Text 386-387
  Mark Montague; Javed A. Aslam
We investigate the performance of metasearch algorithms in terms of how much they improve consistency. We find that three different metasearch algorithms, each over three datasets, usually improve the consistency of search results; sometimes the improvement is dramatic. Furthermore, consistency tends to improve when performance improves.
Anchor text mining for translation extraction of query terms BIBAFull-Text 388-389
  Wen-Hsiang Lu; Lee-Feng Chien; Hsi-Jian Lee
This paper presents an approach to automatically extracting the bilingual translations of many Web query terms through mining the Web anchor texts. Some preliminary experiments are conducted using 109,416 Web pages containing both Chinese and English anchor texts in their in-links to extract Chinese translations of 200 English queries selected from popular query terms in Taiwan. It is found that effective translations of 75% of the popular query terms can be extracted, of which 87.2% cannot be obtained from common translation dictionaries.
Selecting expansion terms in automatic query expansion BIBFull-Text 390-391
  Hiroko Mano; Yasushi Ogawa
An experimental framework for email categorization and management BIBAFull-Text 392-393
  Kenrick Mock
Many problems are difficult to adequately explore until a prototype exists in order to elicit user feedback. One such problem is a system that automatically categorizes and manages email. Due to a myriad of user interface issues, a prototype is necessary to determine what techniques and technologies are effective in the email domain. This paper describes the implementation of an add-in for Microsoft Outlook 2000™ that intends to address two problems with email: 1) to help manage the inbox by automatically classifying email based on user folders, and 2) to aid in search and retrieval by providing a list of email relevant to the selected item. This add-in represents a first step in an experimental system for the study of other issues related to information management. The system has been set up to allow experimentation with other classification algorithms and the source code is available online in an effort to promote further experimentation.
Analyses of multiple-evidence combinations for retrieval strategies BIBFull-Text 394-395
  Abdur Chowdhury; Ophir Frieder; David Grossman; Catherine McCabe
Flexible pseudo-relevance feedback using optimization tables BIBFull-Text 396-397
  Tetsuya Sakai; Stephen E. Robertson
Quantifying the utility of parallel corpora BIBAFull-Text 398-399
  Martin Franz; J. Scott McCarley; Todd Ward; Wei-Jing Zhu
Our English-Chinese cross-language IR system is trained from parallel corpora; we investigate its performance as a function of training corpus size for three different training corpora. We find that the performance of the system as trained on the three parallel corpora can be related by a simple measure, namely the out-of-vocabulary rate of query words.
Unitary operators for fast latent semantic indexing (FLSI) BIBAFull-Text 400-401
  Eduard Hoenkamp
Latent Semantic Indexing (LSI) dramatically reduces the dimension of the document space by mapping it into a space spanned by conceptual indices. Empirically, the number of concepts that can represent the documents is far smaller than the great variety of words in the textual representation. Although this almost obviates the problem of lexical matching, the mapping incurs a high computational cost compared to document parsing, indexing, query matching, and updating. This paper shows how LSI is based on a unitary transformation, for which there are computationally more attractive alternatives. This is exemplified by the Haar transform, which is memory efficient, and can be computed in linear to sublinear time. The principal advantages of LSI are thus preserved while the computational costs are drastically reduced.
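To show why the Haar transform is attractive computationally, the sketch below implements a standard orthonormal Haar transform in Python. It is a generic textbook construction, not necessarily the variant used in the paper; the function name and the restriction to input lengths that are powers of two are assumptions of this sketch.

```python
import math

def haar_transform(values):
    """Orthonormal Haar transform of a sequence whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = [float(v) for v in values]
    scale = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums (coarse part) and differences (detail part).
        sums = [(out[2 * i] + out[2 * i + 1]) / scale for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / scale for i in range(half)]
        out[:length] = sums + diffs
        length = half  # recurse on the coarse part only
    return out
```

Each pass touches half as many entries as the one before, so the total work is n + n/2 + n/4 + ..., which is linear in n. The division by the square root of two at every level keeps the transform norm-preserving, which is the property a unitary transformation requires.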
Probabilistic combination of content and links BIBAFull-Text 402-403
  Rong Jin; Susan Dumais
Previous research has shown that citations and hypertext links can be usefully combined with document content to improve retrieval. Links can be used in many ways, e.g., link topology can be used to identify important pages, anchor text can be used to augment the text of cited pages, and activation can be spread to linked pages. This paper introduces a probabilistic model that integrates content matching and these three uses of link information in a single unified framework. Experiments with a web collection show benefits for link information especially for general queries.
Structure and content-based segmentation of speech transcripts BIBAFull-Text 404-405
  Dulce Ponceleon; Savitha Srinivasan
An algorithm for the segmentation of an audio/video source into topically cohesive segments based on automatic speech recognition (ASR) transcriptions is presented. A novel two-pass algorithm is described that combines a boundary-based method with a content-based method. In the first pass, the temporal proximity and the rate of arrival of ngram features are analyzed in order to compute an initial segmentation. In the content-based second pass, changes in content-bearing words are detected by using the ngram features as queries in an information-retrieval system. The second pass validates the initial segments and merges them as needed. Feasibility of the segmentation task can vary enormously depending on the structure of the audio content, and the accuracy of ASR. For real-world corporate training data our method identifies, at worst, a single salient segment of the audio and, at best, a high-level table-of-contents. We illustrate the algorithm in detail with some examples and validate the results with segmentation boundaries generated manually.
Text summarization via hidden Markov models BIBAFull-Text 406-407
  John M. Conroy; Dianne P. O'Leary
A sentence extract summary of a document is a subset of the document's sentences that contains the main ideas in the document. We present an approach to generating such summaries, a hidden Markov model that judges the likelihood that each sentence should be contained in the summary. We compare the results of this method with summaries generated by humans, showing that we obtain significantly higher agreement than do earlier methods.
Reading time, scrolling and interaction: exploring implicit sources of user preferences for relevance feedback BIBFull-Text 408-409
  Diane Kelly; Nicholas J. Belkin
Interactive phrase browsing within compressed text BIBFull-Text 410-411
  Raymond Wan; Alistair Moffat
Query-biased web page summarisation: a task-oriented evaluation BIBAFull-Text 412-413
  Ryen White; Joemon M. Jose; Ian Ruthven
We present a system that offers a new way of assessing web document relevance and a new approach to the web-based evaluation of such a system. Provisionally named WebDocSum, the system is a query-biased web page summariser that aims to provide an alternative to the short, irrelevant abstracts typical of many web search result lists. Based on an initial evaluation the system appears to be more useful in helping users gauge document relevance than the traditional ranked titles/abstracts approach.
Query expansion based on predictive algorithms for collaborative filtering BIBFull-Text 414-415
  Keiichiro Hoashi; Kazunori Matsumoto; Naomi Inoue; Kazuo Hashimoto
Query optimization for vector space problems BIBAFull-Text 416-417
  K. Goda; T. Tamura; M. Kitsuregawa; A. Chowdhury; O. Frieder
We present performance measurement results for a parallel SQL based information retrieval system implemented on a PC cluster system. We used the Web-TREC dataset under a left-deep query execution plan. We achieved satisfactory speed up.
Generic topic segmentation of document texts BIBAFull-Text 418-419
  Marie-Francine Moens; Rik De Busser
Topic segmentation is an important initial step in many text-based tasks. A hierarchical representation of a text's topics is useful in retrieval and allows judging relevancy at different levels of detail. This short paper describes research on generic algorithms for topic detection and segmentation that are applicable to texts of heterogeneous types and domains.
Towards the use of prosodic information for spoken document retrieval BIBFull-Text 420-421
  Fabio Crestani
A homogeneous framework to model relevance feedback BIBAFull-Text 422-423
  David E. Losada; Alvaro Barreiro
Relevance feedback is an appreciated process to produce increasingly better retrieval. Usually, positive feedback plays a fundamental role in the feedback process whereas the role of negative feedback is limited. We think that negative feedback is a promising precision oriented mechanism and we propose a logical framework in which positive and negative feedback are homogeneously modeled. Evaluation results against small test collections are provided.
Combining semantic and syntactic document classifiers to improve first story detection BIBAFull-Text 424-425
  Nicola Stokes; Joe Carthy
In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate if a composite representation can improve the online detection of novel events in a stream of broadcast news stories. This classification process, otherwise known as first story detection (FSD) (or, in the Topic Detection and Tracking pilot study, as online new event detection [1]), is one of three main classification tasks defined by the TDT initiative. Our composite document representation consists of a semantic representation (based on the lexical chains derived from a text) and a syntactic representation (using proper nouns). Using the TDT1 evaluation methodology, we evaluate a number of document representation combinations using these document classifiers.
Browsing in a digital library collecting linearly arranged documents BIBAFull-Text 426-427
  Yanhua Qu; Keizo Sato; Makoto Nakashima; Tetsuro Ito
A method of assisting a user in finding the required documents effectively is proposed. A user who is informed which documents are worth examining can browse a digital library (DL) in a linear fashion. Computational evaluations were carried out, and a DL and its navigator were designed and constructed.
Feature selection for polyphonic music retrieval BIBFull-Text 428-429
  Jeremy Pickens
Automatic information extraction from web pages BIBAFull-Text 430-431
  Budi Rahardjo; Roland H. C. Yap
Many web pages have implicit structure. In this paper, we show the feasibility of automatically extracting data from web pages by using approximate matching techniques. This can be applied to generate automatic wrappers or to notify/display web page differences, web page change monitoring, etc.
Automatic web search query generation to create minority language corpora BIBAFull-Text 432-433
  Rayid Ghani; Rosie Jones; Dunja Mladenic
The Web is a valuable source of language specific resources but collecting, organizing and utilizing this information is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries to collect documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant and a subset of documents is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for finding documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Applying the same approach to multiple languages shows that our system generalizes to a variety of languages.
Perceptual consistency improves image retrieval performance BIBAFull-Text 434-435
  Huizhong Long; Wee Kheng Leow
An ideal retrieval system should retrieve images that satisfy the user's need, and should, therefore, measure image similarity in a manner consistent with human perception. However, existing computational similarity measures are not perceptually consistent. This paper proposes an approach to improving retrieval performance by improving the perceptual consistency of computational similarity measures for textures based on relevance feedback judgments.
Intelligent object-based image retrieval using cluster-driven personal preference learning BIBAFull-Text 436-437
  Kyong-Mi Lee; W. Nick Street
This paper introduces a personalization method for image retrieval based on the learning of personal preferences. The proposed system indexes objects based on shape and groups them into a set of clusters, or prototypes. Our personalization method refines corresponding prototypes from objects provided by the user in the foreground, and simultaneously adapts the database index in the background.
Construction of a hierarchical classifier schema using a combination of text-based and image-based approaches BIBAFull-Text 438-439
  Cheng Lu; Mark S. Drew
Web document hierarchical classification approaches often rely on textual features alone even though web pages include multimedia data. We propose a new hierarchical integrated web classification approach that combines image-based and text-based approaches. Instead of using a flat classifier to combine text and image classification, we perform classification on a hierarchy differently on different levels of the tree, using text for branches and images only at leaves. The results of our experiments show that the use of the hierarchical structure improved web document classification performance significantly.
A method based on the chi-square test for document classification BIBAFull-Text 440-441
  Michael Oakes; Robert Gaaizauskas; Helene Fowkes; Anna Jonsson; Vincent Wan; Micheline Beaulieu
We introduce a method for document classification based on using the chi-square test to identify characteristic vocabulary of document classes.
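As a rough illustration of using the chi-square test to find characteristic vocabulary, the sketch below scores each term against each class with a 2x2 contingency table. The abstract gives no implementation details, so the function names, the restriction to positively associated terms, and the toy data are all assumptions of this sketch rather than the authors' method.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def characteristic_terms(docs_by_class, top_k=3):
    """Rank, per class, the terms most strongly associated with that class."""
    all_docs = [(label, set(doc))
                for label, docs in docs_by_class.items() for doc in docs]
    vocab = set().union(*(tokens for _, tokens in all_docs))
    result = {}
    for label in docs_by_class:
        scored = []
        for term in vocab:
            a = sum(1 for l, t in all_docs if l == label and term in t)
            b = sum(1 for l, t in all_docs if l != label and term in t)
            c = sum(1 for l, t in all_docs if l == label and term not in t)
            d = sum(1 for l, t in all_docs if l != label and term not in t)
            # Keep only terms that are over-represented in the class.
            if a * d > b * c:
                scored.append((chi_square(a, b, c, d), term))
        scored.sort(key=lambda s: (-s[0], s[1]))
        result[label] = [term for _, term in scored[:top_k]]
    return result


# Hypothetical toy corpus, already tokenized.
docs = {
    "sport": [["goal", "team"], ["goal", "match"]],
    "finance": [["bank", "rate"], ["bank", "market"]],
}
top = characteristic_terms(docs, top_k=1)
```

A new document could then be assigned to the class whose characteristic vocabulary it overlaps most, though that decision rule is likewise an assumption here.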
Query clustering using content words and user feedback BIBAFull-Text 442-443
  Ji-Rong Wen; Jian-Yun Nie; Hong-Jiang Zhang
Query clustering is crucial for automatically discovering frequently asked queries (FAQs) or most popular topics on a question-answering search engine. Due to the short length of queries, the traditional approaches based on keywords are not suitable for query clustering. This paper describes our attempt to cluster similar queries according to their contents as well as the document click information in the user logs.
Modifications of Kleinberg's HITS algorithm using matrix exponentiation and web log records BIBFull-Text 444-445
  Joel C. Miller; Gregory Rae; Fred Schaefer; Lesley A. Ward; Thomas LoFaro; Ayman Farahat
Cite me, cite my references?: (Scholarly use of the ACM SIGIR proceedings based on two citation indexes) BIBAFull-Text 446-447
  Elana Broch
A three-part study was designed to document Internet use in scholarly research, using the Annual SIGIR Conference Proceedings from 1997 through 1999. The results suggest an increasing trend toward electronic self-publishing. Furthermore, while electronic availability did not ensure that one would be cited, the most highly cited articles were available on the "free" web. The study also found that electronic availability has not, in most cases, decreased the length of time between publication and citation.
iFind: a web image search engine BIBFull-Text 450
  Zheng Chen; Liu Wenyin; Chunhui Hu; Mingjing Li; Hong-Jiang Zhang
Building interoperable digital library services: MARIAN, open archives, and the NDLTD BIBAFull-Text 451
  Edward A. Fox; Robert France; Marcos Andre Goncalves; Hussein Suleman
In this demonstration, we present interoperable and personalized search services for the Networked Digital Library of Theses and Dissertations (NDLTD). Using standard protocols and software, including those specified by the Open Archives Initiative (OAI), distributed sites can share metadata easily. On top of these harvesting protocols, we implement a union collection of theses managed by the MARIAN digital library system. Our demonstration covers aspects of NDLTD, OAI, and MARIAN.
AUTINDEX: an automatic multilingual indexing system BIBFull-Text 452
  Barbel Ripplinger; Paul Schmidt
Does visualization improve our ability to find and learn from internet based information? BIBFull-Text 453
  Daniel A. Kauwell; Jim Levin; Hwan Jo Yu; Young Jin Lee; Jeff Ellen; Arun Bahalla
The HySpirit retrieval platform BIBFull-Text 454
  Thomas Rolleke; Ralf Lubeck; Gabriella Kazai
Distributed resource discovery and structured data searching with Cheshire II BIBAFull-Text 455
  Ray R. Larson
This demonstration will describe the construction and application of Cross-Domain Information Servers using features of the standard Z39.50 information retrieval protocol [Z39.50]. The system is currently being used to build and search distributed indexes for databases with disparate structured data (SGML and XML). We use the Z39.50 Explain Database to determine the databases and indexes of a given server, then use the Z39.50 SCAN facility to extract the contents of the indexes. This information is used to build collection documents that can be retrieved using probabilistic retrieval algorithms.
Searching the deep web: distributed explorit directed query applications BIBAFull-Text 456
  Valerie S. Allen; Abe Lederman
In 1999 a directed query distributed search engine was integrated into a new Department of Energy Virtual Library of Energy Science and Technology. Millions of pages of government information across multiple agencies were made immediately searchable via one query, setting the stage for the development of a variety of interagency initiatives and applications.
CROWSE: a system for organizing repositories and web search results BIBFull-Text 457
  A Kinshuman; Sudeshna Sarkar
MS read: user modeling in the web environment BIBAFull-Text 458
  Natasa Milic-Frayling; Ralph Sommerer
MS Read is a prototype application, implemented as an extension of the Web browser, that creates an evolving model of the user's topic of interest. It uses that model to analyze documents that are accessed while searching and browsing the Web. In the presented version of MS Read, the model is used to highlight topic-related terminology in the documents. The MS Read model of the user need is created by applying natural language processing to search queries captured within the browser and to topic descriptions explicitly provided by the user while browsing and reading documents. It is semantically enhanced using linguistic and custom knowledge resources.