
Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Micheline Beaulieu; Ricardo Baeza-Yates; Sung Hyon Myaeng
Location: Tampere, Finland
Dates: 2002-Aug-11 to 2002-Aug-15
Standard No: ISBN 1-58113-561-0; ACM Order Number: 606020; hcibib: IR02
  1. Web Information Retrieval
  2. Information Retrieval Theory
  3. User Studies
  4. Filtering
  5. Summarization
  6. Text Categorization
  7. Cross-language Information Retrieval
  8. Clustering
  9. Efficiency
  10. Collaborative Filtering
  11. Arabic Information Retrieval
  12. Queries
  13. Evaluation
  14. Multimedia
  15. Poster session
  16. Demo session
Landmarks in information retrieval: the message out of the bottle BIBAFull-Text 1
  Keith Van Rijsbergen
For many years I have wanted to give a talk like this: look back on our subject, identify the high (and perhaps low) points, consider what worked, what did not work, and speculate a little about the future. Now that I at last have the opportunity to give such a talk the realisation has dawned just how difficult it is to do justice to the topic. The only way out of this difficulty for me is to emphasise that this is a personal account, based on my involvement with the field since 1968, and that errors of omission and commission are not deliberate but simply due to lack of knowledge and time on my part.

Web Information Retrieval

Impact transformation: effective and efficient web retrieval BIBAFull-Text 3-10
  Vo Ngoc Anh; Alistair Moffat
We extend the applicability of impact transformation, a technique for adjusting the term weights assigned to documents so as to boost the effectiveness of retrieval when short queries are applied to large document collections. In conjunction with techniques called quantization and thresholding, impact transformation allows improved query execution rates compared to traditional vector-space similarity computations, as the number of arithmetic operations can be reduced. The transformation also facilitates a new dynamic query pruning heuristic. We give results based upon the TREC web data that show that the combination of these techniques yields highly competitive retrieval, in terms of both effectiveness and efficiency, for both short and long queries.
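The quantization step the abstract mentions can be sketched as mapping real-valued term weights onto a small range of integer impacts, so that similarity computation needs only cheap integer arithmetic. This is a minimal illustration; the bit width and the linear mapping are assumptions, not the paper's exact transformation:

```python
def quantise_impacts(weights, bits=5):
    """Map real-valued term weights to integer 'impacts' in [1, 2**bits - 1].
    A linear scaling is assumed here purely for illustration."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    return [1 + int((w - lo) / (hi - lo + 1e-12) * (levels - 1))
            for w in weights]

print(quantise_impacts([0.1, 0.5, 0.9]))
```

Ranking with small integer impacts also makes thresholding and pruning natural, since partial scores are bounded and comparable across terms.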
Analysis of lexical signatures for finding lost or related documents BIBAFull-Text 11-18
  Seung-Taek Park; David M. Pennock; C. Lee Giles; Robert Krovetz
A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
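The three main selection heuristics the study compares (DF, TF, TFIDF) can be sketched as follows; the toy document and corpus statistics are invented for illustration:

```python
import math
from collections import Counter

def lexical_signature(doc_terms, doc_freq, n_docs, k=5, method="TFIDF"):
    """Pick k terms from a document as its lexical signature.
    'DF' minimises document frequency (rarest terms),
    'TF' maximises within-document frequency,
    'TFIDF' maximises tf * idf."""
    tf = Counter(doc_terms)
    def score(term):
        df = doc_freq.get(term, 1)
        if method == "DF":
            return -df                             # rarer terms score higher
        if method == "TF":
            return tf[term]
        return tf[term] * math.log(n_docs / df)    # TFIDF
    return sorted(set(doc_terms), key=score, reverse=True)[:k]

# Hypothetical toy statistics: doc_freq maps term -> number of corpus
# documents containing it, out of n_docs.
doc = ["retrieval", "web", "signature", "signature", "page", "the", "the", "the"]
dfs = {"retrieval": 120, "web": 900, "signature": 3, "page": 700, "the": 1000}
print(lexical_signature(doc, dfs, n_docs=1000, k=3, method="DF"))
# -> ['signature', 'retrieval', 'page']
```

The DF variant deliberately surfaces rare terms, which is why it identifies the exact page well but generalises poorly to finding related documents.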
Using sampled data and regression to merge search engine results BIBAFull-Text 19-26
  Luo Si; Jamie Callan
This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. The prior research on this problem either assumed the exchange of statistics necessary for normalizing scores (cooperative solutions) or is heuristic. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. Documents sampled for creating resource descriptions can also be used to create a sample centralized index, and this index is a source of training data for adaptive results merging algorithms. A variety of experiments demonstrate that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results merging function.
The Importance of Prior Probabilities for Entry Page Search BIBAFull-Text 27-34
  Wessel Kraaij; Thijs Westerveld; Djoerd Hiemstra
An important class of searches on the world-wide web has the goal of finding the entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search; indeed, a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links, and URL form. The URL form in particular proved to be a good predictor: using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.
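How a URL-form prior might be folded into a language-model score can be sketched as follows. The four categories mirror the usual root/subroot/path/file distinction, but the prior values and the classification rules below are illustrative assumptions, not the paper's fitted estimates:

```python
import math

# Illustrative priors for P(entry page | URL form); root pages are far more
# likely to be entry pages than deep files. Values are invented.
URL_PRIOR = {"root": 0.6, "subroot": 0.25, "path": 0.1, "file": 0.05}

def url_form(url):
    """Classify a URL by the shape of its path component (assumed rules)."""
    rest = url.split("://", 1)[-1].split("/", 1)
    path = rest[1].strip("/") if len(rest) > 1 else ""
    if not path:
        return "root"
    if "/" not in path and "." not in path:
        return "subroot"                  # a single top-level directory
    return "file" if "." in path.rsplit("/", 1)[-1] else "path"

def entry_page_score(query_loglik, url):
    """Combine content score with the prior:
    log P(d|q) ~ log P(q|d) + log P(entry page | URL form of d)."""
    return query_loglik + math.log(URL_PRIOR[url_form(url)])
```

With equal content scores, a root URL then outranks a deep file URL, which is exactly the behaviour the prior is meant to encode.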

Information Retrieval Theory

Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term BIBAFull-Text 35-41
  Djoerd Hiemstra
This paper follows a formal approach to information retrieval based on statistical language models. By introducing some simple reformulations of the basic language modeling approach we introduce the notion of importance of a query term. The importance of a query term is an unknown parameter that explicitly models which of the query terms are generated from the relevant documents (the important terms), and which are not (the unimportant terms). The new language modeling approach is shown to explain a number of practical facts of today's information retrieval systems that are not very well explained by the current state of information retrieval theory, including stop words, mandatory terms, coordination level ranking and retrieval using phrases.
Title language model for information retrieval BIBAFull-Text 42-48
  Rong Jin; Alex G. Hauptmann; Cheng Xiang Zhai
In this paper, we propose a new language model, namely, a title language model, for information retrieval. Different from the traditional language model used for retrieval, we define the conditional probability P(Q|D) as the probability of using query Q as the title for document D. We adopted the statistical translation model learned from the title and document pairs in the collection to compute the probability P(Q|D). To avoid the sparse data problem, we propose two new smoothing methods. In the experiments with four different TREC document collections, the title language model for information retrieval with the new smoothing method outperforms both the traditional language model and the vector space model for IR significantly.
Two-stage language models for information retrieval BIBAFull-Text 49-56
  ChengXiang Zhai; John Lafferty
The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to -- or better than -- the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
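The two-stage smoothing formula is compact enough to sketch directly. Here mu and lam are set by hand rather than by the paper's leave-one-out and document-mixture estimators, and the collection model stands in for the query-background model as a simplification:

```python
import math
from collections import Counter

def two_stage_logprob(query, doc, p_coll, mu=1000.0, lam=0.7):
    """log P(query | doc) under two-stage smoothing.
    Stage 1: Dirichlet-smooth the document model with the collection
             model (parameter mu).
    Stage 2: interpolate with a background model (parameter lam); the
             collection model is reused as the background here."""
    tf, dlen = Counter(doc), len(doc)
    total = 0.0
    for w in query:
        p_c = p_coll.get(w, 1e-9)
        p_stage1 = (tf[w] + mu * p_c) / (dlen + mu)
        total += math.log(lam * p_stage1 + (1 - lam) * p_c)
    return total

# Toy example: the on-topic document should score higher.
p_coll = {"language": 0.02, "model": 0.02, "weather": 0.02, "sunny": 0.02}
d1 = ["language", "model"] * 20
d2 = ["weather", "sunny"] * 20
q = ["language", "model"]
print(two_stage_logprob(q, d1, p_coll) > two_stage_logprob(q, d2, p_coll))  # -> True
```

The decomposition reflects the paper's point: mu accounts for the document collection (estimation of document models), while lam accounts for query effects such as verbosity.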

User Studies

Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes BIBAFull-Text 57-64
  Ryen W. White; Ian Ruthven; Joemon M. Jose
In this paper we present an evaluation of techniques that are designed to encourage web searchers to interact more with the results of a web search. Two specific techniques are examined: the presentation of sentences that highly match the searcher's query, and the use of implicit evidence. Implicit evidence is evidence captured from the searcher's interaction with the retrieval results and is used to automatically update the display. Our evaluation concentrates on the effectiveness of these techniques and on the subjects' perception of them. The results show, with statistical significance, that the techniques are effective and efficient for information seeking.
Predicting category accesses for a user in a structured information space BIBAFull-Text 65-72
  Mao Chen; Andrea S. LaPaugh; Jaswinder Pal Singh
In a categorized information space, predicting users' information needs at the category level can facilitate personalization, caching and other topic-oriented services. This paper presents a two-phase model to predict the category of a user's next access based on previous accesses. Phase 1 generates a snapshot of a user's preferences among categories based on a temporal and frequency analysis of the user's access history. Phase 2 uses the computed preferences to make predictions at different category granularities. Several alternatives for each phase are evaluated, using the rating behaviors of on-line raters as the form of access considered. The results show that a method based on re-access pattern and frequency analysis of a user's whole history has the best prediction quality, even over a path-based method (Markov model) that uses the combined history of all users.
Detecting and Browsing Events in Unstructured text BIBAFull-Text 73-80
  David A. Smith
Previews and overviews of large, heterogeneous information resources help users comprehend the scope of collections and focus on particular subsets of interest. For narrative documents, questions of "what happened? where? and when?" are natural points of entry. Building on our earlier work at the Perseus Project with detecting terms, place names, and dates, we have exploited co-occurrences of dates and place names to detect and describe likely events in document collections. We compare statistical measures for determining the relative significance of various events. We have built interfaces that help users preview likely regions of interest for a given range of space and time by plotting the distribution and relevance of various collocations. Users can also control the amount of collocation information in each view. Once particular collocations are selected, the system can identify key phrases associated with each possible event to organize browsing of the documents themselves.

Filtering

Novelty and redundancy detection in adaptive filtering BIBAFull-Text 81-88
  Yi Zhang; Jamie Callan; Thomas Minka
This paper addresses the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents. It argues that relevance and redundancy should each be modelled explicitly and separately. Five redundancy measures are proposed and evaluated in experiments with and without redundancy thresholds. The experimental results demonstrate that the cosine similarity metric and a redundancy measure based on a mixture of language models are both effective for identifying redundant documents.
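The cosine-similarity redundancy measure can be sketched as flagging a document whose maximum cosine similarity to any previously delivered document exceeds a threshold (the threshold value here is arbitrary):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ta, tb = Counter(a), Counter(b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_redundant(doc, delivered, threshold=0.8):
    """Flag doc as redundant if it is too similar to any delivered doc."""
    return any(cosine(doc, d) >= threshold for d in delivered)
```

In an adaptive filter this check runs after the relevance decision: only documents judged relevant are tested against the history of already-delivered documents.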
Improving realism of topic tracking evaluation BIBAFull-Text 89-96
  Anton Leuski; James Allan
Topic tracking and information filtering are models of interactive tasks, but their evaluations are generally done in a way that does not reflect likely usage. The models either force frequent judgments or disallow any at all, assume the user is always available to make a judgment, and do not allow for user fatigue. In this study we extend the evaluation framework for topic tracking to incorporate those more realistic issues. We demonstrate that tracking can be done in a realistic interactive setting with minimal impact on tracking cost and with substantial reduction in required interaction.
Bayesian online classifiers for text classification and filtering BIBAFull-Text 97-104
  Kian Ming Adam Chai; Hai Leong Chieu; Hwee Tou Ng
This paper explores the use of Bayesian online classifiers to classify text documents. Empirical results indicate that these classifiers are comparable with the best text classification systems. Furthermore, the online approach offers the advantage of continuous learning in the batch-adaptive text filtering task.

Summarization

The use of unlabeled data to improve supervised learning for text summarization BIBAFull-Text 105-112
  Massih-Reza Amini; Patrick Gallinari
With the huge amount of information available electronically, there is an increasing demand for automatic text summarization systems. The use of machine learning techniques for this task allows one to adapt summaries to the user needs and to the corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches attempt to generate summaries by extracting sentence segments and adopt the supervised learning paradigm which requires to label documents at the text span level. This is a costly process, which puts strong limitations on the applicability of these methods. We investigate here the use of semi-supervised algorithms for summarization. These techniques make use of few labeled data together with a larger amount of unlabeled data. We propose new semi-supervised algorithms for training classification models for text summarization. We analyze their performances on two data sets -- the Reuters news-wire corpus and the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC. We perform comparisons with a baseline -- non learning -- system, and a reference trainable summarizer system.
Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering BIBAFull-Text 113-120
  Hongyuan Zha
A novel method for simultaneous keyphrase extraction and generic text summarization is proposed by modeling text documents as weighted undirected and weighted bipartite graphs. Spectral graph clustering algorithms are used for partitioning sentences of the documents into topical groups with sentence link priors being exploited to enhance clustering quality. Within each topical group, saliency scores for keyphrases and sentences are generated based on a mutual reinforcement principle. The keyphrases and sentences are then ranked according to their saliency scores and selected for inclusion in the top keyphrase list and summaries of the document. The idea of building a hierarchy of summaries for documents capturing different levels of granularity is also briefly discussed. Our method is illustrated using several examples from news articles, news broadcast transcripts and web documents.
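The mutual reinforcement principle amounts to alternating updates over a term-sentence weight matrix, i.e. a power iteration toward the leading singular vectors; a minimal sketch with toy weights, ignoring the spectral-clustering stage:

```python
import math

def mutual_reinforcement(W, iters=50):
    """W[i][j]: weight linking term i to sentence j.
    Alternates u = W v (term saliency from sentence saliency) and
    v = W^T u (sentence saliency from term saliency), normalising each
    round -- a power iteration toward the leading singular vectors of W."""
    n_terms, n_sents = len(W), len(W[0])
    u = [1.0] * n_terms
    v = [1.0] * n_sents
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n_sents)) for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        v = [sum(W[i][j] * u[i] for i in range(n_terms)) for j in range(n_sents)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return u, v
```

A term appearing in salient sentences becomes salient, and a sentence containing salient terms becomes salient; the fixed point gives both rankings at once.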
Cross-document summarization by concept classification BIBAFull-Text 121-128
  Hilda Hardy; Nobuyuki Shimizu; Tomek Strzalkowski; Liu Ting; Xinyang Zhang; G. Bowden Wise
In this paper we describe a Cross Document Summarizer XDoX designed specifically to summarize large document sets (50-500 documents and more). Such sets of documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX works by identifying the most salient themes within the set (at the granularity level that is regulated by the user) and composing an extraction summary, which reflects these main themes. In the current version, XDoX is not optimized to produce a summary based on a few unrelated documents; indeed, such summaries are best obtained simply by concatenating summaries of individual documents. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).

Text Categorization

Unsupervised document classification using sequential information maximization BIBAFull-Text 129-136
  Noam Slonim; Nir Friedman; Naftali Tishby
We present a novel sequential clustering algorithm motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the mutual information, as required by the original IB principle, and its time and space complexity are significantly improved, being typically linear in the data size. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading a cluster's recall for higher precision, and show how this approach can extract clusters that match the existing topics of the corpus almost perfectly.
Topic difference factor extraction between two document sets and its application to text categorization BIBAFull-Text 137-144
  Takahiko Kawatani
To improve performance in text categorization, it is important to extract distinctive features for each class. This paper proposes topic difference factor analysis (TDFA) as a method to extract projection axes that reflect topic differences between two document sets. Suppose all sentence vectors that compose each document are projected onto projection axes. TDFA obtains the axes that maximize the ratio of the sums of squared projections between the two document sets by solving a generalized eigenvalue problem. The axes are called topic difference factors (TDFs). By applying TDFA to the set of documents that belongs to a given class and the set of documents that an existing classifier misclassifies as belonging to that class, we can obtain features that take large values in the given class but small ones in other classes, as well as features that take large values in other classes but small ones in the given class. A classifier was constructed by applying the above features to complement the kNN classifier. As a result, the micro-averaged F1 measure for Reuters-21578 improved from 83.69% to 87.27%.
Text genre classification with genre-revealing and subject-revealing features BIBAFull-Text 145-150
  Yong-Bae Lee; Sung Hyon Myaeng
Subject or propositional content has been the focus of most classification research. Genre or style, on the other hand, is a different and important property of text, and automatic text genre classification is becoming important for classification and retrieval purposes as well as for some natural language processing research. In this paper, we present a method for automatic genre classification that is based on statistically selected features obtained from both subject-classified and genre-classified training data. The experimental results show that the proposed method outperforms a direct application of a statistical learner often used for subject classification. We also observe that the deviation and discrimination formulas based on document frequency ratios work as expected. We conjecture that this dual feature set approach can be generalized to improve the performance of subject classification as well.
A new family of online algorithms for category ranking BIBAFull-Text 151-158
  Koby Crammer; Yoram Singer
We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms we present are simple to implement and are time and memory efficient. We evaluate the algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora the algorithms we present outperform adaptations to topic-ranking of Rocchio's algorithm and the Perceptron algorithm. We also outline the formal analysis of the algorithm in the mistake bound model. To our knowledge, this work is the first to report performance results with the entire new Reuters corpus.
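A Perceptron-style topic-ranking update of the kind the paper compares against can be sketched as promoting relevant topics over irrelevant ones whenever they are mis-ranked. The paper's own family of algorithms distributes the update differently; this shows only the baseline flavour, with invented toy data:

```python
def topic_rank_update(W, x, relevant, lr=1.0):
    """One online update for topic ranking over a multi-labeled document.
    W[t] is the (sparse, dict-based) weight vector for topic t; x maps
    feature -> value. For every pair where a relevant topic is scored at
    or below an irrelevant one, promote the relevant topic's weights and
    demote the irrelevant one's."""
    scores = [sum(w.get(f, 0.0) * v for f, v in x.items()) for w in W]
    for r in relevant:
        for s in range(len(W)):
            if s not in relevant and scores[s] >= scores[r]:
                for f, v in x.items():
                    W[r][f] = W[r].get(f, 0.0) + lr * v
                    W[s][f] = W[s].get(f, 0.0) - lr * v
```

After a mistake-driven update, the document's relevant topics move above its irrelevant ones, which is the ranking criterion the mistake-bound analysis works with.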

Cross-language Information Retrieval

Comparing cross-language query expansion techniques by degrading translation resources BIBAFull-Text 159-166
  Paul McNamee; James Mayfield
The quality of translation resources is arguably the most important factor affecting the performance of a cross-language information retrieval system. While many investigations have explored the use of query expansion techniques to combat errors induced by translation, no study has yet examined the effectiveness of these techniques across resources of varying quality. This paper presents results using parallel corpora and bilingual wordlists that have been deliberately degraded prior to query translation. Across different languages, translingual resources, and degrees of resource degradation, pre-translation query expansion is tremendously effective. In several instances, pre-translation expansion results in better performance when no translations are available, than when an uncompromised resource is used without pre-translation expansion. We also demonstrate that post-translation expansion using relevance feedback can confer modest performance gains. Measuring the efficacy of these techniques with resources of different quality suggests an explanation for the conflicting reports that have appeared in the literature.
Statistical cross-language information retrieval using n-best query translations BIBAFull-Text 167-174
  Marcello Federico; Nicola Bertoldi
This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates most probable term-by-term translations of the query, and a query-document model, which evaluates the likelihood of each document and translation. Integration of the two scores is performed over the set of N most probable translations of the query. Experimental results with values N=1, 5, 10 are presented on the Italian-English bilingual track data used in the CLEF 2000 and 2001 evaluation campaigns.
Cross-lingual relevance models BIBAFull-Text 175-182
  Victor Lavrenko; Martin Choquette; W. Bruce Croft
We propose a formal model of Cross-Language Information Retrieval that does not rely on either query translation or document translation. Our approach leverages recent advances in language modeling to directly estimate an accurate topic model in the target language, starting with a query in the source language. The model integrates popular techniques of disambiguation and query expansion in a unified formal framework. We describe how the topic model can be estimated with either a parallel corpus or a dictionary. We test the framework by constructing Chinese topic models from English queries and using them in the CLIR task of TREC9. The model achieves performance around 95% of the strong mono-lingual baseline in terms of average precision. In initial precision, our model outperforms the mono-lingual baseline by 20%. The main contribution of this work is the unified formal model which integrates techniques that are essential for effective Cross-Language Retrieval.
Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations BIBAFull-Text 183-190
  Jianfeng Gao; Ming Zhou; Jian-Yun Nie; Hongzhao He; Weijun Chen
Bilingual dictionaries have been commonly used for query translation in cross-language information retrieval (CLIR). However, we are faced with the problem of translation selection. Several recent studies suggested the utilization of term co-occurrences in this selection. This paper presents two extensions to improve them. First, we extend the basic co-occurrence model by adding a decaying factor that decreases the mutual information when the distance between the terms increases. Second, we incorporate a triple translation model, in which syntactic dependence relations (represented as triples) are integrated. Our evaluation on translation accuracy shows that translating triples as units is more precise than a word-by-word translation. Our CLIR experiments show that the addition of the decaying factor leads to substantial improvements of the basic co-occurrence model; and the triple translation model brings further improvements.
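The decaying factor can be sketched as weighting each co-occurrence by an exponential in the term distance before any mutual-information statistics are computed; window size and decay rate below are arbitrary:

```python
import math
from collections import Counter

def decayed_cooccurrence(tokens, window=5, decay=0.5):
    """Count co-occurrences within a window, each occurrence weighted by
    exp(-decay * (distance - 1)), so adjacent pairs count most. These
    weighted counts would then feed a mutual-information computation."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                pair = tuple(sorted((w1, tokens[i + d])))
                counts[pair] += math.exp(-decay * (d - 1))
    return counts
```

The effect is that translation candidates that tend to occur adjacent to each other accumulate more evidence than candidates that merely share a window.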

Clustering

Document clustering with cluster refinement and model selection capabilities BIBAFull-Text 191-198
  Xin Liu; Yihong Gong; Wei Xu; Shenghuo Zhu
In this paper, we propose a document clustering method that strives to achieve: (1) a high accuracy of document clustering, and (2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability). To accurately cluster the given document corpus, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability is achieved by introducing randomness in the cluster initialization stage and then discovering the number of clusters C for which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations exhibit clear superiority of the proposed method in both document clustering and model selection accuracy. The evaluations also demonstrate how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
Document clustering with committees BIBAFull-Text 199-206
  Patrick Pantel; Dekang Lin
Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
Probabilistic combination of text classifiers using reliability indicators: models and results BIBAFull-Text 207-214
  Paul N. Bennett; Susan T. Dumais; Eric Horvitz
The intuition that different text classifiers behave in qualitatively different ways has long motivated attempts to build a better metaclassifier via some combination of classifiers. We introduce a probabilistic method for combining classifiers that considers the context-sensitive reliabilities of contributing classifiers. The method harnesses reliability indicators -- variables that provide a valuable signal about the performance of classifiers in different situations. We provide background, present procedures for building metaclassifiers that take into consideration both reliability indicators and classifier outputs, and review a set of comparative studies undertaken to evaluate the methodology.

Efficiency

Efficient phrase querying with an auxiliary index BIBAFull-Text 215-221
  Dirk Bahle; Hugh E. Williams; Justin Zobel
Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.
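A nextword index pairs each firstword with the words that follow it and their positions; a minimal in-memory sketch with hypothetical data (no compression; single-word queries would go to the conventional inverted file, as in the paper's combined design):

```python
from collections import defaultdict

def build_nextword_index(docs):
    """index[firstword][nextword] -> list of (doc_id, position) pairs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, words in docs.items():
        for pos in range(len(words) - 1):
            index[words[pos]][words[pos + 1]].append((doc_id, pos))
    return index

def phrase_query(index, phrase):
    """Match a phrase (two or more words) by intersecting the postings of
    consecutive word pairs, aligned back to the phrase's start position."""
    words = phrase.split()
    hits = set(index[words[0]].get(words[1], []))
    for k in range(1, len(words) - 1):
        nxt = {(d, p - k) for d, p in index[words[k]].get(words[k + 1], [])}
        hits &= nxt
    return sorted({d for d, _ in hits})

docs = {1: "the old night keeper keeps the keep".split(),
        2: "in the big old house".split()}
idx = build_nextword_index(docs)
print(phrase_query(idx, "the old night"))  # -> [1]
```

Because each posting already encodes a word pair, a phrase of n words needs only n-1 intersections, with no per-position verification against a separate positional index.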
Compression of inverted indexes For fast query evaluation BIBAFull-Text 222-229
  Falk Scholer; Hugh E. Williams; John Yiannis; Justin Zobel
Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency.
   In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
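Variable-byte coding, the byte-aligned family the experiments favour, can be sketched as seven payload bits per byte with the high bit marking the final byte of each integer. This is one common formulation; the paper's optimised variants differ in detail:

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 data bits per byte,
    most significant group first, high bit set on each number's last byte."""
    out = bytearray()
    for n in numbers:
        groups = []
        while True:
            groups.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        groups.reverse()
        out.extend(groups[:-1])           # continuation bytes, high bit clear
        out.append(groups[-1] | 0x80)     # final byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode."""
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:
            nums.append(n)
            n = 0
    return nums
```

In an inverted list these codes would be applied to d-gaps (differences between successive document numbers) rather than raw identifiers, which keeps most values small enough to fit in one or two bytes.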
Set-based model: a new approach for information retrieval BIBAFull-Text 230-237
  Bruno Possas; Nivio Ziviani; Wagner Meira, Jr.; Berthier Ribeiro-Neto
The objective of this paper is to present a new technique for computing term weights for index terms, which leads to a new ranking mechanism, referred to as the set-based model. The components in our model are no longer terms, but termsets. The novelty is that we compute term weights using a data mining technique called association rules, which is time efficient and yet yields nice improvements in retrieval effectiveness. The set-based model function for computing the similarity between a document and a query considers the termset frequency in the document and its scarcity in the document collection. Experimental results show that our model improves the average precision of the answer set for all three collections evaluated. For the TREC-3 collection, our set-based model led to a gain, relative to the standard vector space model, of 37% in average precision curves and of 57% in average precision for the top 10 documents. Like the vector space model, the set-based model has time complexity that is linear in the number of documents in the collection.

Collaborative Filtering

Collaborative filtering with privacy via factor analysis BIBAFull-Text 238-245
  John Canny
Collaborative filtering (CF) is valuable in e-commerce, and for direct recommendations for music, movies, news etc. But today's systems have several disadvantages, including privacy risks. As we move toward ubiquitous computing, there is a great potential for individuals to share all kinds of information about places and things to do, see and buy, but the privacy risks are severe. In this paper we describe a new method for collaborative filtering which protects the privacy of individual data. The method is based on a probabilistic factor analysis model. Privacy protection is provided by a peer-to-peer protocol which is described elsewhere, but outlined in this paper. The factor analysis approach handles missing data without requiring default values for them. We give several experiments that suggest that this is the most accurate method for CF to date. The new algorithm has other advantages in speed and storage over previous algorithms. Finally, we suggest applications of the approach to other kinds of statistical analyses of survey or questionnaire data.
Inverted file search algorithms for collaborative filtering BIBAFull-Text 246-252
  Rickard Coster; Martin Svensson
This paper explores the possibility of using a disk-based inverted file structure for collaborative filtering. Our hypothesis is that this allows for faster calculation of predictions, and also that early termination heuristics may be used to further speed up the filtering process and perhaps even improve the quality of the predictions. We tested this hypothesis in an experiment on the EachMovie dataset. Our results indicate that searching the inverted file structure is many times faster than a general in-memory vector search, even for very large profiles. The Continue termination heuristic produces the best-ranked predictions in our experiments, and Quit is the top performer in terms of speed.
Methods and metrics for cold-start recommendations BIBAFull-Text 253-260
  Andrew I. Schein; Alexandrin Popescul; Lyle H. Ungar; David M. Pennock
We have developed a method for recommending items that combines content and collaborative data under a single probabilistic framework. We benchmark our algorithm against a naive Bayes classifier on the cold-start problem, where we wish to recommend items that no one in the community has yet rated. We systematically explore three testing methodologies using a publicly available data set, and explain how these methods apply to specific real-world applications. We advocate heuristic recommenders when benchmarking to give competent baseline performance. We introduce a new performance metric, the CROC curve, and demonstrate empirically that the various components of our testing strategy combine to obtain deeper understanding of the performance characteristics of recommender systems. Though the emphasis of our testing is on cold-start recommending, our methods for recommending and evaluation are general.

Arabic Information Retrieval

Term selection for searching printed Arabic BIBAFull-Text 261-268
  Kareem Darwish; Douglas W. Oard
Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
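A character n-gram indexing term of the kind compared above is cheap to produce. The sketch below (names are ours, and real systems would apply it after orthographic normalization) illustrates the idea only:

```python
def char_ngrams(word, n=3):
    """Overlapping character n-grams of a word, used as indexing terms;
    returns the word itself when it is shorter than n."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

N-grams of this kind are robust to OCR errors because a single misrecognized character corrupts only the n-grams that overlap it, leaving the rest of the word's terms intact.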
Empirical studies in strategies for Arabic retrieval BIBAFull-Text 269-274
  Jinxi Xu; Alexander Fraser; Ralph Weischedel
This work evaluates a few search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test-bed. The release by NIST in 2001 of an Arabic corpus of nearly 400k documents with both monolingual and cross-lingual queries and relevance judgments has been a new enabler for empirical studies. Experimental results show that spelling normalization and stemming can significantly improve Arabic monolingual retrieval. Character tri-grams from stems improved retrieval modestly on the test corpus, but the improvement is not statistically significant. To further improve retrieval, we propose a novel thesaurus-based technique. Unlike existing approaches to thesaurus-based retrieval, ours formulates word synonyms as probabilistic term translations that can be automatically derived from a parallel corpus. Retrieval results show that the thesaurus can significantly improve Arabic monolingual retrieval. For cross-lingual retrieval (CLIR), we found that spelling normalization and stemming have little impact.
Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis BIBAFull-Text 275-282
  Leah S. Larkey; Lisa Ballesteros; Margaret E. Connell
Arabic, a highly inflected language, requires good stemming for effective information retrieval, yet no standard approach to stemming has emerged. We developed several light stemmers based on heuristics and a statistical stemmer based on co-occurrence for Arabic retrieval. We compared the retrieval effectiveness of our stemmers and of a morphological analyzer on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer which tried to find the root for each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes which were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis.

Queries

Automatic query refinement using lexical affinities with maximal information gain BIBAFull-Text 283-290
  David Carmel; Eitan Farchi; Yael Petruschka; Aya Soffer
This work describes an automatic query refinement technique, which focuses on improving precision of the top ranked documents. The terms used for refinement are lexical affinities (LAs), pairs of closely related words which contain exactly one of the original query terms. Adding these terms to the query is equivalent to re-ranking search results, thus, precision is improved while recall is preserved. We describe a novel method that selects the most "informative" LAs for refinement, namely, those LAs that best separate relevant documents from irrelevant documents in the set of results. The information gain of candidate LAs is determined using unsupervised estimation that is based on the scoring function of the search engine. This method is thus fully automatic and its quality depends on the quality of the scoring function. Experiments we conducted with TREC data clearly show a significant improvement in the precision of the top ranked documents.
Web question answering: is more always better? BIBAFull-Text 291-298
  Susan Dumais; Michele Banko; Eric Brill; Jimmy Lin; Andrew Ng
This paper describes a question answering system that is designed to capitalize on the tremendous amount of data that is now available online. Most question answering systems use a wide variety of linguistic resources. We focus instead on the redundancy available in large corpora as an important resource. We use this redundancy to simplify the query rewrites that we need to use, and to support answer mining from returned snippets. Our system performs quite well given the simplicity of the techniques being utilized. Experimental results show that question answering accuracy can be greatly improved by analyzing more and more matching passages. Simple passage ranking and n-gram extraction techniques work well in our system making it efficient to use with many backend retrieval engines.
Predicting query performance BIBAFull-Text 299-306
  Steve Cronen-Townsend; Yun Zhou; W. Bruce Croft
We develop a method for predicting query performance by computing the relative entropy between a query language model and the corresponding collection language model. The resulting clarity score measures the coherence of the language usage in documents whose models are likely to generate the query. We suggest that clarity scores measure the ambiguity of a query with respect to a collection of documents and show that they correlate positively with average precision in a variety of TREC test sets. Thus, the clarity score may be used to identify ineffective queries, on average, without relevance information. We develop an algorithm for automatically setting the clarity score threshold between predicted poorly-performing queries and acceptable queries and validate it using TREC data. In particular, we compare the automatic thresholds to optimum thresholds and also check how frequently results as good are achieved in sampling experiments that randomly assign queries to the two classes.
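The clarity score can be written down compactly. The sketch below (our naming) assumes the query and collection language models have already been estimated, e.g. the query model from documents likely to generate the query as in the paper, and computes the relative entropy in bits:

```python
import math

def clarity(query_model, collection_model):
    """Relative entropy (KL divergence, in bits) between a query
    language model and the collection language model.
    Both arguments are dicts mapping term -> probability."""
    score = 0.0
    for term, p_q in query_model.items():
        p_c = collection_model.get(term, 1e-12)  # floor for unseen terms
        score += p_q * math.log2(p_q / p_c)
    return score
```

A query whose model concentrates on terms that are rare in the collection at large gets a high clarity score (unambiguous language use); a query model close to the collection model scores near zero, flagging a likely ineffective query.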
Using part-of-speech patterns to reduce query ambiguity BIBAFull-Text 307-314
  James Allan; Hema Raghavan
Query ambiguity is a generally recognized problem, particularly in Web environments where queries are commonly only one or two words in length. In this study, we explore one technique that finds commonly occurring patterns of parts of speech near a one-word query and allows them to be transformed into clarification questions. We use a technique derived from statistical language modeling to show that the clarification queries will reduce ambiguity much of the time, and often quite substantially.
Is natural language an inconvenience or an opportunity for IR? BIBAFull-Text 315
  Kimmo Koskenniemi
Natural language (NL) has evolved to facilitate human communication. It enables the speaker to make the listener's mind wander among her experiences and mental associations roughly according to the intentions of the speaker. The speaker and the listener usually share experiences and expectations, and they use mostly the same units and rules of a shared NL. Written language functions similarly, but in a less interactive way, with fewer possibilities for feedback.
   Both the symbols of NL (i.e. words or morphemes), and their arrangements are meaningful. Not with universal and precise meanings, but similar enough among different speakers and accurate enough for the communication mostly to succeed.
   NLs are mostly very large systems, with hundreds of thousands of words and infinitely many possible utterances. Even inflection alone can produce huge numbers of forms, e.g. more than ten thousand distinct forms for every Finnish verb entry.
   NL processing (for IR or any other purpose) must cope with phenomena like (1) inflection and compounding, (2) synonymy, (3) polysemy, (4) ambiguity, (5) anaphora and (6) head-modifier relations among words and phrases.
   Language technology can neutralize much of the effect of these 'inconveniences' inherent with NL, but what kinds of advantages could NL have?
  • Redundant use of synonymous expressions can effectively identify new
  • Multilingual parallel documents may help in identifying their exact content.
  • NLs typically carry connotations, i.e. what is implied but not explicitly
       said (e.g. attitudes, politeness).
  • Vague associations are easy to express in NL, but not always in formal
       systems (e.g. "a few years ago there was an article about the rival of
       Yeltsin -- I don't remember his name but -- he then went over to some region
       in Siberia -- but what did the guy promise?")
  • Jokes and humor belong to NLs, not to formal systems.
     Are there any alternatives for NL? Not really, because any artificial, more precise formalism fails to adapt to new concepts and does not easily allow restructuring of previous ideas.
     One challenge for language technology is to find better solutions for the above 'inconveniences' in order to provide various IR, document classification, indexing and summarizing methods with more accurate and adequate input data. With more accurate input, some of the more demanding tasks of IR can perhaps be solved.

    Evaluation

    The effect of topic set size on retrieval experiment error BIBAFull-Text 316-323
      Ellen M. Voorhees; Chris Buckley
    Retrieval mechanisms are frequently compared by computing the respective average scores for some effectiveness metric across a common set of information needs or topics, with researchers concluding one method is superior based on those averages. Since comparative retrieval system behavior is known to be highly variable across topics, good experimental design requires that a "sufficient" number of topics be used in the test. This paper uses TREC results to empirically derive error rates based on the number of topics used in a test and the observed difference in the average scores. The error rates quantify the likelihood that a different set of topics of the same size would lead to a different conclusion. We directly compute error rates for topic sets up to size 25, and extrapolate those rates for larger topic set sizes. The error rates found are larger than anticipated, indicating researchers need to take care when concluding one method is better than another, especially if few topics are used.
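The core of such an analysis, estimating how often a smaller random topic set reverses a conclusion reached on the full set, can be sketched as below. This is an illustration under our own simplifying assumptions, not the authors' exact error-rate derivation or extrapolation procedure:

```python
import random

def swap_rate(scores_a, scores_b, k, trials=1000, seed=0):
    """Fraction of random topic subsets of size k on which the sign of
    the mean score difference disagrees with the full-set sign.
    scores_a, scores_b: per-topic effectiveness values, aligned by topic."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    full_sign = sum(diffs) >= 0
    swaps = 0
    for _ in range(trials):
        sample = rng.sample(diffs, k)     # topic subset without replacement
        if (sum(sample) >= 0) != full_sign:
            swaps += 1
    return swaps / trials
```

A high swap rate for a given k signals that an experiment with only k topics could easily have ranked the two systems the other way round, which is the error the paper quantifies.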
    Liberal relevance criteria of TREC: counting on negligible documents? BIBAFull-Text 324-330
      Eero Sormunen
    Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC-8 document pools on 38 topics were reassessed. The goal of the reassessment was to build a subcollection of TREC for experiments on highly relevant documents and to learn about the assessment process as well as the characteristics of a multigraded relevance corpus.
       Relevance criteria were defined so that a distinction was made between documents rich in topical information (relevant and highly relevant documents) and poor in topical information (marginally relevant documents). It turned out that about 50% of documents assessed as relevant were regarded as marginal. The characteristics of the relevance corpus and lessons learned from the reassessment project are discussed. The need to develop more elaborated relevance assessment schemes is emphasized.

    Multimedia

    Robust temporal and spectral modeling for query by melody BIBAFull-Text 331-338
      Shai Shalev-Shwartz; Shlomo Dubnov; Nir Friedman; Yoram Singer
    Query by melody is the problem of retrieving musical performances from melodies. Retrieval of real performances is complicated by the large number of variations in performing a melody and the presence of colored accompaniment noise. We describe a simple yet effective probabilistic model for this task. We describe a generative model that is rich enough to capture the spectral and temporal variations of musical performances and allows for tractable melody retrieval. While most previous studies on music retrieval from melodies used either symbolic (e.g. MIDI) data or monophonic (single-instrument) performances, we performed experiments in retrieving live and studio recordings of operas that contain a leading vocalist and rich instrumental accompaniment. Our results show that the probabilistic approach we propose is effective and can be scaled to massive datasets.
    Video retrieval using an MPEG-7 based inference network BIBAFull-Text 339-346
      Andrew Graves; Mounia Lalmas
    This work proposes a model for video retrieval based upon the inference network model. The document network is constructed using video metadata encoded using MPEG-7 and captures information pertaining to the structural aspects (video breakdown into shots and scenes), conceptual aspects (video, scene and shot content) and contextual aspects (context information about the position of conceptual content within the document). The retrieval process a) exploits the distribution of evidence among the shots to perform ranking of different levels of granularity, b) addresses the idea that evidence may be inherited during evaluation, and c) exploits the contextual information to perform constrained queries.

    Poster session

    Using self-supervised word segmentation in Chinese information retrieval BIBAFull-Text 349-350
      Fuchun Peng; Xiangji Huang; Dale Schuurmans; Nick Cercone; Stephen E. Robertson
    We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.
    Automatic classification in product catalogs BIBAFull-Text 351-352
      Ben Wolin
    In this paper, we present the AutoCat system for product classification. AutoCat uses a vector space model, modified to consider product attributes unavailable in traditional document classification. We present key features of our user interface, developed to assist users with evaluating and editing the output of the classification algorithm. Finally, we present observations about the use of this technology in the field.
    PageRank, HITS and a unified framework for link analysis BIBAFull-Text 353-354
      Chris Ding; Xiaofeng He; Parry Husbands; Hongyuan Zha; Horst D. Simon
    Two popular link-based webpage ranking algorithms are (i) PageRank[1] and (ii) HITS (Hypertext Induced Topic Selection)[3]. HITS makes the crucial distinction of hubs and authorities and computes them in a mutually reinforcing way. PageRank considers the hyperlink weight normalization and the equilibrium distribution of random surfers as the citation score. We generalize and combine these key concepts into a unified framework, in which we prove that rankings produced by PageRank and HITS are both highly correlated with the ranking by in-degree and out-degree.
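For readers unfamiliar with the first of these, a minimal power-iteration PageRank, with out-degree (hyperlink weight) normalization and a damping factor, might look like the sketch below; the function name and details are ours, not from the poster:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank on an adjacency dict
    {node: [outlinked nodes]}; returns node -> score."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in nodes}  # random-jump component
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = d * rank[u] / len(outs)  # out-degree normalization
                for v in outs:
                    new[v] += share
            else:                                # dangling node: spread rank
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```

The poster's observation is that rankings produced this way, and by HITS, end up highly correlated with plain in-degree and out-degree rankings, which this kind of toy computation makes easy to check.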
    Task orientation in question answering BIBFull-Text 355-356
      Vanessa Murdock; W. Bruce Croft
    Experiments in high-dimensional text categorization BIBAFull-Text 357-358
      Fred J. Damerau; Tong Zhang; Sholom M. Weiss; Nitin Indurkhya
    We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results for the much smaller older Reuters collections. Detailed results are provided over time periods. It is shown that a smaller time horizon does not diminish predictive quality, implying reduced demands for retraining when sample size is large.
    The relationship between ASK and relevance criteria BIBFull-Text 359-360
      Xiao-Jun Yuan; Nicholas J. Belkin; Ja-Young Kim
    ICA and SOM in text document analysis BIBAFull-Text 361-362
      Ella Bingham; Jukka Kuusisto; Krista Lagus
    In this study we show experimental results on using Independent Component Analysis (ICA) and the Self-Organizing Map (SOM) in document analysis. Our documents are segments of spoken dialogues carried out over the telephone in a customer service, transcribed into text. The task is to analyze the topics of the discussions, and to group the discussions into meaningful subsets. The quality of the grouping is studied by comparing to a manual topical classification of the documents.
    Improving hierarchical text classification using unlabeled data BIBFull-Text 363-364
      Vijay Boyapati
    Do thumbnail previews help users make better relevance decisions about web search results? BIBAFull-Text 365-366
      Susan Dziadosz; Raman Chandrasekar
    We describe an empirical evaluation of the utility of thumbnail previews in web search results. Results pages were constructed to show text-only summaries, thumbnail previews only, or the combination of text summaries and thumbnail previews. We found that in the combination case, users were able to make more accurate decisions about the potential relevance of results than in either of the other versions, with hardly any increase in the time needed to process the page as a whole.
    Amilcare: adaptive information extraction for document annotation BIBFull-Text 367-368
      Fabio Ciravegna; Alexiei Dingli; Yorick Wilks; Daniela Petrelli
    The impact of corpus size on question answering performance BIBAFull-Text 369-370
      C. L. A. Clarke; G. V. Cormack; M. Laszlo; T. R. Lynam; E. L. Terra
    Using our question answering system, questions from the TREC 2001 evaluation were executed over a series of Web data collections, with the sizes of the collections increasing from 25 gigabytes up to nearly a terabyte.
    Effective collection metasearch in a hierarchical environment: global vs. localized retrieval performance BIBAFull-Text 371-372
      Jack G. Conrad; Changwen Yang; Joanne S. Claussen
    We compare standard global IR searching with user-centric localized techniques to address the database selection problem. We conduct a series of experiments to compare the retrieval effectiveness of three separate search modes applied to a hierarchically structured data environment of textual database representations. The data environment is represented as a tree-like directory containing over 15,000 unique databases and over 100,000 total leaf nodes. Our search modes consist of varying degrees of browse and search, from a global search at the root node to a refined search at a sub-node using dynamically-calculated inverse document frequencies (idfs) to score candidate databases for probable relevance. Our findings indicate that a browse and search approach that relies upon localized searching from sub-nodes is capable of producing the most effective results.
    Experimenting with graphical user interfaces for structured document retrieval BIBFull-Text 373-374
      Fabio Crestani; Pablo de la Fuente; Jesus Vegas
    The web retrieval task and its evaluation in the third NTCIR workshop BIBAFull-Text 375-376
      Koji Eguchi; Keizo Oyama; Emi Ishida; Kazuko Kuriyama; Noriko Kando
    This paper gives an overview of the evaluation method used for the Web Retrieval Task in the Third NTCIR Workshop, which is currently in progress. In the Web Retrieval Task, we try to assess the retrieval effectiveness of each Web search engine system using a common data set, and attempt to build a re-usable test collection suitable for evaluating Web search engine systems. With these objectives, we have built 100-gigabyte and 10-gigabyte document sets, mainly gathered from the '.jp' domain. Relevance judgment is performed on the retrieved documents, which are written in Japanese or English.
    How Many Bits are Needed to Store Term Frequencies? BIBAFull-Text 377-378
      Martin Franz; J. Scott McCarley
    Search algorithms in most current text retrieval systems use index data structures extracted from the original text documents. In this paper we focus on reducing the size of the indices by reducing the amount of space dedicated to store term frequencies. In experiments using TREC Ad Hoc [2, 3] corpora and query sets, we show that it is possible to store the term frequency in only two bits without decreasing retrieval performance.
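One plausible way to realize a two-bit term-frequency field, assumed here purely for illustration and not necessarily the paper's exact mapping, is a logarithmic bucketing of raw frequencies:

```python
def quantize_tf(tf):
    """Map a raw term frequency to a 2-bit bucket (0..3),
    roughly logarithmic: 1, 2-3, 4-7, >=8."""
    if tf <= 1:
        return 0
    if tf <= 3:
        return 1
    if tf <= 7:
        return 2
    return 3

def dequantize_tf(bucket):
    """Representative frequency for each bucket, used at scoring time."""
    return [1, 2, 5, 10][bucket]
```

Because most ranking formulas dampen term frequency anyway (e.g. logarithmically), collapsing high frequencies into a few buckets loses little scoring information, which is consistent with the paper's finding that retrieval performance does not decrease.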
    Non-linear reading for a structured web indexation BIBAFull-Text 379-380
      Mathias Gery
    The growth of the Web has posed new challenges for Information Retrieval (IR). Most current systems are based on traditional models, which were developed for atomic and independent documents and are not adapted to the Web. A promising research direction is to study the impact of Web structure on indexing. The HyperDocument model presented in this article is based on essential aspects of information comprehension: content, composition and linear/non-linear reading.
    Document normalization revisited BIBAFull-Text 381-382
      Abdur Chowdhury; M. Catherine McCabe; David Grossman; Ophir Frieder
    Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average precision by as much as 20%.
    User-centered interface design for cross-language information retrieval BIBAFull-Text 383-384
      Preben Hansen; Daniela Petrelli; Jussi Karlgren; Micheline Beaulieu; Mark Sanderson
    This paper reports on the user-centered design methodology and techniques used for the elicitation of user requirements, and on how these requirements informed the first phase of the user interface design for a Cross-Language Information Retrieval system. We describe a set of factors involved in the analysis of the data collected and, finally, discuss the implications for user interface design based on the findings.
    Implementation of relevance feedback for content-based music retrieval based on user preferences BIBFull-Text 385-386
      Keiichiro Hoashi; Erik Zeitler; Naomi Inoue
    Spatial information retrieval and geographical ontologies an overview of the SPIRIT project BIBFull-Text 387-388
      Christopher B. Jones; R. Purves; A. Ruas; M. Sanderson; M. Sester; M. van Kreveld; R. Weibel
    A visualisation tool for topic tracking analysis and development BIBAFull-Text 389-390
      Gareth J. F. Jones; Steven M. Gabb
    Topic Detection and Tracking (TDT) research explores the development of algorithms to detect novel events in online reports and track their development over time. Development of these methods requires careful evaluation and analysis. Traditional reductive methods of evaluation represent only some of the available information about algorithm behaviour. We describe a visualisation tool for topic tracking which makes it easy to analyse and compare the temporal behaviour of tracking algorithms.
    A new method of parameter estimation for multinomial naive bayes text classifiers BIBAFull-Text 391-392
      Sang-Bum Kim; Hae-Chang Rim; Heui-Seok Lim
    Multinomial naive Bayes classifiers have been widely used for probabilistic text classification. However, their parameter estimation method sometimes generates inappropriate probabilities. In this paper, we propose a topic document model approach for naive Bayes text classification, in which the parameters are estimated with an expectation over the training documents. Experiments are conducted on the Reuters-21578 and 20 Newsgroups collections, and our proposed approach obtained a significant improvement in performance over the conventional approach.
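For reference, the conventional estimation that such work sets out to improve on is the Laplace-smoothed multinomial estimate; a minimal sketch of that standard baseline (our naming, not the authors' code) is:

```python
import math
from collections import Counter

def train_mnb(docs_by_class, alpha=1.0):
    """Conventional multinomial naive Bayes estimates:
    P(w|c) = (count(w,c) + alpha) / (total(c) + alpha*|V|).
    docs_by_class: {class_label: [list of token lists]}."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        logprob = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                   for w in vocab}
        model[c] = (math.log(len(docs) / n_docs), logprob)  # (log prior, log likelihoods)
    return model

def classify(model, tokens):
    """Pick the class maximizing log prior + summed log likelihoods."""
    return max(model, key=lambda c: model[c][0] +
               sum(model[c][1].get(w, 0.0) for w in tokens))
```

The paper's point is that this estimate can assign unrepresentative probabilities, e.g. when class sizes or document lengths are skewed, motivating estimation from an expectation over training documents instead.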
    Study of category score algorithms for k-NN classifier BIBAFull-Text 393-394
      Huaizhong Kou; Georges Gardarin
    We analyze category score algorithms for the k-NN classifier found in the literature, including the majority voting algorithm (MVA) and the simple sum algorithm (SSA), the two most commonly used algorithms for estimating scores of candidate categories in k-NN classifier systems. Based on the hypothesis that exploiting the internal relation between documents and categories can improve system performance, we propose two new weighting score models: a concept-based weighting (CBW) score model and a term independence-based weighting (IBW) score model. Our experimental results confirm this hypothesis and show that, in terms of average precision, IBW and CBW are better than the other score models, while SSA outperforms MVA. According to macro-averaged F1, CBW performs best, and the Rocchio-based algorithm (RBA) always performs worst.
    Higher precision for two-word queries BIBAFull-Text 395-396
      K. L. Kwok
    Queries have specific properties, and may need individualized methods and parameters to optimize retrieval. Length is one such property. We look at how two-word queries may attain higher precision through re-ranking using word co-occurrence evidence in retrieved documents. Co-occurrence within whole-document context is not sufficient, but window context, including sentence-context evidence, can provide precision improvements of 4 to 10% in the low-recall region using initial retrieval results, and positively affects pseudo-relevance feedback.
    The boomerang effect: retrieving scientific documents via the network of references and citations BIBFull-Text 397-398
      Birger Larsen; Peter Ingwersen
    A logistic regression approach to distributed IR BIBAFull-Text 399-400
      Ray R. Larson
    This poster examines a probabilistic approach to distributed information retrieval using a Logistic Regression algorithm for estimation of collection relevance. The algorithm is compared to other methods for distributed search using test collections developed for distributed search evaluation.
    Automatic metadata generation & evaluation BIBAFull-Text 401-402
      Elizabeth D. Liddy; Eileen Allen; Sarah Harwell; Susan Corieri; Ozgur Yilmazel; N. Ercan Ozgencil; Anne Diekema; Nancy McCracken; Joanne Silverstein; Stuart Sutton
    The poster reports on a project in which we are investigating methods for breaking the human metadata-generation bottleneck that plagues Digital Libraries. The research question is whether metadata elements and values can be automatically generated from the content of educational resources, and correctly assigned to mathematics and science educational materials. Natural Language Processing and Machine Learning techniques were implemented to automatically assign values of the GEM metadata element set to learning resources provided by the Gateway for Education (GEM), a service that offers web access to a wide range of educational materials. In a user study, education professionals evaluated the metadata assigned to learning resources by either automatic tagging or manual assignment. Results show minimal difference in the eyes of the evaluators between automatically generated metadata and manually assigned metadata.
    A critical examination of TDT's cost function BIBAFull-Text 403-404
      R. Manmatha; Ao Feng; James Allan
    Topic Detection and Tracking (TDT) tasks are evaluated using a cost function. The standard TDT cost function assumes a constant probability of relevance P(rel) across all topics. In practice, P(rel) varies widely across topics. We argue using both theoretical and experimental evidence that the cost function should be modified to account for the varying P(rel).
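    The standard TDT detection cost the poster critiques has a well-known closed form; the sketch below uses the conventional cost weights and the usual fixed prior of 0.02 as illustrative assumptions, and shows how strongly the normalized cost depends on the topic's true P(rel):

```python
def detection_cost(p_miss, p_fa, p_rel, c_miss=1.0, c_fa=0.1):
    """Normalized TDT detection cost for one topic.

    p_miss, p_fa : miss and false-alarm rates of the system
    p_rel        : prior probability that a story is on-topic; the
                   standard evaluation fixes this (conventionally 0.02)
                   for every topic, which is what the poster questions.
    The cost weights c_miss/c_fa are the conventional defaults, used
    here only for illustration.
    """
    raw = c_miss * p_miss * p_rel + c_fa * p_fa * (1.0 - p_rel)
    # Normalize by the cheaper of the two trivial systems
    # ("flag everything" vs. "flag nothing").
    norm = min(c_miss * p_rel, c_fa * (1.0 - p_rel))
    return raw / norm

# The same system (same miss/false-alarm rates) is judged very
# differently depending on the topic's true P(rel):
cost_common = detection_cost(0.2, 0.01, p_rel=0.02)
cost_rare = detection_cost(0.2, 0.01, p_rel=0.001)
```

With identical error rates, the rarer topic yields a much higher normalized cost, which is the distortion the poster argues the fixed-P(rel) convention hides.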
    Converting on-line bilingual dictionaries from human-readable to machine-readable form BIBAFull-Text 405-406
      James Mayfield; Paul McNamee
    We describe a language called ABET that allows rapid conversion of on-line human-readable bilingual dictionaries to machine-readable form.
    Modeling (in)variability of human judgments for text summarization BIBAFull-Text 407-408
      Tadashi Nomoto; Yuji Matsumoto
    The paper proposes and empirically motivates an integration of supervised learning with unsupervised learning to deal with human biases in summarization. In particular, we explore the use of a probabilistic decision tree within the clustering framework to account for the variation as well as the regularity in human-created summaries.
    Content-based music indexing and organization BIBAFull-Text 409-410
      Andreas Rauber; Elias Pampalk; Dieter Merkl
    While electronic music archives are gaining popularity, access to and navigation within these archives is usually limited to text-based queries or manually predefined genre category browsing. We present a system that automatically organizes a music collection according to perceived sound similarity, resembling genres or styles of music. Audio signals are processed according to psychoacoustic models to obtain a time-invariant representation of their characteristics. Subsequent clustering provides an intuitive interface where similar pieces of music are grouped together on a map display.
    Relative and absolute term selection criteria: a comparative study for English and Japanese IR BIBFull-Text 411-412
      Tetsuya Sakai; Stephen E. Robertson
    Experiments on data fusion using headline information BIBAFull-Text 413-414
      Xiao Mang Shou; Mark Sanderson
    This poster describes initial work exploring a relatively unexamined area of data fusion: fusing the results of retrieval systems whose collections have no overlap between them. Many of the effective meta-search/data fusion strategies gain much of their success from exploiting document overlap across the source systems being merged. When the intersection of the collections is the empty set, the strategies generally degrade to a simpler form. In order to address such situations, two strategies were examined: re-ranking of merged results using a locally run search on the text fragments returned by the source search engines; and re-ranking based on cross-document similarity, again using text fragments presented in the retrieved list. Results from experiments that go beyond previous work indicate that both strategies improve fusion effectiveness.
    Building thematic lexical resources by term categorization BIBAFull-Text 415-416
      Alberto Lavelli; Bernardo Magnini; Fabrizio Sebastiani
    We discuss the automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each ci in a set C = {c1,...,cm} of themes, a sequence Li0 ⊆ Li1 ⊆ ... ⊆ Lin of lexicons, bootstrapping from an initial lexicon Li0 and a set of text corpora Θ = {θ0,...,θn-1} given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.
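    The document/term duality at the heart of term categorization can be illustrated with a toy transpose of the usual representation (an illustrative sketch, not the authors' boosting setup):

```python
# Toy corpus: each document is a bag of words.
docs = [["stock", "market", "bond", "market"],
        ["market", "trade"],
        ["gene", "protein", "gene"]]

vocab = sorted({w for d in docs for w in d})

# Text categorization represents each DOCUMENT as a vector over terms;
# term categorization (dually) represents each TERM as a vector over
# documents.  A learner is then trained to label these term vectors
# with themes, just as a text classifier labels document vectors.
term_vectors = {w: [d.count(w) for d in docs] for w in vocab}
```

Each entry of a term vector is the term's frequency in one document, so terms that co-occur in the same documents get similar vectors and tend to receive the same theme label.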
    Topic structure modeling BIBAFull-Text 417-418
      David A. Evans; James G. Shanahan; Victor Sheftel
    In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.
    Language model for IR using collection information BIBAFull-Text 419-420
      Rong Jin; Luo Si; Alex G. Hauptmann; Jamie Callan
    Information retrieval using metadata can be traced back to the early days of IR, when documents were represented by a controlled vocabulary. In this paper, we explore the use of metadata within the language modeling framework. We present a new language model that is able to take advantage of category information for documents to improve retrieval accuracy. We compare the new language model with the traditional language model on the TREC4 dataset, where the collection information for documents is obtained using the k-means clustering method. The new language model outperforms the traditional language model, which supports our claim.
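    A common way to realize this idea is to interpolate document, category (cluster), and collection term statistics; the sketch below shows one such interpolated model, with the mixture weights as illustrative assumptions rather than the authors' tuned values:

```python
import math
from collections import Counter

def cluster_lm_score(query, doc, cluster_docs, collection_docs,
                     lam_d=0.6, lam_c=0.3):
    """Query log-likelihood under a language model that interpolates
    document, cluster (category), and collection term statistics.
    A sketch of the general idea, not the authors' exact model; the
    weights lam_d/lam_c are illustrative assumptions."""
    doc_tf = Counter(doc)
    clu_tf = Counter(w for d in cluster_docs for w in d)
    col_tf = Counter(w for d in collection_docs for w in d)
    doc_len = sum(doc_tf.values())
    clu_len = sum(clu_tf.values())
    col_len = sum(col_tf.values())
    lam_bg = 1.0 - lam_d - lam_c  # weight left for collection smoothing
    score = 0.0
    for w in query:
        p = (lam_d * doc_tf[w] / doc_len
             + lam_c * clu_tf[w] / clu_len
             + lam_bg * col_tf[w] / col_len)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

A document whose cluster is about the query topic gets probability mass for query terms even when the document itself omits them, which is the advantage category information offers over plain collection smoothing.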
    Automatic evaluation of world wide web search services BIBAFull-Text 421-422
      Abdur Chowdhury; Ian Soboroff
    Users of the World-Wide Web are not only confronted by an immense overabundance of information, but also by a plethora of tools for searching for the web pages that suit their information needs. Web search engines differ widely in interface, features, coverage of the web, ranking methods, delivery of advertising, and more. In this paper, we present a method for comparing search engines automatically based on how they rank known item search results. Because the engines perform their search on overlapping (but different) subsets of the web collected at different points in time, evaluation of search engines poses significant challenges to the traditional information retrieval methodology. Our method uses known item searching, comparing the relative ranks of the items in the search engines' rankings. Our approach automatically constructs known item queries using query log analysis and automatically constructs the results via analysis of editor comments from the ODP (Open Directory Project). Additionally, we present our comparison of five well-known search services (Lycos, Netscape, Fast, Google, HotBot) and find that some services perform known item searches better than others, but the majority are statistically equivalent.
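    Mean reciprocal rank of the known items is one natural instantiation of the relative-rank comparison described above (the abstract does not commit to an exact formula):

```python
def mean_reciprocal_rank(rankings, known_items):
    """Score an engine on known-item queries by the reciprocal of the
    rank at which the single known relevant page appears (0 when the
    engine does not return it at all).

    rankings    : {query_id: [url, url, ...]} as returned by the engine
    known_items : {query_id: url}, e.g. mined from ODP editor comments

    An illustrative measure, not necessarily the one used in the paper.
    """
    total = 0.0
    for qid, ranking in rankings.items():
        target = known_items[qid]
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(rankings)
```

Because each query has exactly one known answer, no manual relevance judgments are needed, which is what makes the evaluation fully automatic.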
    Does WT10g look like the web? BIBAFull-Text 423-424
      Ian Soboroff
    We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, with common measures used in the web topology community, in order to see if WT10g "looks like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed by blindly following links. In contrast, WT10g was carved out from a larger crawl specifically to be a web search test collection within the reach of university researchers. Does such a collection retain the properties of the larger web? In the case of WT10g, yes.
    Biterm language models for document retrieval BIBFull-Text 425-426
      Munirathnam Srikanth; Rohini Srihari
    Selecting indexing strings using adaptation BIBAFull-Text 427-428
      Yoshiyuki Takeda; Kyoji Umemura
    It is not easy to tokenize agglutinative languages like Japanese and Chinese into words. Many IR systems start with a dictionary-based morphology program like ChaSen [4]. Unfortunately, dictionaries cannot cover all possible words; unknown words such as proper nouns are important for IR. This paper proposes a statistical dictionary-free method for selecting index strings based on recent work on adaptive language modeling.
    Error correction in a Chinese OCR test collection BIBAFull-Text 429-430
      Yuen-Hsien Tseng
    This article proposes a technique for correcting Chinese OCR errors to support retrieval of scanned documents. The technique uses a completely automatic technique (no manually constructed lexicons or confusion resources) to identify both keywords and confusable terms. Improved retrieval effectiveness on a single term query experiment is demonstrated.
    User interface effects in past batch versus user experiments BIBFull-Text 431-432
      Andrew Turpin; William Hersh
    K-tree/forest: efficient indexes for boolean queries BIBAFull-Text 433-434
      Rakesh M. Verma; Sanjiv Behl
    In Information Retrieval it is well-known that the complexity of processing boolean queries depends on the size of the intermediate results, which can be huge (and are typically on disk) even though the size of the final result may be quite small. In the case of inverted files the most time-consuming operation is the merging or intersection of the lists of occurrences [1]. We propose the Keyword tree (K-tree) and forest, efficient structures to handle boolean queries in keyword-based information retrieval. Extensive simulations show that the K-tree is orders of magnitude faster (i.e., far fewer I/Os) for boolean queries than the usual approach of merging the lists of occurrences, and incurs only a small overhead for single keyword queries. The K-tree can be efficiently parallelized as well. The construction cost of the K-tree is comparable to the cost of building inverted files.
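    The baseline whose cost motivates the K-tree is the standard linear merge of sorted posting lists; a minimal in-memory version for a two-term AND query:

```python
def intersect_postings(p1, p2):
    """Linear merge of two sorted posting lists of document ids.
    For a conjunctive query over many terms, repeated merges like this
    can materialize large intermediate lists even when the final answer
    is tiny -- the cost the K-tree is designed to avoid."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Each merge costs time and I/O proportional to the lengths of both input lists, not the (possibly much smaller) output, which is the inefficiency the abstract refers to.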
    Example-based phrase translation in Chinese-English CLIR BIBAFull-Text 435-436
      Bin Wang; Xueqi Cheng; Shuo Bai
    This paper proposes an example-based phrase translation method in a Chinese-to-English cross-language information retrieval (CLIR) system. The method can generate much more accurate query translations than dictionary-based and common MT-based methods, thereby improving the retrieval performance of our CLIR system.
    Probabilistic multimedia retrieval BIBAFull-Text 437-438
      Thijs Westerveld
    We present a framework in which probabilistic models for textual and visual information retrieval can be integrated seamlessly. The framework facilitates searching for imagery using textual descriptions and visual examples simultaneously. The underlying Language Models for text and Gaussian Mixture Models for images have proven successful in various retrieval tasks.
    Chinese keyword extraction based on max-duplicated strings of the documents BIBAFull-Text 439-440
      Wenfeng Yang
    Corpus analysis methods in Chinese keyword extraction treat the corpus as a single sample of a language's stochastic process. But the distributions of keywords in the whole corpus and in each document differ substantially, and extraction based only on global statistical information finds only the keywords that are significant in the whole corpus. Max-duplicated strings contain the locally significant keywords in each document. In this paper, we design an efficient algorithm that extracts the max-duplicated strings by building a PAT-tree for the document, so that keywords can be picked out from the max-duplicated strings by their SIG values in the corpus.
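    A max-duplicated string is a maximal substring that occurs more than once in the document; a PAT-tree exposes these efficiently, but the idea can be illustrated with a brute-force suffix sort (a sketch, not the paper's algorithm):

```python
def repeated_substrings(text, min_len=2):
    """Repeated substrings of `text`, found as the longest common
    prefixes of adjacent suffixes in sorted order -- the same strings
    a PAT-tree (Patricia tree over suffixes) would expose at its
    internal nodes.  This brute-force version is O(n^2 log n) and only
    illustrates the idea; it returns all adjacent-suffix LCPs, which
    include the maximal (max-duplicated) ones."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    found = set()
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k >= min_len:
            found.add(a[:k])
    return found
```

Candidate keywords would then be ranked by their corpus-level SIG values, keeping strings that repeat within the document but are distinctive relative to the whole corpus.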
    A hierarchical approach: query large music database by acoustic input BIBFull-Text 441-442
      Yazhong Feng; Yueting Zhuang; Yunhe Pan
    Correlating multilingual documents via bipartite graph modeling BIBAFull-Text 443-444
      Hongyuan Zha; Xiang Ji
    There is an enormous amount of multilingual documents from various sources, and possibly from different countries, describing a single event or a set of related events. It is desirable to construct text mining methods that can compare and highlight similarities and differences among those multilingual documents. We discuss our ongoing research, which seeks to model a pair of multilingual documents as a weighted bipartite graph with the edge weights computed by means of machine translation. We use a spectral method to identify dense subgraphs of the weighted bipartite graph, which correspond to sentences that correlate well in textual content. We illustrate our approach using English and German texts.

    Demo session

    A system using implicit feedback and top ranking sentences to help users find relevant web documents BIBAFull-Text 446
      Ryen W. White; Joemon M. Jose; Ian Ruthven
    We present a web search interface designed to encourage users to interact more fully with the results of a web search. Wrapping around a major commercial search engine, the system combines three main features: real-time query-biased web document summarisation, the presentation of sentences highly relevant to the searcher's query, and evidence captured from searcher interaction with the retrieval results.
    Indexing, searching, and retrieving of recorded live presentations with the AOF (authoring on the fly) search engine BIBAFull-Text 447
      Wolfgang Hurst
    The tremendous amount of data resulting from the regular use of tools for automatic presentation recording demands elaborate search functionality. A detailed analysis of the resulting multimedia documents is required to allow search at a fine-grained level. Unfortunately, the produced data differs significantly from traditional documents. In this demo, we discuss the problems arising in the presentation retrieval scenario and introduce aofSE, a search engine built to study and illustrate these problems as well as to develop and present corresponding solutions and new approaches for this task.
    UTACLIR: general query translation framework for several language pairs BIBFull-Text 448
      Heikki Keskustalo; Turid Hedlund; Eija Airio
    HyREX: hyper-media retrieval engine for XML BIBFull-Text 449
      Norbert Fuhr; Norbert Govert; Kai Grossjohann
    Query performance analyser: a web-based tool for IR research and instruction BIBAFull-Text 450
      Eero Sormunen; Sakari Hokkanen; Petteri Kangaslampi; Petri Pyy; Bemmu Sepponen
    The Interactive Query Performance Analyser (QPA) for information retrieval systems is a Web-based tool for analysing and comparing the performance of individual queries. On top of a standard test collection, it gives an instant visualisation of the performance achieved in a given search topic by any user-generated query. In addition to experimental IR research, QPA can be used in user training to demonstrate the characteristics of, and compare differences between, IR systems and searching strategies. The first prototype (versions 3.0 and 3.5) of the Query Performance Analyser was developed at the Department of Information Studies, University of Tampere, to serve as a tool for rapid query performance analysis, comparison and visualisation [4,5]. Later, it has been applied to interactive optimisation of queries [2,3]. The analyser has also served in learning environments for IR [1].
       The demonstration is based on the newest version of the Query Performance Analyser (v. 5.1). It is interfaced to a traditional Boolean IR system (TRIP) and a probabilistic IR system (Inquery) providing access to the TREC collection and two Finnish test collections. Version 5.1 supports multigraded relevance scales, new types of performance visualisations, and query conversions based on mono- and multi-lingual dictionaries. The motivation in developing the analyser is to emphasise the necessity of analysing the behaviour of individual queries. Information retrieval experiments usually measure the average effectiveness of IR methods developed. The analysis of individual queries is neglected although test results may contain individual test topics where general findings do not hold. For the real user of an IR system, the study of variation in results is even more important than averages.
    Adaptive information extraction for document annotation in amilcare BIBAFull-Text 451
      Fabio Ciravegna; Alexiei Dingli; Yorick Wilks; Daniela Petrelli
    Amilcare is a tool for Adaptive Information Extraction (IE) designed for supporting active annotation of documents for the Semantic Web (SW). It can be used either for unsupervised document annotation or as a support for human annotation. Amilcare is portable to new applications/domains without any knowledge of IE, as it just requires users to annotate a small training corpus with the information to be extracted. It is based on (LP)2, a supervised learning strategy for IE able to cope with different text types, from newspaper-like texts to rigidly formatted Web pages and even a mixture of them [1][5].
       Adaptation starts with the definition of a tag set for annotation, possibly organized as an ontology. Then users have to manually annotate a small training corpus. Amilcare provides a default mouse-based interface called Melita, where annotations are inserted by first selecting a tag from the ontology and then identifying the text area to annotate with the mouse. Unlike similar annotation tools [4, 5], Melita actively supports training corpus annotation. While users annotate texts, Amilcare runs in the background, learning how to reproduce the inserted annotation. Induced rules are silently applied to new texts and their results are compared with the user annotation. When its rules reach a (user-defined) level of accuracy, Melita presents new texts with a preliminary annotation derived from the rule application. In this case users just have to correct mistakes and add missing annotations. User corrections are fed back to the learner for retraining. This technique focuses the slow and expensive user activity on uncovered cases, avoiding the need to annotate cases where satisfactory effectiveness has already been reached. Moreover, validating extracted information is a much simpler task than tagging bare texts (and also less error-prone), speeding up the process considerably. At the end of the corpus annotation process, the system is trained and the application can be delivered. MnM [6] and Ontomat annotizer [7] are two annotation tools adopting Amilcare's learner.
       In this demo we simulate the annotation of a small corpus and show how and when Amilcare is able to support users in the annotation process, focusing on the ways the user can control the tool's proactivity and intrusiveness. We will also quantify this support with data derived from a number of experiments on corpora. We will focus on training corpus size and the correctness of suggestions as the corpus is increased.
    ExWrap: semi-automatic wrapper generation by example BIBFull-Text 452
      Bethina Schmitt; Michael Christoffel; Jurgen Schneider
    Souvenir: flexible note-taking tool to pinpoint and share media highlights BIBAFull-Text 453
      Anselm Spoerri
    Digital audio/video media can be difficult to search and share in a personal way. Souvenir is a software system that offers users a flexible and comprehensive way to use their handwritten or text notes to retrieve and share specific media moments. Users can take notes on a variety of devices, such as the paper-based CrossPad, the Palm Pilot and standard keyboard devices. Souvenir segments handwritten notes into an effective media index without the need for handwriting recognition. Users can use their notes to create hyperlinks to random-access media stored in a digital library. Souvenir also has web publishing and email capabilities to enable anyone to access or email media moments directly from a web page. Souvenir annotations capture information that cannot be easily inferred by automatic media indexing tools.
    Hierarchical approach to term suggestion device BIBAFull-Text 454
      Hideo Joho; Mark Sanderson; Micheline Beaulieu
    Our demonstration shows the hierarchy system working on a locally run search engine. Hierarchies are dynamically generated from the retrieved documents, and visualised on the menus. When a user selects a term from the hierarchy, the documents linked to the term are listed, and the term is then added to the initial query to rerun a search. Through the demonstration we illustrate how hierarchical presentation of expansion terms is achieved, and how our approach supports users to articulate their information needs using the hierarchy.
    Translingual vocabulary mappings for multilingual information access BIBFull-Text 455
      Fredric C. Gey; Aitao Chen; Michael Buckland; Ray Larson
    GS textplorer: adaptive framework for information retrieval BIBFull-Text 456
      Jukka Honkela; Ville H. Tuulos
    CuTeX: a system for extracting data from text tables BIBAFull-Text 457
      Hasan Davulcu; Saikat Mukherjee; Arvind Seth; I. V. Ramakrishnan
    A wealth of information relevant for e-commerce often appears in text form. This includes specification and performance data sheets of products, financial statements, product offerings, etc. Typically these types of product and financial data are published in tabular form. The only separators between items in the table are white spaces and line separators. We will refer to such tables as text tables. Due to the lack of structure in such tables, the information present is not readily queriable using traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items in the tables and create a database out of the extracted data. But extraction from text tables poses difficulties due to the irregularity of the data in the columns.
    YellowPager: a tool for ontology-based mining of service directories from web sources BIBAFull-Text 458
      Prashant Choudhari; Hasan Davulcu; Abhishek Joglekar; Akshay More; Saikat Mukherjee; Supriya Patil; I. V. Ramakrishnan
    The web has established itself as the dominant medium for doing electronic commerce. Realizing that its global reach provides significant market and business opportunities, service providers both large and small are advertising their services on the web. A number of them operate their own web sites promoting their services at length while others are merely listed in a referral site. Aggregating all of the providers into a queriable service directory makes it easy for customers to locate the one most suited to their needs.
       YellowPager is a tool for creating service directories by mining web sources. Service directories created by YellowPager have several merits compared to those generated by existing practices, which typically require participation by service providers (e.g. Verizon's SuperYellowPages.com). Firstly, the information content will be rich. Secondly, since the process is automated and repeatable, the content can always be kept current. Finally, the same process can be readily adapted to different domains.
       YellowPager builds service directories by mining the web through a combination of keyword-based search engines, web agents, text classifiers and novel extraction algorithms.
       The extraction is driven by a services ontology consisting of a taxonomy of service concepts and their associated attributes (such as names and addresses) and type descriptions for the attributes. In addition the ontology also associates an extractor function with each attribute. Applying the function to a web page will identify all the occurrences of the attribute in that page.
       YellowPager's mining algorithm consists of a training step followed by classification and extraction steps. In the training step a classifier is trained to identify web pages relevant to the service of interest. The classification step proceeds by searching for the particular service of interest using a keyword-based web search engine and retrieving all the matching web pages. From these pages the relevant ones are identified using the classifier. The final step is extraction of attribute values associated with the service from these pages. Each web page is parsed into a DOM tree and the extractor functions are applied. All of the attributes corresponding to a service provider are then correctly aggregated. This can pose difficulties, especially in the presence of multiple service providers in a page. Using a novel concept of scoring and conflict resolution to prevent erroneous associations of attributes with service provider entities in the page, the algorithm aggregates all the attribute occurrences correctly. The extractor function may not be complete, in the sense that it cannot always identify all the attributes in a page. By exploiting the regularity of the sequence in which attributes occur in referral pages, the mining algorithm automatically learns generalized patterns to locate attributes that the extractor function misses. The distinguishing aspects of YellowPager's extraction algorithm are: (i) it is unsupervised, and (ii) the attribute values in the pages are extracted independent of any page-specific relationships that may exist among the markup tags.
       YellowPager has been used by a large pet food producer to build a directory of veterinarian service providers in the United States. The resulting database was found to be much larger and richer than that found in Vetquest, Vetworld, and the Super Yellow pages.
       YellowPager is implemented in JAVA and is interfaced to Rainbow, a library utility in C that is used for classification. The tool will demonstrate the creation of a service directory for any service domain by mining web sources.