HCI Bibliography Home | HCI Conferences | IR Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
IR Tables of Contents: 959697989900010203040506070809101112131415

Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname:Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors:Gary Marchionini; Alistair Moffat; John Tait; Ricardo Baeza-Yates; Nivio Ziviani
Location:Salvador, Brazil
Dates:2005-Aug-15 to 2005-Aug-19
Standard No:ISBN 1-59593-034-5; ACM Order Number: 534052; ACM DL: Table of Contents hcibib: IR05
  1. Keynote
  2. Theory 1
  3. Relevance feedback
  4. Distributed
  5. Filtering
  6. Categorization and classification
  7. Evaluation
  8. Web search 1
  9. Summarization
  10. Keynote
  11. Efficiency
  12. Categorization and supervised machine learning
  13. Theory 2
  14. Structured data
  15. NLP
  16. Multimedia
  17. Question answering
  18. Web search 2
  19. Keynote
  20. User studies
  21. Theory 3
  22. Web search 3
  23. Cross-language
  24. Video and image
  25. Posters
  26. Demos


The Portinari project: IR helps art and culture BIBAFull-Text 1-2
  Joao Candido Portinari
In May 30, 1983, The New York Times published the article "Brazil Gathers Archive On Its Painter, Portinari". The author, Warren Hoge, narrates: "/The late Candido Portinari is considered here to be the greatest artist Brazil has ever produced, yet all but a few of his 4,000 paintings are out of public view. They have become dispersed in private collections in so many places that his biographer compared their fate to that of Brazil's 18th-century revolutionary hero Tiradentes, whose body was dismembered and strewn along a 300-mile turnpike. The inaccessibility of Portinari's work is particularly vexing to his enthusiasts because his own dedication to producing an epic view of Brazil for his countrymen was such that he continued painting even after doctors warned that exposure to paint was killing him. He died of lead poisoning at the age of 58. Now, in a pioneering effort for Latin America, a team of experts in Rio is busy assembling the far-flung pieces of Portinari's obra into an exhaustive computerized archive. "/We are trying to rescue what is authentically ours'/, said Jose Candido Portinari, the painter's 44-year-old son, who is the coordinator of the group of researchers who make their headquarters on the leafy campus of Rio's Pontifical Catholic University. A telecommunications engineer with a Ph.D. from the Massachusetts Institute of Technology and a former chairman of the university's mathematics department, Mr. Portinari has brought exacting technical standards to the task. Now four years into the project, the 14-member team has compiled photographic, technical.

Theory 1

Orthogonal locality preserving indexing BIBAFull-Text 3-10
  Deng Cai; Xiaofei He
We consider the problem of document indexing and representation. Recently, Locality Preserving Indexing (LPI) was proposed for learning a compact document subspace. Different from Latent Semantic Indexing which is optimal in the sense of global Euclidean structure, LPI is optimal in the sense of local manifold structure. However, LPI is extremely sensitive to the number of dimensions. This makes it difficult to estimate the intrinsic dimensionality, while inaccurately estimated dimensionality would drastically degrade its performance. One reason leading to this problem is that LPI is non-orthogonal. Non-orthogonality distorts the metric structure of the document space. In this paper, we propose a new algorithm called Orthogonal LPI. Orthogonal LPI iteratively computes the mutually orthogonal basis functions which respect the local geometrical structure. Moreover, our empirical study shows that OLPI can have more locality preserving power than LPI. We compare the new algorithm to LSI and LPI. Extensive experimental results show that Orthogonal LPI obtains better performance than both LSI and LPI. More crucially, it is insensitive to the number of dimensions, which makes it an efficient data preprocessing method for text clustering, classification, retrieval, etc.
Why spectral retrieval works BIBAFull-Text 11-18
  Holger Bast; Debapriyo Majumdar
We argue that the ability to identify pairs of related terms is at the heart of what makes spectral retrieval work in practice. Schemes such as latent semantic indexing (LSI) and its descendants have this ability in the sense that they can be viewed as computing a matrix of term-term relatedness scores which is then used to expand the given documents (not the queries). For almost all existing spectral retrieval schemes, this matrix of relatedness scores depends on a fixed low-dimensional subspace of the original term space. We instead vary the dimension and study for each term pair the resulting curve of relatedness scores. We find that it is actually the shape of this curve which is indicative for the term-pair relatedness, and not any of the individual relatedness scores on the curve. We derive two simple, parameterless algorithms that detect this shape and that consistently outperform previous methods on a number of test collections. Our curves also shed light on the effectiveness of three fundamental types of variations of the basic LSI scheme.
Better than the real thing?: iterative pseudo-query processing using cluster-based language models BIBAFull-Text 19-26
  Oren Kurland; Lillian Lee; Carmel Domshlak
We present a novel approach to pseudo-feedback-based ad hoc retrieval that uses language models induced from both documents and clusters. First, we treat the pseudo-feedback documents produced in response to the original query as a set of pseudo-query that themselves can serve as input to the retrieval process. Observing that the documents returned in response to the pseudo-query can then act as pseudo-query for subsequent rounds, we arrive at a formulation of pseudo-query-based retrieval as an iterative process. Experiments show that several concrete instantiations of this idea, when applied in conjunction with techniques designed to heighten precision, yield performance results rivaling those of a number of previously-proposed algorithms, including the standard language-modeling approach. The use of cluster-based language models is a key contributing factor to our algorithms' success.
The maximum entropy method for analyzing retrieval measures BIBAFull-Text 27-34
  Javed A. Aslam; Emine Yilmaz; Virgiliu Pavlu
We present a model, based on the maximum entropy method, for analyzing various measures of retrieval performance such as average precision, R-precision, and precision-at-cutoffs. Our methodology treats the value of such a measure as a constraint on the distribution of relevant documents in an unknown list, and the maximum entropy distribution can be determined subject to these constraints. For good measures of overall performance (such as average precision), the resulting maximum entropy distributions are highly correlated with actual distributions of relevant documents in lists as demonstrated through TREC data; for poor measures of overall performance, the correlation is weaker. As such, the maximum entropy method can be used to quantify the overall quality of a retrieval measure. Furthermore, for good measures of overall performance (such as average precision), we show that the corresponding maximum entropy distributions can be used to accurately infer precision-recall curves and the values of other measures of performance, and we demonstrate that the quality of these inferences far exceeds that predicted by simple retrieval measure correlation, as demonstrated through TREC data.

Relevance feedback

A study of factors affecting the utility of implicit relevance feedback BIBAFull-Text 35-42
  Ryen W. White; Ian Ruthven; Joemon M. Jose
Implicit relevance feedback (IRF) is the process by which a search system unobtrusively gathers evidence on searcher interests from their interaction with the system. IRF is a new method of gathering information on user interest and, if IRF is to be used in operational IR systems, it is important to establish when it performs well and when it performs poorly. In this paper we investigate how the use and effectiveness of IRF is affected by three factors: search task complexity, the search experience of the user and the stage in the search. Our findings suggest that all three of these factors contribute to the utility of IRF.
Context-sensitive information retrieval using implicit feedback BIBAFull-Text 43-50
  Xuehua Shen; Bin Tan; ChengXiang Zhai
A major limitation of most existing retrieval models and systems is that the retrieval decision is made based solely on the query and document collection; information about the actual user and search context is largely ignored. In this paper, we study how to exploit implicit feedback information, including previous queries and clickthrough information, to improve retrieval accuracy in an interactive information retrieval setting. We propose several context-sensitive retrieval algorithms based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents. We use the TREC AP data to create a test collection with search context information, and quantitatively evaluate our models using this test set. Experiment results show that using implicit feedback, especially the clicked document summaries, can improve retrieval performance substantially.
User term feedback in interactive text-based image retrieval BIBAFull-Text 51-58
  Chen Zhang; Joyce Y. Chai; Rong Jin
To alleviate the vocabulary problem, this paper investigates the role of user term feedback in interactive text-based image retrieval. Term feedback refers to the feedback from a user on specific terms regarding their relevance to a target image. Previous studies have indicated the effectiveness of term feedback in interactive text retrieval [14]. However, the term feedback has not shown to be effective in our experiments on text-based image retrieval. Our results indicate that, although term feedback has a positive effect by allowing users to identify more relevant terms, it also has a strong negative effect by providing more opportunities for users to specify irrelevant terms. To understand these different effects and their implications on the potential of term feedback, this paper further presents analysis of important factors that contribute to the utility of term feedback and discusses the outlook of term feedback in interactive text-based image retrieval.
Active feedback in ad hoc information retrieval BIBAFull-Text 59-66
  Xuehua Shen; ChengXiang Zhai
Information retrieval is, in general, an iterative search process, in which the user often has several interactions with a retrieval system for an information need. The retrieval system can actively probe a user with questions to clarify the information need instead of just passively responding to user queries. A basic question is thus how a retrieval system should propose questions to the user so that it can obtain maximum benefits from the feedback on these questions. In this paper, we study how a retrieval system can perform active feedback, i.e., how to choose documents for relevance feedback so that the system can learn most from the feedback information. We present a general framework for such an active feedback problem, and derive several practical algorithms as special cases. Empirical evaluation of these algorithms shows that the performance of traditional relevance feedback (presenting the top K documents) is consistently worse than that of presenting documents with more diversity. With a diversity-based selection algorithm, we obtain fewer relevant documents, however, these fewer documents have more learning benefits.


Improving collection selection with overlap awareness in P2P search engines BIBAFull-Text 67-74
  Matthias Bender; Sebastian Michel; Peter Triantafillou; Gerhard Weikum; Christian Zimmer
Collection selection has been a research issue for years. Typically, in related work, precomputed statistics are employed in order to estimate the expected result quality of each collection, and subsequently the collections are ranked accordingly. Our thesis is that this simple approach is insufficient for several applications in which the collections typically overlap. This is the case, for example, for the collections built by autonomous peers crawling the web. We argue for the extension of existing quality measures using estimators of mutual overlap among collections and present experiments in which this combination outperforms CORI, a popular approach based on quality estimation. We outline our prototype implementation of a P2P web search engine, coined MINERVA, that allows handling large amounts of data in a distributed and self-organizing manner. We conduct experiments which show that taking overlap into account during collection selection can drastically decrease the number of collections that have to be contacted in order to reach a satisfactory level of recall, which is a great step toward the feasibility of distributed web search.
Server selection methods in hybrid portal search BIBAFull-Text 75-82
  David Hawking; Paul Thomas
The TREC.GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a.gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.
Modeling search engine effectiveness for federated search BIBAFull-Text 83-90
  Luo Si; Jamie Callan
Federated search links multiple search engines into a single, virtual search system. Most prior research of federated search focused on selecting search engines that have the most relevant contents, but ignored the retrieval effectiveness of individual search engines. This omission can cause serious problems when federating search engines of different qualities.
   This paper proposes a federated search technique that uses utility maximization to model the retrieval effectiveness of each search engine in a federated search environment. The new algorithm ranks the available resources by explicitly estimating the amount of relevant material that each resource can return, instead of the amount of relevant material that each resource contains. An extensive set of experiments demonstrates the effectiveness of the new algorithm.
A utility theoretic approach to determining optimal wait times in distributed information retrieval BIBAFull-Text 91-97
  Kartik Hosanagar
Distributed IR systems query a large number of IR servers, merge the retrieved results and display them to users. Since different servers handle collections of different sizes, have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the distributed IR system thus has to make decisions regarding terminating searches based on perceived value of waiting -- retrieving more documents -- and the costs imposed on users by waiting for more responses. In this paper, we apply utility theory to formulate the broker's decision problem. The problem is a stochastic nonlinear program. We use Monte Carlo simulations to demonstrate how the optimal wait time may be determined in the context of a comparison shopping engine that queries multiple store websites for price and product information. We use data gathered from 30 stores for a set of 60 books. Our research demonstrates how a broker can leverage information about past retrievals regarding distributions of server response time and relevance scores to optimize its performance. Our main contribution is the formulation of the decision model for optimal wait time and proposal of a solution method. Our results suggest that the optimal wait time is highly sensitive to the manner in which users value from a set of retrieved results differs from the sum of user value from each result evaluated independently. We also find that the optimal wait time increases with the size of the distributed collections, but only if user utility from a set of results is nearly equal to the sum of utilities from each result.


Robustness of adaptive filtering methods in a cross-benchmark evaluation BIBAFull-Text 98-105
  Yiming Yang; Shinjae Yoo; Jian Zhang; Bryan Kisiel
This paper reports a cross-benchmark evaluation of regularized logistic regression (LR) and incremental Rocchio for adaptive filtering. Using four corpora from the Topic Detection and Tracking (TDT) forum and the Text Retrieval Conferences (TREC) we evaluated these methods with non-stationary topics at various granularity levels, and measured performance with different utility settings. We found that LR performs strongly and robustly in optimizing T11SU (a TREC utility function) while Rocchio is better for optimizing Ctrk (the TDT tracking cost), a high-recall oriented objective function. Using systematic cross-corpus parameter optimization with both methods, we obtained the best results ever reported on TDT5, TREC10 and TREC11. Relevance feedback on a small portion (0.05~0.2%) of the TDT5 test documents yielded significant performance improvements, measuring up to a 54% reduction in Ctrk and a 20.9% increase in T11SU (with b=0.1), compared to the results of the top-performing system in TDT2004 without relevance feedback information.
A probabilistic model for retrospective news event detection BIBAFull-Text 106-113
  Zhiwei Li; Bin Wang; Mingjing Li; Wei-Ying Ma
Retrospective news event detection (RED) is defined as the discovery of previously unidentified events in historical news corpus. Although both the contents and time information of news articles are helpful to RED, most researches focus on the utilization of the contents of news articles. Few research works have been carried out on finding better usages of time information. In this paper, we do some explorations on both directions based on the following two characteristics of news articles. On the one hand, news articles are always aroused by events; on the other hand, similar articles reporting the same event often redundantly appear on many news sources. The former hints a generative model of news articles, and the latter provides data enriched environments to perform RED. With consideration of these characteristics, we propose a probabilistic model to incorporate both content and time information in a unified framework. This model gives new representations of both news articles and news events. Furthermore, based on this approach, we build an interactive RED system, HISCOVERY, which provides additional functions to present events, Photo Story and Chronicle.
Scalable collaborative filtering using cluster-based smoothing BIBAFull-Text 114-121
  Gui-Rong Xue; Chenxi Lin; Qiang Yang; WenSi Xi; Hua-Jun Zeng; Yong Yu; Zheng Chen
Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.

Categorization and classification

OCFS: optimal orthogonal centroid feature selection for text categorization BIBAFull-Text 122-129
  Jun Yan; Ning Liu; Benyu Zhang; Shuicheng Yan; Zheng Chen; Qiansheng Cheng; Weiguo Fan; Wei-Ying Ma
Text categorization is an important research area in many Information Retrieval (IR) applications. To save the storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques such as Information Gain (IG) and?2-test (CHI) are all greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may be deteriorated when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm by optimizing the objective function of Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space, called Orthogonal Centroid Feature Selection (OCFS). Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI with smaller computation time especially when the reduced dimension is extremely small.
SimFusion: measuring similarity using unified relationship matrix BIBAFull-Text 130-137
  Wensi Xi; Edward A. Fox; Weiguo Fan; Benyu Zhang; Zheng Chen; Jun Yan; Dong Zhuang
In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, thus, can improve the quality of information applications that require combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content based algorithms and the cutting edge SimRank algorithm.
An application of text categorization methods to gene ontology annotation BIBAFull-Text 138-145
  Kazuhiro Seki; Javed Mostafa
This paper describes an application of IR and text categorization methods to a highly practical problem in biomedicine, specifically, Gene Ontology (GO) annotation. GO annotation is a major activity in most model organism database projects and annotates gene functions using a controlled vocabulary. As a first step toward automatic GO annotation, we aim to assign GO domain codes given a specific gene and an article in which the gene appears, which is one of the task challenges at the TREC 2004 Genomics Track. We approached the task with careful consideration of the specialized terminology and paid special attention to dealing with various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extracted the words around the gene occurrences and used them to represent the gene for GO domain code annotation. As a classifier, we adopted a variant of k-Nearest Neighbor (kNN) with supervised term weighting schemes to improve the performance, making our method among the top-performing systems in the TREC official evaluation. Moreover, it is demonstrated that our proposed framework is successfully applied to another task of the Genomics Track, showing comparable results to the best performing system.


Combining eye movements and collaborative filtering for proactive information retrieval BIBAFull-Text 146-153
  Kai Puolamaki; Jarkko Salojarvi; Eerika Savia; Jaana Simola; Samuel Kaski
We study a new task, proactive information retrieval by combining implicit relevance feedback and collaborative filtering. We have constructed a controlled experimental setting, a prototype application, in which the users try to find interesting scientific articles by browsing their titles. Implicit feedback is inferred from eye movement signals, with discriminative hidden Markov models estimated from existing data in which explicit relevance feedback is available. Collaborative filtering is carried out using the User Rating Profile model, a state-of-the-art probabilistic latent variable model, computed using Markov Chain Monte Carlo techniques. For new document titles the prediction accuracy with eye movements, collaborative filtering, and their combination was significantly better than by chance. The best prediction accuracy still leaves room for improvement but shows that proactive information retrieval and combination of many sources of relevance feedback is feasible.
Accurately interpreting clickthrough data as implicit feedback BIBAFull-Text 154-161
  Thorsten Joachims; Laura Granka; Bing Pan; Helene Hembrooke; Geri Gay
This paper examines the reliability of implicit feedback generated from clickthrough data in WWW search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average.
Information retrieval system evaluation: effort, sensitivity, and reliability BIBAFull-Text 162-169
  Mark Sanderson; Justin Zobel
The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.

Web search 1

Detecting phrase-level duplication on the world wide web BIBAFull-Text 170-177
  Dennis Fetterly; Mark Manasse; Marc Najork
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated "spam" web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such "slice and dice" generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
Using ODP metadata to personalize search BIBAFull-Text 178-185
  Paul Alexandru Chirita; Wolfgang Nejdl; Raluca Paiu; Christian Kohlschutter
The Open Directory Project is clearly one of the largest collaborative efforts to manually annotate web pages. This effort involves over 65,000 editors and resulted in metadata specifying topic and importance for more than 4 million web pages. Still, given that this number is just about 0.05 percent of the Web pages indexed by Google, is this effort enough to make a difference? In this paper we discuss how these metadata can be exploited to achieve high quality personalized web search. First, we address this by introducing an additional criterion for web page ranking, namely the distance between a user profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search. We empirically show that this enhancement yields better results than current web search using Google. Then, in the second part of the paper, we investigate the boundaries of biasing PageRank on subtopics of the ODP in order to automatically extend these metadata to the whole web.
Exploiting the hierarchical structure for link analysis BIBAFull-Text 186-193
  Gui-Rong Xue; Qiang Yang; Hua-Jun Zeng; Yong Yu; Zheng Chen
Link analysis algorithms have been extensively used in Web information retrieval. However, current link analysis algorithms generally work on a flat link graph, ignoring the hierarchal structure of the Web graph. They often suffer from two problems: the sparsity of link graph and biased ranking of newly-emerging pages. In this paper, we propose a novel ranking algorithm called Hierarchical Rank as a solution to these two problems, which considers both the hierarchical structure and the link structure of the Web. In this algorithm, Web pages are first aggregated based on their hierarchical structure at directory, host or domain level and link analysis is performed on the aggregated graph. Then, the importance of each node on the aggregated graph is distributed to individual pages belong to the node based on the hierarchical structure. This algorithm allows the importance of linked Web pages to be distributed in the Web page space even when the space is sparse and contains new pages. Experimental results on the .GOV collection of TREC 2003 and 2004 show that hierarchical ranking algorithm consistently outperforms other well-known ranking algorithms, including the PageRank, BlockRank and LayerRank. In addition, experimental results show that link aggregation at the host level is much better than link aggregation at either the domain or directory levels.


Web-page summarization using clickthrough data BIBAFull-Text 194-201
  Jian-Tao Sun; Dou Shen; Hua-Jun Zeng; Qiang Yang; Yuchang Lu; Zheng Chen
Most previous Web-page summarization methods treat a Web page as plain text. However, such methods fail to uncover the full knowledge associated with a Web page needed in building a high-quality summary, because many of these methods do not consider the hidden relationships in the Web. Uncovering the hidden knowledge is important in building good Web-page summarizers. In this paper, we extract the extra knowledge from the clickthrough data of a Web search engine to improve Web-page summarization. Wefirst analyze the feasibility in utilizing the clickthrough data to enhance Web-page summarization and then propose two adapted summarization methods that take advantage of the relationships discovered from the clickthrough data. For those pages that are not covered by the clickthrough data, we design a thematic lexicon approach to generate implicit knowledge for them. Our methods are evaluated on a dataset consisting of manually annotated pages as well as a large dataset that is crawled from the Open Directory Project website. The experimental results indicate that significant improvements can be achieved through our proposed summarizer as compared to the summarizers that do not use the clickthrough data.
Topic themes for multi-document summarization BIBAFull-Text 202-209
  Sanda Harabagiu; Finley Lacatusu
The problem of using topic representations for multi-document summarization (MDS) has received considerable attention recently. In this paper, we describe five different topic representations and introduce a novel representation of topics based on topic themes. We present eight different methods of generating MDS and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.
Do summaries help? BIBAFull-Text 210-217
  Kathleen McKeown; Rebecca J. Passonneau; David K. Elson; Ani Nenkova; Julia Hirschberg
We describe a task-based evaluation to determine whether multi-document summaries measurably improve user performance when using online news browsing systems for directed research. We evaluated the multi-document summaries generated by Newsblaster, a robust news browsing system that clusters online news articles and summarizes multiple articles on each event. Four groups of subjects were asked to perform the same time-restricted fact-gathering tasks, reading news under different conditions: no summaries at all, single sentence summaries drawn from one of the articles, Newsblaster multi-document summaries, and human summaries. Our results show that, in comparison to source documents only, the quality of reports assembled using Newsblaster summaries was significantly better and user satisfaction was higher with both Newsblaster and human summaries.


The future of media, blogs and innovation: new IR challenges? BIBAFull-Text 218
  Fernando Flores
An axiom of every good investor is not to buy shares when the goodness of them is already into newspapers. Before, the information was circulating in some form, for example from mouth-to-mouth, closed circles, or newsletters. Nowadays the news can also occur in blogs that point to public or private communities that discuss topics that traditional media do not carry or even hide. Nowadays, standard communication media are trapped in a Cartesian or Platonic correspondence assumption. They want to tell us how things really are, how they have occurred, and how they will happen, disregarding a concrete world of problems where opportunities and threats live in real time for people. Searching and exploring the world of blogs can create an acceleration of innovation and a dissolution of the previous status quo. Here, the search unit is not a word, but actions, worries, opportunities, threats, etc. That is, people living and pursuing shared goals with others. Which new searching tools can help to find trends, innovations and ideas taking consciousness in the context described above? Can IR help to end with this illusion of pseudo-objectivity and manipulation of passive individuals.


Optimization strategies for complex queries BIBAFull-Text 219-225
  Trevor Strohman; Howard Turtle; W. Bruce Croft
Previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. However, the literature for optimizing document-at-a-time systems remains sparse.
   We present an improvement to the max_score optimization, which is the most efficient known document-at-a-time scoring method. Like max_score, our technique, called term bounded max_score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. We simulated our technique to explore the problem space, then implemented it in Indri, our large scale language modeling search engine. Tests with the GOV2 corpus on title queries show our method to be 23% faster than max_score alone, and 61% faster than our document-at-a-time baseline. Our optimized query times are competitive with conventional term-at-a-time systems on this year's TREC Terabyte task.
Simplified similarity scoring using term ranks BIBAFull-Text 226-233
  Vo Ngoc Anh; Alistair Moffat
We propose a method for document ranking that combines a simple document-centric view of text, and fast evaluation strategies that have been developed in connection with the vector space model. The new method defines the importance of a term within a document qualitatively rather than quantitatively, and in doing so reduces the need for tuning parameters. In addition, the method supports very fast query processing, with most of the computation carried out on small integers, and dynamic pruning an effective option. Experiments on a wide range of TREC data show that the new method provides retrieval effectiveness as good as or better than the Okapi BM25 formulation, and variants of language models.
Efficiently decodable and searchable natural language adaptive compression BIBAFull-Text 234-241
  Nieves R. Brisaboa; Antonio Farina; Gonzalo Navarro; Jose R. Parama
We address the problem of adaptive compression of natural language text, focusing on the case where low bandwidth is available and the receiver has little processing power, as in mobile applications. Our technique achieves compression ratios around 32% and requires very little effort from the receiver. This tradeoff, not previously achieved with alternative techniques, is obtained by breaking the usual symmetry between sender and receiver dominant in statistical adaptive compression. Moreover, we show that our technique can be adapted to avoid decompression at all in cases where the receiver only wants to detect the presence of some keywords in the document. This is useful in scenarios such as selective dissemination of information, news clipping, alert systems, text categorization, and clustering. Thanks to the asymmetry we introduce, the receiver can search the compressed text much faster than the plain text. This was previously achieved only in semistatic compression scenarios.
Efficient and self-tuning incremental query expansion for top-k query processing BIBAFull-Text 242-249
  Martin Theobald; Ralf Schenkel; Gerhard Weikum
We present a novel approach for efficient and self-tuning query expansion that is embedded into a top-k query processor with candidate pruning. Traditional query expansion methods select expansion terms whose thematic similarity to the original query terms is above some specified threshold, thus generating a disjunctive query with much higher dimensionality. This poses three major problems: 1) the need for hand-tuning the expansion threshold, 2) the potential topic dilution with overly aggressive expansion, and 3) the drastically increased execution cost of a high-dimensional query. The method developed in this paper addresses all three problems by dynamically and incrementally merging the inverted lists for the potential expansion terms with the lists for the original query terms. A priority queue is used for maintaining result candidates, the pruning of candidates is based on Fagin's family of top-k algorithms, and optionally probabilistic estimators of candidate scores can be used for additional pruning. Experiments on the TREC collections for the 2004 Robust and Terabyte tracks demonstrate the increased efficiency, effectiveness, and scalability of our approach.

Categorization and supervised machine learning

Title extraction from bodies of HTML documents and its application to web page retrieval BIBAFull-Text 250-257
  Yunhua Hu; Guomao Xin; Ruihua Song; Guoping Hu; Shuming Shi; Yunbo Cao; Hang Li
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in largest font size as title (20.9%-32.6% improvement in F1 score). As application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML documents retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1% -29.0% improvements).
Multi-label informed latent semantic indexing BIBAFull-Text 258-265
  Kai Yu; Shipeng Yu; Volker Tresp
Latent semantic indexing (LSI) is a well-known unsupervised approach for dimensionality reduction in information retrieval. However if the output information (i.e. category labels) is available, it is often beneficial to derive the indexing not only based on the inputs but also on the target values in the training data set. This is of particular importance in applications with multiple labels, in which each document can belong to several categories simultaneously. In this paper we introduce the multi-label informed latent semantic indexing (MLSI) algorithm which preserves the information of inputs and meanwhile captures the correlations between the multiple outputs. The recovered "latent semantics" thus incorporate the human-annotated category information and can be used to greatly improve the prediction accuracy. Empirical study based on two data sets, Reuters-21578 and RCV1, demonstrates very encouraging results.
Text classification with kernels on the multinomial manifold BIBAFull-Text 266-273
  Dell Zhang; Xi Chen; Wee Sun Lee
Support Vector Machines (SVMs) have been very successful in text classification. However, the intrinsic geometric structure of text data has been ignored by standard kernels commonly used in SVMs. It is natural to assume that the documents are on the multinomial manifold, which is the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information metric. We prove that the Negative Geodesic Distance (NGD) on the multinomial manifold is conditionally positive definite (cpd), thus can be used as a kernel in SVMs. Experiments show the NGD kernel on the multinomial manifold to be effective for text classification, significantly outperforming standard kernels on the ambient Euclidean space.
Multi-labelled classification using maximum entropy method BIBAFull-Text 274-281
  Shenghuo Zhu; Xiang Ji; Wei Xu; Yihong Gong
Many classification problems require classifiers to assign each single document into more than one category, which is called multi-labelled classification. The categories in such problems usually are neither conditionally independent from each other nor mutually exclusive, therefore it is not trivial to directly employ state-of-the-art classification algorithms without losing information of relation among categories. In this paper, we explore correlations among categories with maximum entropy method and derive a classification algorithm for multi-labelled documents. Our experiments show that this method significantly outperforms the combination of single label approach.

Theory 2

Relevance information: a loss of entropy but a gain for IDF? BIBAFull-Text 282-289
  Arjen P. de Vries; Thomas Roelleke
When investigating alternative estimates for term discriminativeness, we discovered that relevance information and idf are much closer related than formulated in classical literature. Therefore, we revisited the justification of idf as it follows from the binary independent retrieval (BIR) model. The main result is a formal framework uncovering the close relationship of a generalised idf and the BIR model. The framework makes explicit how to incorporate relevance information into any retrieval function that involves an idf-component.
   In addition to the idf-based formulation of the BIR model, we propose Poisson-based estimates as an alternative to the classical estimates, this being motivated by the superiority of Poisson-based estimates for the within-document term frequencies. The main experimental finding is that a Poisson-based idf is superior to the classical idf, where the superiority is particularly evident for long queries.
Linear discriminant model for information retrieval BIBAFull-Text 290-297
  Jianfeng Gao; Haoliang Qi; Xinsong Xia; Jian-Yun Nie
This paper presents a new discriminative model for information retrieval (IR), referred to as linear discriminant model (LDM), which provides a flexible framework to incorporate arbitrary features. LDM is different from most existing models in that it takes into account a variety of linguistic features that are derived from the component models of HMM that is widely used in language modeling approaches to IR. Therefore, LDM is a means of melding discriminative and generative models for IR. We present two algorithms of parameter learning for LDM. One is to optimize the average precision (AP) directly using an iterative procedure. The other is a perceptron-based algorithm that minimizes the number of discordant document-pairs in a rank list. The effectiveness of our approach has been evaluated on the task of ad hoc retrieval using six English and Chinese TREC test sets. Results show that (1) in most test sets, LDM significantly outperforms the state-of-the-art language modeling approaches and the classical probabilistic retrieval model; (2) it is more appropriate to train LDM using a measure of AP rather than likelihood if the IR system is graded on AP; and (3) linguistic features (e.g. phrases and dependences) are effective for IR if they are incorporated properly.
Integrating word relationships into language models BIBAFull-Text 298-305
  Guihong Gao; Jian-Yun Nie; Jing Bai
In this paper, we propose a novel dependency language modeling approach for information retrieval. The approach extends the existing language modeling approach by relaxing the independence assumption. Our goal is to build a language model in which various word relationships can be integrated. In this work, we integrate two types of relationship extracted from WordNet and co-occurrence relationships respectively. The integrated model has been tested on several TREC collections. The results show that our model achieves substantial and significant improvements with respect to the models without these relationships. These results clearly show the benefit of integrating word relationships into language models for IR.
PageRank without hyperlinks: structural re-ranking using links induced by language models BIBAFull-Text 306-313
  Oren Kurland; Lillian Lee
Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.

Structured data

Controlling overlap in content-oriented XML retrieval BIBAFull-Text 314-321
  Charles L. A. Clarke
The direct application of standard ranking techniques to retrieve individual elements from a collection of XML documents often produces a result set in which the top ranks are dominated by a large number of elements taken from a small number of highly relevant documents. This paper presents and evaluates an algorithm that re-ranks this result set, with the aim of minimizing redundant content while preserving the benefits of element retrieval, including the benefit of identifying topic-focused components contained within relevant documents. The test collection developed by the INitiative for the Evaluation of XML Retrieval (INEX) forms the basis for the evaluation.
Publish/subscribe functionality in IR environments using structured overlay networks BIBAFull-Text 322-329
  Christos Tryfonopoulos; Stratos Idreos; Manolis Koubarakis
We study the problem of offering publish/subscribe functionality on top of structured overlay networks using data models and languages from IR. We show how to achieve this by extending the distributed hash table Chord and present a detailed experimental evaluation of our proposals.
Learning to extract information from semi-structured text using a discriminative context free grammar BIBAFull-Text 330-337
  Paul Viola; Mukund Narasimhan
In recent work, conditional Markov chain models (CMM) have been used to extract information from semi-structured text (one example is the Conditional Random Field [10]). Applications range from finding the author and title in research papers to finding the phone number and street address in a web page. The CMM framework combines a priori knowledge encoded as features with a set of labeled training data to learn an efficient extraction process. We will show that similar problems can be solved more effectively by learning a discriminative context free grammar from training data. The grammar has several distinct advantages: long range, even global, constraints can be used to disambiguate entity labels; training data is used more efficiently; and a set of new more powerful features can be introduced. The grammar based approach also results in semantic information (encoded in the form of a parse tree) which could be used for IR applications like question answering. The specific problem we consider is of extracting personal contact, or address, information from unstructured sources such as documents and emails. While linear-chain CMMs perform reasonably well on this task, we show that a statistical parsing approach results in a 50% reduction in error rate. This system also has the advantage of being interactive, similar to the system described in [9]. In cases where there are multiple errors, a single user correction can be propagated to correct multiple errors automatically. Using a discriminatively trained grammar, 93.71% of all tokens are labeled correctly (compared to 88.43% for a CMM) and 72.87% of records have all tokens labeled correctly (compared to 45.29% for the CMM).


Web-based acquisition of Japanese katakana variants BIBAFull-Text 338-344
  Takeshi Masuyama; Hiroshi Nakagawa
This paper describes a method of detecting Japanese Katakana variants from a large corpus. Katakana words, which are mainly used as loanwords, cause problems with information retrieval and so on, because transliteration creates several variations in spelling and all of these can be orthographic. Previous works manually defined Katakana rewrite rules such as "%Y" (be) and "%t%'" (ve) being replaceable with each other, for generating variants and also defined the weight of each operation to edit one string into another to detect these variants. However, these previous researches have not been able to keep up with the ever-increasing number of loanwords and their variants. With our method proposed in this paper, the weight of each edit operation is mechanically assigned based on Web data. In experiments, it performed almost as well as one with manually determined weights. Thus, the advantages of our method are: 1) need no expertise in linguistics to determine weight of each operation, and 2) able to keep up with new Katakana loanwords only by collecting text data from Web and acquiring new weights of edit operations automatically. It also achieved 98.6% recall and 86.3% precision in the task of extracting Katakana variant pairs from 38 year's worth of corpora of Japanese newspaper articles.
On the collective classification of email "speech acts" BIBAFull-Text 345-352
  Vitor R. Carvalho; William W. Cohen
We consider classification of email messages as to whether or not they contain certain "email acts", such as a request or a commitment. We show that exploiting the sequential correlation among email messages in the same thread can improve email-act classification. More specifically, we describe a new text-classification algorithm based on a dependency-network based collective classification method, in which the local classifiers are maximum entropy models based on words and certain relational features. We show that statistically significant improvements over a bag-of-words baseline classifier can be obtained for some, but not all, email-act classes. Performance improvements obtained by collective classification appears to be consistent across many email acts suggested by prior speech-act theory.
Using term informativeness for named entity detection BIBAFull-Text 353-360
  Jason D. M. Rennie; Tommi Jaakkola
Informal communication (e-mail, bulletin boards) poses a difficult learning environment because traditional grammatical and lexical information are noisy. Other information is necessary for tasks such as named entity detection. How topic-centric, or informative, a word is can be valuable information. It is well known that informative words are best modeled by "heavy-tailed" distributions, such as mixture models. However, informativeness scores do not take full advantage of this fact. We introduce a new informativeness score that directly utilizes mixture model likelihood to identify informative words. We use the task of extracting restaurant names from bulletin board posts as a way to determine effectiveness. We find that our "mixture score" is weakly effective alone and highly effective when combined with Inverse Document Frequency. We compare against other informativeness criteria and find that only Residual IDF is competitive against our combined IDF/Mixture score.


Automatic music video summarization based on audio-visual-text analysis and alignment BIBAFull-Text 361-368
  Changsheng Xu; Xi Shao; Namunu C. Maddage; Mohan S. Kankanhalli
In this paper, we propose a novel approach for automatic music video summarization based on audio-visual-text analysis and alignment. The music video is separated into the music and video tracks. For the music track, the chorus is detected based on music structure analysis. For the video track, we first segment the shots and classify the shots into close-up face shots and non-face shots, then we extract the lyrics and detect the most repeated lyrics from the shots. The music video summary is generated based on the alignment of boundaries of the detected chorus, shot class and the most repeated lyrics from the music video. The experiments on chorus detection, shot classification, and lyrics detection using 20 English music videos are described. Subjective user studies have been conducted to evaluate the quality and effectiveness of summary. The comparisons with the summaries based on our previous method and the manual method indicate that the results of summarization using the proposed method are better at meeting users' expectations.
A phonotactic-semantic paradigm for automatic spoken document classification BIBAFull-Text 369-376
  Bin Ma; Haizhou Li
We demonstrate a phonotactic-semantic paradigm for spoken document categorization. In this framework, we define a set of acoustic words instead of lexical words to represent acoustic activities in spoken languages. The strategy for acoustic vocabulary selection is studied by comparing different feature selection methods. With an appropriate acoustic vocabulary, a voice tokenizer converts a spoken document into a text-like document of acoustic words. Thus, a spoken document can be represented by a count vector, named a bag-of-sounds vector, which characterizes a spoken document's semantic domain. We study two phonotactic-semantic classifiers, the support vector machine classifier and the latent semantic analysis classifier, and their properties. The phonotactic-semantic framework constitutes a new paradigm in spoken document classification, as demonstrated by its success in the spoken language identification task. It achieves 18.2% error reduction over state-of-the-art benchmark performance on the 1996 NIST Language Recognition Evaluation database.
Boosted decision trees for word recognition in handwritten document retrieval BIBAFull-Text 377-383
  Nicholas R. Howe; Toni M. Rath; R. Manmatha
Recognition and retrieval of historical handwritten material is an unsolved problem. We propose a novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step. Decision trees with normalized pixels as features form the basis of a highly accurate AdaBoost classifier, trained on a corpus of word images that have been resized and sampled at a pyramid of resolutions. To stem problems from the highly skewed distribution of class frequencies, word classes with very few training samples are augmented with stochastically altered versions of the originals. This increases recognition performance substantially. On a standard corpus of 20 pages of handwritten material from the George Washington collection the recognition performance shows a substantial improvement in performance over previous published results (75% vs 65%). Following word recognition, retrieval is done using a language model over the recognized words. Retrieval performance also shows substantially improved results over previously published results on this database. Recognition/retrieval results on a more challenging database of 100 pages from the George Washington collection are also presented.

Question answering

Generic soft pattern models for definitional question answering BIBAFull-Text 384-391
  Hang Cui; Min-Yen Kan; Tat-Seng Chua
This paper explores probabilistic lexico-syntactic pattern matching, also known as soft pattern matching. While previous methods in soft pattern matching are ad hoc in computing the degree of match, we propose two formal matching models: one based on bigrams and the other on the Profile Hidden Markov Model (PHMM). Both models provide a theoretically sound method to model pattern matching as a probabilistic process that generates token sequences. We demonstrate the effectiveness of these models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform state-of-the-art manually constructed patterns. A critical difference between the two models is that the PHMM technique handles language variations more effectively but requires more training data to converge. We believe that both models can be extended to other areas where lexico-syntactic pattern matching can be applied.
Evaluation of resources for question answering evaluation BIBAFull-Text 392-399
  Jimmy Lin
Controlled and reproducible laboratory experiments, enabled by reusable test collections, represent a well-established methodology in modern information retrieval research. In order to confidently draw conclusions about the performance of different retrieval methods using test collections, their reliability and trustworthiness must first be established. Although such studies have been performed for ad hoc test collections, currently available resources for evaluating question answering systems have not been similarly analyzed. This study evaluates the quality of answer patterns and lists of relevant documents currently employed in automatic question answering evaluation, and concludes that they are not suitable for post-hoc experimentation. These resources, created from runs submitted by TREC QA track participants, do not produce fair and reliable assessments of systems that did not participate in the original evaluations. Potential solutions for addressing this evaluation gap and their shortcomings are discussed.
Question answering passage retrieval using dependency relations BIBAFull-Text 400-407
  Hang Cui; Renxu Sun; Keya Li; Min-Yen Kan; Tat-Seng Chua
State-of-the-art question answering (QA) systems employ term-density ranking to retrieve answer passages. Such methods often retrieve incorrect passages as relationships among question terms are not considered. Previous studies attempted to address this problem by matching dependency relations between questions and answers. They used strict matching, which fails when semantically equivalent relationships are phrased differently. We propose fuzzy relation matching based on statistical models. We present two methods for learning relation mapping scores from past QA pairs: one based on mutual information and the other on expectation maximization. Experimental results show that our method significantly outperforms state-of-the-art density-based passage retrieval methods by up to 78% in mean reciprocal rank. Relation matching also brings about a 50% improvement in a system enhanced by query expansion.

Web search 2

A study of relevance propagation for web search BIBAFull-Text 408-415
  Tao Qin; Tie-Yan Liu; Xu-Dong Zhang; Zheng Chen; Wei-Ying Ma
Different from traditional information retrieval, both content and structure are critical to the success of Web information retrieval. In recent years, many relevance propagation techniques have been proposed to propagate content information between web pages through web structure to improve the performance of web search. In this paper, we first propose a generic relevance propagation framework, and then provide a comparison study on the effectiveness and efficiency of various representative propagation models that can be derived from this generic framework. We come to many conclusions that are useful for selecting a propagation model in real-world search applications, including 1) sitemap-based propagation models outperform hyperlink-based models in sense of both effectiveness and efficiency, and 2) sitemap-based term propagation is easier to be integrated into real-world search engines because of its parallel offline implementation and acceptable complexity. Some other more detailed study results are also reported in the paper.
Relevance weighting for query independent evidence BIBAFull-Text 416-423
  Nick Craswell; Stephen Robertson; Hugo Zaragoza; Michael Taylor
A query independent feature, relating perhaps to document content, linkage or usage, can be transformed into a static, per-document relevance weight for use in ranking. The challenge is to find a good function to transform feature values into relevance scores. This paper presents FLOE, a simple density analysis method for modelling the shape of the transformation required, based on training data and without assuming independence between feature and baseline. For a new query independent feature, it addresses the questions: is it required for ranking, what sort of transformation is appropriate and, after adding it, how successful was the chosen transformation? Based on this we apply sigmoid transformations to PageRank, indegree, URL Length and ClickDistance, tested in combination with a BM25 baseline.
Detecting dominant locations from search queries BIBAFull-Text 424-431
  Lee Wang; Chuang Wang; Xing Xie; Josh Forman; Yansheng Lu; Wei-Ying Ma; Ying Li
Accurately and effectively detecting the locations where search queries are truly about has huge potential impact on increasing search relevance. In this paper, we define a search query's dominant location (QDL) and propose a solution to correctly detect it. QDL is geographical location(s) associated with a query in collective human knowledge, i.e., one or few prominent locations agreed by majority of people who know the answer to the query. QDL is a subjective and collective attribute of search queries and we are able to detect QDLs from both queries containing geographical location names and queries not containing them. The key challenges to QDL detection include false positive suppression (not all contained location names in queries mean geographical locations), and detecting implied locations by the context of the query. In our solution, a query is recursively broken into atomic tokens according to its most popular web usage for reducing false positives. If we do not find a dominant location in this step, we mine the top search results and/or query logs (with different approaches discussed in this paper) to discover implicit query locations. Our large-scale experiments on recent MSN Search queries show that our query location detection solution has consistent high accuracy for all query frequency ranges.


Challenges in running a commercial search engine BIBAFull-Text 432
  Amit Singhal
These are exciting times for Information Retrieval. Web search engines have brought IR to the masses. It now affects the lives of hundreds of millions of people, and growing, as Internet search companies launch ever more products based on techniques developed in These are exciting times for Information Retrieval. Web search engines have brought IR to the masses. It now affects the lives of hundreds of millions of people, and growing, as Internet search companies launch ever more products based on techniques developed in IR research.
   The real world poses unique challenges for search algorithms. They operate at unprecedented scales, and over a wide diversity of information. In addition, we have entered an unprecedented world of "Adversarial Information Retrieval". The lure of billions of dollars of commerce, guided by search engines, motivates all kinds of people to try all kinds of tricks to get their sites to the top of the search results.
   What techniques do people use to defeat IR algorithms? What are the evaluation challenges for a web search engine? How much impact has IR had on search engines? How does Google serve over 250 Million queries a day, often with sub-second response times? This talk will show that the world of algorithm and system design for commercial search engines can be described by two of Murphy's Laws: a) If anything can go wrong, it will, and b) If anything cannot go wrong, it will anyway.

User studies

When will information retrieval be "good enough"? BIBAFull-Text 433-440
  James Allan; Ben Carterette; Joshua Lewis
We describe a user study that examined the relationship between the quality of an Information Retrieval system and the effectiveness of its users in performing a task. The task involves finding answer facets of questions pertaining to a collection of newswire documents over a six month period. We artificially created sets of ranked lists at increasing levels of quality by blending the output of a state-of-the-art retrieval system with truth data created by annotators. Subjects performed the task by using these ranked lists to guide their labeling of answer passages in the retrieved articles. We found that as system accuracy improves, subject time on task and error rate decrease, and the rate of finding new correct answers increases. There is a large intermediary region in which the utility difference is not significant; our results suggest that there is some threshold of accuracy for this task beyond which user utility improves rapidly, but more experiments are needed to examine the area around that threshold closely.
Modeling task-genre relationships for IR in the workplace BIBAFull-Text 441-448
  Luanne Freund; Elaine G. Toms; Charles L. A. Clarke
Context influences the search process, but to date research has not definitively identified which aspects of context are the most influential for information retrieval, and thus are worthy of integration in today's retrieval systems. In this research, we isolated for examination two aspects of context: task and document genre and examined the relationship between them within a software engineering work domain. In this domain, the nature of the task has an impact on decisions of relevance and usefulness, and the document collection contains a distinctive set of genre. Our data set was a document repository created and used by our target population. The document surrogates were meta-tagged by purpose and document type. Correspondence analysis of this categorical data identified some specific relationships between genres and tasks, as well as four broad dimensions of variability underlying these relationships. These results have the potential to inform the design of a contextual retrieval system by refining search results for this domain.
Personalizing search via automated analysis of interests and activities BIBAFull-Text 449-456
  Jaime Teevan; Susan T. Dumais; Eric Horvitz
We formulate and study search algorithms that consider a user's prior interactions with a wide variety of content to personalize that user's current Web search. Rather than relying on the unrealistic assumption that people will precisely specify their intent when searching, we pursue techniques that leverage implicit information about the user's interests. This information is used to re-rank Web search results within a relevance feedback framework. We explore rich models of user interests, built from both search-related information, such as previously issued queries and previously visited Web pages, and other information about the user such as documents and email the user has read and created. Our research suggests that rich representations of the user and the corpus are important for personalization, but that it is possible to approximate these representations and provide efficient client-side algorithms for personalizing search. We show that such personalization algorithms can significantly improve on current Web search.
The loquacious user: a document-independent source of terms for query expansion BIBAFull-Text 457-464
  Diane Kelly; Vijay Deepak Dollu; Xin Fu
In this paper we investigate the effectiveness of a document-independent technique for eliciting feedback from users about their information problems. We propose that such a technique can be used to elicit terms from users for use in query expansion and as a follow-up when ambiguous queries are initially posed by users. We design a feedback form to obtain additional information from users, administer the form to users after initial querying, and create a series of experimental runs based on the information that we obtained from the form. Results demonstrate that the form was successful at eliciting more information from users and that this additional information significantly improved retrieval performance. Our results further demonstrate a strong relationship between query length and performance.

Theory 3

A study of the dirichlet priors for term frequency normalisation BIBAFull-Text 465-471
  Ben He; Iadh Ounis
In Information Retrieval (IR), the Dirichlet Priors have been applied to the smoothing technique of the language modeling approach. In this paper, we apply the Dirichlet Priors to the term frequency normalisation of the classical BM25 probabilistic model and the Divergence from Randomness PL2 model. The contributions of this paper are twofold. First, through extensive experiments on four TREC collections, we show that the newly generated models, to which the Dirichlet Priors normalisation is applied, provide robust and effective performance. Second, we propose a novel theoretically-driven approach to the automatic parameter tuning of the Dirichlet Priors normalisation. Experiments show that this tuning approach optimises the retrieval performance of the newly generated Dirichlet Priors-based weighting models.
A Markov random field model for term dependencies BIBAFull-Text 472-479
  Donald Metzler; W. Bruce Croft
This paper develops a general, formal framework for modeling term dependencies via Markov random fields. The model allows for arbitrary text features to be incorporated as evidence. In particular, we make use of features based on occurrences of single terms, ordered phrases, and unordered phrases. We explore full independence, sequential dependence, and full dependence variants of the model. A novel approach is developed to train the model that directly maximizes the mean average precision rather than maximizing the likelihood of the training data. Ad hoc retrieval experiments are presented on several newswire and web collections, including the GOV2 collection used at the TREC 2004 Terabyte Track. The results show significant improvements are possible by modeling dependencies, especially on the larger web collections.
An exploration of axiomatic approaches to information retrieval BIBAFull-Text 480-487
  Hui Fang; ChengXiang Zhai
Existing retrieval models generally do not offer any guarantee for optimal retrieval performance. Indeed, it is even difficult, if not impossible, to predict a model's empirical performance analytically. This limitation is at least partly caused by the way existing retrieval models are developed where relevance is only coarsely modeled at the level of documents and queries as opposed to a finer granularity level of terms. In this paper, we present a new axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms. The basic idea of this axiomatic approach is to search in a space of candidate retrieval functions for one that can satisfy a set of reasonable retrieval constraints. To constrain the search space, we propose to define a retrieval function inductively and decompose a retrieval function into three component functions. Inspired by the analysis of the existing retrieval functions with the inductive definition, we derive several new retrieval functions using the axiomatic retrieval framework. Experiment results show that the derived new retrieval functions are more robust and less sensitive to parameter settings than the existing retrieval functions with comparable optimal performance.
Gravitation-based model for information retrieval BIBAFull-Text 488-495
  Shuming Shi; Ji-Rong Wen; Qing Yu; Ruihua Song; Wei-Ying Ma
This paper proposes GBM (gravitation-based model), a physical model for information retrieval inspired by Newton's theory of gravitation. A mapping is built in this model from concepts of information retrieval (documents, queries, relevance, etc) to those of physics (mass, distance, radius, attractive force, etc). This model actually provides a new perspective on IR problems. A family of effective term weighting functions can be derived from it, including the well-known BM25 formula. This model has some advantages over most existing ones: First, because it is directly based on basic physical laws, the derived formulas and algorithms can have their explicit physical interpretation. Second, the ranking formulas derived from this model satisfy more intuitive heuristics than most of existing ones, thus have the potential to behave empirically better and to be used safely on various settings. Finally, a new approach for structured document retrieval derived from this model is more reasonable and behaves better than existing ones.

Web search 3

Impedance coupling in content-targeted advertising BIBAFull-Text 496-503
  Berthier Ribeiro-Neto; Marco Cristo; Paulo B. Golgher; Edleno Silva de Moura
The current boom of the Web is associated with the revenues originated from on-line advertising. While search-based advertising is dominant, the association of ads with a Web page (during user navigation) is becoming increasingly important. In this work, we study the problem of associating ads with a Web page, referred to as content-targeted advertising, from a computer science perspective. We assume that we have access to the text of the Web page, the keywords declared by an advertiser, and a text associated with the advertiser's business. Using no other information and operating in fully automatic fashion, we propose ten strategies for solving the problem and evaluate their effectiveness. Our methods indicate that a matching strategy that takes into account the semantics of the problem (referred to as AAK for "ads and keywords") can yield gains in average precision figures of 60% compared to a trivial vector-based strategy. Further, a more sophisticated impedance coupling strategy, which expands the text of the Web page to reduce vocabulary impedance with regard to an advertisement, can yield extra gains in average precision of 50%. These are first results. They suggest that great accuracy in content-targeted advertising can be attained with appropriate algorithms.
Improving web search results using affinity graph BIBAFull-Text 504-511
  Benyu Zhang; Hua Li; Yi Liu; Lei Ji; Wensi Xi; Weiguo Fan; Zheng Chen; Wei-Ying Ma
In this paper, we propose a novel ranking scheme named Affinity Ranking (AR) to re-rank search results by optimizing two metrics: (1) diversity -- which indicates the variance of topics in a group of documents; (2) information richness -- which measures the coverage of a single document to its topic. Both of the two metrics are calculated from a directed link graph named Affinity Graph (AG). AG models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. Experimental results in Yahoo! Directory, ODP Data, and Newsgroup data demonstrate that our proposed ranking algorithm significantly improves the search performance. Specifically, the algorithm achieves 31% improvement in diversity and 12% improvement in information richness relatively within the top 10 search results.
Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval BIBAFull-Text 512-519
  Elad Yom-Tov; Shai Fine; David Carmel; Adam Darlow
In this article we present novel learning methods for estimating the quality of results returned by a search engine in response to a query. Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. We demonstrate the usefulness of quality estimation for several applications, among them improvement of retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval. Experiments on TREC data demonstrate the robustness and the effectiveness of our learning algorithms.


Iterative translation disambiguation for cross-language information retrieval BIBAFull-Text 520-527
  Christof Monz; Bonnie J. Dorr
Finding a proper distribution of translation probabilities is one of the most important factors impacting the effectiveness of a cross-language information retrieval system. In this paper we present a new approach that computes translation probabilities for a given query by using only a bilingual dictionary and a monolingual corpus in the target language. The algorithm combines term association measures with an iterative machine learning approach based on expectation maximization. Our approach considers only pairs of translation candidates and is therefore less sensitive to data-sparseness issues than approaches using higher n-grams. The learned translation probabilities are used as query term weights and integrated into a vector-space retrieval system. Results for English-German cross-lingual retrieval show substantial improvements over a baseline using dictionary lookup without term weighting.
Bootstrapping dictionaries for cross-language information retrieval BIBAFull-Text 528-535
  Kornel Marko; Stefan Schulz; Olena Medelyan; Udo Hahn
The bottleneck for dictionary-based cross-language information retrieval is the lack of comprehensive dictionaries, in particular for many different languages. We here introduce a methodology by which multilingual dictionaries (for Spanish and Swedish) emerge automatically from simple seed lexicons. These seed lexicons are automatically generated, by cognate mapping, from (previously manually constructed) Portuguese and German as well as English sources. Lexical and semantic hypotheses are then validated and new ones iteratively generated by making use of co-occurrence patterns of hypothesized translation synonyms in parallel corpora. We evaluate these newly derived dictionaries on a large medical document collection within a cross-language retrieval setting.
A maximum coherence model for dictionary-based cross-language information retrieval BIBAFull-Text 536-543
  Yi Liu; Rong Jin; Joyce Y. Chai
One key to cross-language information retrieval is how to efficiently resolve the translation ambiguity of queries given their short length. This problem is even more challenging when only bilingual dictionaries are available, which is the focus of this paper. In the previous research of cross-language information retrieval using bilingual dictionaries, the word co-occurrence statistics is used to determine the most likely translations of queries. In this paper, we propose a novel statistical model, named "maximum coherence models", which estimates the translation probabilities of query words that are consistent with the word co-occurrence statistics. Unlike the previous work, where a binary decision is made for the selection of translations, the new model maintains the uncertainty in translating query words when their sense ambiguity is difficult to resolve. Furthermore, this new model is able to estimate translations of multiple query words simultaneously. This is in contrast to many previous approaches where translations of individual query words are determined independently. Empirical studies with TREC datasets have shown that the maximum coherence model achieves a relative 10% - 40% improvement in cross-language information retrieval, comparing to other approaches that also use word co-occurrence statistics for sense disambiguation.

Video and image

Hidden Markov models for automatic annotation and content-based retrieval of images and video BIBAFull-Text 544-551
  Arnab Ghoshal; Pavel Ircing; Sanjeev Khudanpur
This paper introduces a novel method for automatic annotation of images with keywords from a generic vocabulary of concepts or objects for the purpose of content-based image retrieval. An image, represented as sequence of feature-vectors characterizing low-level visual features such as color, texture or oriented-edges, is modeled as having been stochastically generated by a hidden Markov model, whose states represent concepts. The parameters of the model are estimated from a set of manually annotated (training) images. Each image in a large test collection is then automatically annotated with the a posteriori probability of concepts present in it. This annotation supports content-based search of the image-collection via keywords. Various aspects of model parameterization, parameter estimation, and image annotation are discussed. Empirical retrieval results are presented on two image-collections | COREL and key-frames from TRECVID. Comparisons are made with two other recently developed techniques on the same datasets.
Exploiting ontologies for automatic image annotation BIBAFull-Text 552-558
  Munirathnam Srikanth; Joshua Varner; Mitchell Bowden; Dan Moldovan
Automatic image annotation is the task of automatically assigning words to an image that describe the content of the image. Machine learning approaches have been explored to model the association between words and images from an annotated set of images and generate annotations for a test image. The paper proposes methods to use a hierarchy defined on the annotation words derived from a text ontology to improve automatic image annotation and retrieval. Specifically, the hierarchy is used in the context of generating a visual vocabulary for representing images and as a framework for the proposed hierarchical classification approach for automatic image annotation. The effect of using the hierarchy in generating the visual vocabulary is demonstrated by improvements in the annotation performance of translation models. In addition to performance improvements, hierarchical classification approaches yield well to constructing multimedia ontologies.
A database centric view of semantic image annotation and retrieval BIBAFull-Text 559-566
  Gustavo Carneiro; Nuno Vasconcelos
We introduce a new model for semantic annotation and retrieval from image databases. The new model is based on a probabilistic formulation that poses annotation and retrieval as classification problems, and produces solutions that are optimal in the minimum probability of error sense. It is also database centric, by establishing a one-to-one mapping between semantic classes and the groups of database images that share the associated semantic labels. In this work we show that, under the database centric probabilistic model, optimal annotation and retrieval can be implemented with algorithms that are conceptually simple, computationally efficient, and do not require prior semantic segmentation of training images. Due to its simplicity, the annotation and retrieval architecture is also amenable to sophisticated parameter tuning, a property that is exploited to investigate the role of feature selection in the design of optimal annotation and retrieval systems. Finally, we demonstrate the benefits of simply establishing a one-to-one mapping between keywords and the states of the semantic classification problem over the more complex, and currently popular, joint modeling of keyword and visual feature distributions. The database centric probabilistic retrieval model is compared to existing semantic labeling and retrieval methods, and shown to achieve higher accuracy than the previously best published results, at a fraction of their computational cost.


Analysis of factoid questions for effective relation extraction BIBAFull-Text 567-568
  Eugene Agichtein; Silviu Cucerzan; Eric Brill
We present an analysis of the structured relationships observed in a randomly sampled set of question-like queries submitted to a search engine for a popular online encyclopedic document collection. Our study shows that a relatively small number of binary relationships account for most of the queries in the sample. This empirically validates an approach of analyzing query logs to identify the relationships most relevant to user needs and populating corresponding fact tables from the collection for factoid question answering. Our analysis shows that such an approach can lead to substantial coverage of user questions.
A testbed for people searching strategies in the WWW BIBAFull-Text 569-570
  Javier Artiles; Julio Gonzalo; Felisa Verdejo
This paper describes the creation of a testbed to evaluate people searching strategies on the World-Wide-Web. This task involves resolving person names' ambiguity and locating relevant information characterising every individual under the same name.
Measure-based metasearch BIBAFull-Text 571-572
  Javed A. Aslam; Virgiliu Pavlu; Emine Yilmaz
We propose a simple method for converting many standard measures of retrieval performance into metasearch algorithms. Our focus is both on the analysis of retrieval measures themselves and on the development of new metasearch algorithms. Given the conversion method proposed, our experimental results using TREC data indicate that system-oriented measures of overall retrieval performance (such as average precision) yield good metasearch algorithms whose performance equals or exceeds that of benchmark techniques such as CombMNZ and Condorcet.
A geometric interpretation of r-precision and its correlation with average precision BIBAFull-Text 573-574
  Javed A. Aslam; Emine Yilmaz; Virgiliu Pavlu
We consider two of the most commonly cited measures of retrieval performance: average precision and R-precision. It is well known that average precision and R-precision are highly correlated and similarly robust measures of performance, though the reasons for this are not entirely clear. In this paper, we give a geometric argument which shows that under a very reasonable set of assumptions, average precision and R-precision both approximate the area under the precision-recall curve, thus explaining their high correlation. We further demonstrate through the use of TREC data that the similarity or difference between average precision and R-precision is largely governed by the adherence to, or violation of, these reasonable assumptions.
Probabilistic hyperspace analogue to language BIBAFull-Text 575-576
  Leif Azzopardi; Mark Girolami; Malcolm Crowe
Song and Bruza [6] introduce a framework for Information Retrieval (IR) based on Gardenfor's three tiered cognitive model; Conceptual Spaces[4]. They instantiate a conceptual space using Hyperspace Analogue to Language (HAL[3] to generate higher order concepts which are later used for ad-hoc retrieval. In this poster, we propose an alternative implementation of the conceptual space by using a probabilistic HAL space (pHAL). To evaluate whether converting to such an implementation is beneficial we have performed an initial investigation comparing the concept combination of HAL against pHAL for the task of query expansion. Our experiments indicate that pHAL outperforms the original HAL method and that better query term selection methods can improve performance on both HAL and pHAL.
Basic issues on the processing of web queries BIBAFull-Text 577-578
  Claudine Badue; Ramurti Barbosa; Paulo Golgher; Berthier Ribeiro-Neto; Nivio Ziviani
In this paper we study three basic and key issues related to Web query processing: load balance, broker behavior, and performance by individual index servers. Our study, while preliminary, does reveal interesting tradeoffs: (1) load unbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) the broker is not a bottleneck, and (3) disk utilization is higher than CPU utilization.
An interface to search human movements based on geographic and chronological metadata BIBAFull-Text 579-580
  Wilma Bainbridge; Ryen W. White; Douglas W. Oard
Historians and scholars can better understand historic events by studying the geographic and chronological activity of individuals who witnessed them. A lack of adequate tools to help users study these activities can hinder the process of learning and discovery. In this paper we present an interface to address this problem that contains three components: a map, a timeline, and a text representation of a survivor's movements. These components simultaneously provide query input (where users can specify their needs) and dynamic results display (where users can immediately see the effect of their decisions). The results of a pilot study show that users reacted positively to the interface.
Automatic web query classification using labeled and unlabeled training data BIBAFull-Text 581-582
  Steven M. Beitzel; Eric C. Jensen; Ophir Frieder; David Grossman; David D. Lewis; Abdur Chowdhury; Aleksandr Kolcz
Accurate topical categorization of user queries allows for increased effectiveness, efficiency, and revenue potential in general-purpose web search systems. Such categorization becomes critical if the system is to return results not just from a general web collection but from topic-specific databases as well. Maintaining sufficient categorization recall is very difficult as web queries are typically short, yielding few features per query. We examine three approaches to topical categorization of general web queries: matching against a list of manually labeled queries, supervised learning of classifiers, and mining of selectional preference rules from large unlabeled query logs. Each approach has its advantages in tackling the web query classification recall problem, and combining the three techniques allows us to classify a substantially larger proportion of queries than any of the individual techniques. We examine the performance of each approach on a real web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of the best single approach by nearly 20%, with a 7% improvement in overall effectiveness.
Surrogate scoring for improved metasearch precision BIBAFull-Text 583-584
  Steven M. Beitzel; Eric C. Jensen; Ophir Frieder; Abdur Chowdhury; Greg Pass
We describe a method for improving the precision of metasearch results based upon scoring the visual features of documents' surrogate representations. These surrogate scores are used during fusion in place of the original scores or ranks provided by the underlying search engines. Visual features are extracted from typical search result surrogate information, such as title, snippet, URL, and rank. This approach specifically avoids the use of search engine-specific scores and collection statistics that are required by most traditional fusion strategies. This restriction correctly reflects the use of metasearch in practice, in which knowledge of the underlying search engines' strategies cannot be assumed. We evaluate our approach using a precision-oriented test collection of manually-constructed binary relevance judgments for the top ten results from ten web search engines over 896 queries. We show that our visual fusion approach significantly outperforms the rCombMNZ fusion algorithm by 5.71%, with 99% confidence, and the best individual web search engine by 10.9%, with 99% confidence.
Detecting action-items in e-mail BIBKFull-Text 585-586
  Paul N. Bennett; Jaime Carbonell
Keywords: n-grams, SVMs, e-mail, text classification
Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem BIBAFull-Text 587-588
  Roi Blanco; Alvaro Barreiro
In this poster, we analyze recent work in the document identifiers reassignment problem. After that, we present a formalization of a simple case of the problem as a PSP (Pattern Sequencing Problem). This may facilitate future work as it opens a new research line to solve the general problem.
Testing algorithms is like testing students BIBAFull-Text 589-590
  David Bodoff; Pu Li
In this paper, we apply methods from educational testing to measure the reliability of an IR collection.
Evaluating the impact of selection noise in community-based web search BIBAFull-Text 591-592
  A Oisin-Boydell; Barry Smyth; Cathal Gurrin; Alan F. Smeaton
The I-SPY meta-search engine uses a technique called collaborative Web search to leverage the past search behaviour (queries and selections) of a community of users in order to promote search results that are relevant to the community. In this paper we describe recent studies to clarify the benefits of this approach in situations when the behaviour of users cannot be relied upon in terms of their ability to consistently select relevant results during search sessions.
Expectation of f-measures: tractable exact computation and some empirical observations of its properties BIBAFull-Text 593-594
  Kian Ming Adam Chai
We derive a tractable and exact computation for the expectation of F-measures. We also demonstrate the non-convexity of this expectation, and investigate errors of approximating the expectation under different settings.
Search engines and how students think they work BIBAFull-Text 595-596
  E. N. Efthimiadis; D. G. Hendry
To investigate the nature of people's understandings for how search engines work, we collected data from 232 undergraduate and graduate students. Students were asked to "draw a labeled sketch of how search engines work." A reference model was constructed and each sketch was analyzed and compared against it for completeness. The paper presents preliminary results and discusses the implications for educational assessment and curriculum design on the one hand, and information system design on the other.
On evaluation of adaptive topic tracking systems BIBAFull-Text 597-598
  Tamer Elsayed; Douglas W. Oard
Summative evaluation methods for supervised adaptive topic tracking systems convolve the effect of system decisions on present utility with the effect on future utility. This paper describes a new formative evaluation approach that focuses on future utility for use in the design stage of adaptive systems. Topic model quality is assessed at a predefined set of points using a fixed document set to enhance comparability. Experiments using a vector-space topic tracking system illustrate the utility of this approach to formative evaluation.
Top subset retrieval on large collections using sorted indices BIBAFull-Text 599-600
  Paul Ferguson; Alan F. Smeaton; Cathal Gurrin; Peter Wilkins
In this poster we describe alternative inverted index structures that reduce the time required to process queries, produce a higher query throughput and still return high quality results to the end user. We give results based upon the TREC Terabyte dataset showing improvements that these indices give in terms of effectiveness and efficiency.
Relation between PLSA and NMF and implications BIBAFull-Text 601-602
  Eric Gaussier; Cyril Goutte
Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solves the problem of NMF with KL divergence, and then explore the implications of this relationship.
The impact of evaluation on multilingual text retrieval BIBAFull-Text 603-604
  Julio Gonzalo; Carol Peters
We summarize the impact of the first five years of activity of the Cross-Language Evaluation Forum (CLEF) on multilingual text retrieval system performance and show how the CLEF evaluation campaigns have contributed to advances in the state-of-the-art.
Using Oracle for natural language document retrieval an automatic query reformulation approach BIBAFull-Text 605-606
  Jens Grivolla
In corporate applications, vast amounts of data are often stored in database systems such as Oracle. Apart from structured information this can include text documents which cannot easily be retrieved using traditional SQL queries.
   Oracle includes means to deal with full text document retrieval (called Oracle Text) that offer special query operators for searches inside text fields. We have explored the effect of these different operators for queries derived from natural language queries. This article compares the retrieval performances achieved with different automatic reformulations from natural language to Oracle SQL queries.
Customizing information access according to domain and task knowledge: the ontoExplo system BIBAFull-Text 607-608
  Nathalie Hernandez; Josiane Mothe; Sandra Poulain
In this paper we present a system that allows a user to explore or mine a document collection. This system is based on domain and task knowledge modelled in the form of ontologies and allows direct access both to information as it is stored and to information that is built from it. The system has been developed in Java.
Evaluating semantic indexing techniques through cross-language fingerprinting BIBAFull-Text 609-610
  Eduard Hoenkamp; Sander van Dijk
Users in search of on-line document sources are usually looking for content, not words. Hence, IR researchers generally agree that search techniques should be geared toward the meaning underlying documents rather than toward the text itself. The most visible examples of such techniques are Latent Semantic Analysis (LSA), and the Hyperspace Analog to Language (HAL). If these techniques really uncover semantic dependencies, then they should be applicable across languages. We investigated this using electronic versions of three kinds of translated material: a novel, a popular treatise about cosmology, and a data base of technical specifications. We used the analogy of fingerprinting used in forensics to establish if individuals are related. Genetic fingerprinting uses enzymes to split the DNA and then compare the resulting band patterns. Likewise, in our research we use queries to split a document into fragments. If a search technique really isolates fragments related to the query, then a document and its translation should have similar band patterns. In this paper we (1) present the fingerprinting technique, (2) introduce the material used, and (3) report preliminary results of an evaluation for two semantic indexing techniques.
Live visual relevance feedback for query formulation BIBAFull-Text 611-612
  Eduard Hoenkamp; Gijs van Dinther
Users browsing the Internet seem relatively satisfied with the performance of search engines. An optimistic explanation would be the high quality of search engines. A more pessimistic one would be that people just adapt easily to any new technology. A third explanation is people's ignorance about recall: as they simply don't know what relevant documents are missed, they can hardly be expected to worry about them. And so they easily conceive the result as the best they can get. To allow the user to better assess the quality of the search results, an algorithm was developed that computes a visual representation of the document space in the neighborhood of the user's query.
   The paper (1) outlines the algorithm, (2) shows how users can explore the neighborhood of a query, and (3) demonstrates how users can guess more judiciously whether they need to further elaborate their query to improve retrieval results.
A dual index model for contextual information retrieval BIBAFull-Text 613-614
  Xiangji Huang; Yan Rui Huang; Miao Wen
In this paper, we propose a dual index model for contextual IR. For each query, we search against both document level and passage level indexes, and use the corresponding merge function to update the weights for both documents and paragraphs by combining the results from both indexes according to the granularity information in metadata. Experiments on 2004 TREC data show that a significant improvement can be made by using the dual index model.
Predicting query difficulty on the web by learning visual clues BIBAFull-Text 615-616
  Eric C. Jensen; Steven M. Beitzel; David Grossman; Ophir Frieder; Abdur Chowdhury
We describe a method for predicting query difficulty in a precision-oriented web search task. Our approach uses visual features from retrieved surrogate document representations (titles, snippets, etc.) to predict retrieval effectiveness for a query. By training a supervised machine learning algorithm with manually evaluated queries, visual clues indicative of relevance are discovered. We show that this approach has a moderate correlation of 0.57 with precision at 10 scores from manual relevance judgments of the top ten documents retrieved by ten web search engines over 896 queries. Our findings indicate that difficulty predictors which have been successful in recall-oriented ad-hoc search, such as clarity metrics, are not nearly as correlated with engine performance in precision-oriented tasks such as this, yielding a maximum correlation of 0.3. Additionally, relying only on visual clues avoids the need for collection statistics that are required by these prior approaches. This enables our approach to be employed in environments where these statistics are unavailable or costly to retrieve, such as metasearch.
Finding semantically similar questions based on their answers BIBAFull-Text 617-618
  Jiwoon Jeon; W. Bruce Croft; Joon Ho Lee
A large number of question and answer pairs can be collected from question and answer boards and FAQ pages on the Web. This paper proposes an automatic method of finding the questions that have the same meaning. The method can detect semantically similar questions that have little word overlap because it calculates question-question similarities by using the corresponding answers as well as the questions. We develop two different similarity measures based on language modeling and compare them with the traditional similarity measures. Experimental results show that semantically similar questions pairs can be effectively found with the proposed similarity measures.
Study of cross lingual information retrieval using on-line translation systems BIBAFull-Text 619-620
  Rong Jin; Joyce Y. Chai
Typical cross language retrieval requires special linguistic resources, such as bilingual dictionaries and parallel corpus. In this study, we focus on the cross lingual retrieval problem that only uses online translation systems. We compare two approaches: a translation-based approach that directly translates queries into the language of documents and then applies traditional information retrieval techniques; and a model-based approach that first learns a statistical translation model from the translations acquired from an online translation system and then applies the learned statistical model to cross lingual information retrieval. Our empirical study with ImageCLEF has shown the model-based approach performs significantly better than the translation-based approach.
3D viewpoint-based photo search and information browsing BIBAFull-Text 621-622
  Rieko Kadobayashi; Katsumi Tanaka
We propose a new photo search method that uses three-dimensional (3D) viewpoints as queries. 3D viewpoint-based image retrieval is especially useful for searching collections of archaeological photographs, which contain many different images of the same object. Our method is designed to enable users to retrieve images that contain the same object but show a different view, and to browse groups of images taken from a similar viewpoint. We also propose using 3D scenes to query by example, which means that users do not have the problem of trying to formulate appropriate queries. This combination gives users an easy way of accessing not only photographs but also archived information.
Examination and enhancement of a ring-structured graphical search interface based on usability testing BIBAFull-Text 623-624
  Tomoko Kajiyama; Noriko Kand; Shin'ichi Satoh
We evaluated the interactive retrieval functionality of the Concentric Ring View according to a series of usability studies. This is a ring structure-based graphical user interface, like a planisphere, for image retrieval with multi-faceted metadata. Attribute values for each facet are arranged on a ring, and retrieved images are displayed inside using search keys derived from the attribute values on the bottom part of the rings. By rotating the rings, users can browse retrieved images while adjusting search keys. The first usability test conducted with thirty six participants confirmed that: (i) novice users, even junior high school students, could use this interface; (ii) users could find images better than anticipated; and (iii) the interface was good at choosing the first relevant image, but users could not refine retrieval because they were unable to reuse retrieved results. To solve this problem, we added two functionalities, personal history for reuse and relevance feedback. With these improvements, we named the new version of the interface Concentric Ring View F+. A second usability test with seven participants confirmed the effectiveness of this newer interface.
Short comings of latent models in supervised settings BIBAFull-Text 625-626
  Vijay Krishnan
The Aspect Model [1, 2] and the Latent Dirichlet Allocation Model [3, 4] are latent generative models proposed with the objective of modeling discrete data such as text. Though it is not explicitly published (to the best of our knowledge), it is reasonably well known in there search community that the Aspect Model does not perform very well in supervised settings and also that latent models are frequently not identifiable, i.e. their optimal parameters are not unique.
   In this paper, we make a much stronger claim about the pitfalls of commonly-used latent models. By constructing a small, synthetic, but by no means unrealistic corpus, we show that latent models have inherent limitations that prevent them from recovering semantically meaningful parameters from data generated from a reasonable generative distribution. In fact, our experiments with supervised classification using the Aspect Model, showed that its performance was rather poor, even worse than Naive Bayes, leading us to the synthetic study.
   We also analyze the scenario of using tempered EM and show that it would not plug the above shortcomings. Our analysis suggests that there is also some scope for improvement in the Latent Dirichlet Allocation Model (LDA) [3, 4]. We then use our insight into the shortcomings of these models, to come up with a promising variant of the LDA, that does not suffer from the aforesaid drawbacks. This could potentially lead to much better performance and model fit, in the supervised scenario.
Major topic detection and its application to opinion summarization BIBKFull-Text 627-628
  Lun-Wei Ku; Li-Ying Lee; Tung-Ho Wu; Hsin-Hsi Chen
Keywords: opinion summarization, sentence retrieval, topic detection
Using query term order for result summarisation BIBAFull-Text 629-630
  Shao Fen Liang; Siobhan Devlin; John Tait
We report on two experiments performed to test the importance of Term Order in automatic summarisation. Experiment one was undertaken as part of DUC 2004 to which three systems were submitted, each with a different summarisation approach. The system that used document Term Order outperformed those that did not use Term Order in the ROUGE evaluation. Experiment two made use of human evaluations of search engine results, comparing our Query Term Order summaries with a simulation of current Google search engine result summaries in terms of summary quality. Our QTO system's summaries aided users' relevance judgements to a significantly greater extent than Google's.
Profile-based event tracking BIBAFull-Text 631-632
  Baoli Li; Wenjie Li; Qin Lu; Mingli Wu
In this research, we focus on tracking topics that originate and evolve from a specific event. Intuitively, a few key elements of a target event, such as date, location, and persons involved, would be enough for making a decision on whether a test story is on-topic. Consequently, a profile-based event tracking method is proposed. We attempt to build an event profile from the given on-topic stories by robust information retrieval technologies. A feature selection metric and a recognized event clause are utilized to determine most (if not all) key semantic elements of the target event. Preliminary experiments on the TDT2 mandarin corpus show that this profile-based event tracking method is promising.
Analysis of recursive feature elimination methods BIBKFull-Text 633-634
  Fan Li; Yiming Yang
Keywords: feature selection, machine learning, text categorization
Assessing the term independence assumption in blind relevance feedback BIBAFull-Text 635-636
  Jimmy Lin; G. Craig Murray
When applying blind relevance feedback for ad hoc document retrieval, is it possible to identify, a priori, the set of query terms that will most improve retrieval performance? Can this complex problem be reduced into the simpler one of making independent decisions about the performance effects of each query term? Our experiments suggest that, for the selection of terms for blind relevance feedback, the term independence assumption may be empirically justified.
Revisiting the effect of topic set size on retrieval error BIBKFull-Text 637-638
  Wei-Hao Lin; Alexander Hauptmann
Keywords: measurement error, test collections
Information sharing through rational links and viewpoint retrieval BIBAFull-Text 639-640
  Bicheng Liu; David J. Harper; Stuart Watt
In this paper we present the concept of Federated Information Sharing Communities (FISC), which leverages organisational and social relationships with document content to provide community-centred information sharing and communication environments. Prominence is given to capabilities that go beyond the generic retrieval of documents to include the ability to retrieve people, their interests and inter-relationships. We focus on providing social awareness "in the large" to help users understand the members within community and the relationships between them. Within the FISC framework, we provide viewpoint retrieval to enable a user to construct member-specific view(s) of the community, based on their various topic interests. As proof of concept, we present the first FISC prototype based on the twenty-five year SIGIR collection and examples of operational results.
Mining multimedia salient concepts for incremental information extraction BIBAFull-Text 641-642
  Josao Magalhaes; Stefan Ruger
We propose a novel algorithm for extracting information by mining the feature space clusters and then assigning salient concepts to them. Bayesian techniques for extracting concepts from multimedia usually suffer either from lack of data or from too complex concepts to be represented by a single statistical model. An incremental information extraction approach, working at different levels of abstraction, would be able to handle concepts of varying complexities. We present the results of our research on the initial part of an incremental approach, the extraction of the most salient concepts from multimedia information.
Translating pieces of words BIBAFull-Text 643-644
  Paul McNamee; James Mayfield
Translation for cross-language information retrieval need not be word-based. We show that character n-grams in one language can be 'translated' into character n-grams of another language. We demonstrate that such translations produce retrieval results on par with, and often exceeding, those of word-based and stem-based translation.
Cross-language text classification BIBKFull-Text 645-646
  J. Scott Olsson; Douglas W. Oard; Jan Hajic
Keywords: cross-language text classification, topic classification
A temporally adaptive content-based relevance ranking algorithm BIBAFull-Text 647-648
  Jukka Perkio; Wray Buntine; Henry Tirri
In information retrieval relevance ranking of the results is one of the most important single tasks there are. There are many different ranking algorithms based on the content of the documents or on some external properties e.g. link structure of html documents.
   We present a temporally adaptive content-based relevance ranking algorithm that explicitly takes into account the temporal behavior of the underlying statistical properties of the documents in the form of a statistical topic model. more we state that our algorithm can be used on top of any ranking algorithm.
Automated evaluation of search engine performance via implicit user feedback BIBAFull-Text 649-650
  Himanshu Sharma; Bernard J. Jansen
Measuring the information retrieval effectiveness of Web search engines can be expensive if human relevance judgments are required to evaluate search results. Using implicit user feedback for search engine evaluation provides a cost and time effective manner of addressing this problem. Web search engines can use human evaluation of search results without the expense of human evaluators. An additional advantage of this approach is the availability of real time data regarding system performance. Wecapture user relevance judgments actions such as print, save and bookmark, sending these actions and the corresponding document identifiers to a central server via a client application. We use this implicit feedback to calculate performance metrics, such as precision. We can calculate an overall system performance metric based on a collection of weighted metrics.
Dependency relation matching for answer selection BIBKFull-Text 651-652
  Renxu Sun; Hang Cui; Keya Li; Min-Yen Kan; Tat-Seng Chua
Keywords: answer selection, dependency relation matching, question answering
Using dragpushing to refine centroid text classifiers BIBAFull-Text 653-654
  Songbo Tan; Xueqi Cheng; Bin Wang; Hongbo Xu; Moustafa M. Ghanem; Yike Guo
We present a novel algorithm, DragPushing, for automatic text classification. Using a training data set, the algorithm first calculates the prototype vectors, or centroids, for each of the available document classes. Using misclassified examples, it then iteratively refines these centroids; by dragging the centroid of a correct class towards a misclassified example and in the same time pushing the centroid of an incorrect class away from the misclassified example. The algorithm is simple to implement and is computationally very efficient. Evaluation experiments conducted on two benchmark collections show that its classification accuracy is comparable to that of more complex methods, such as support vector machines (SVM).
Scalable hierarchical topic detection: exploring a sample based approach BIBAFull-Text 655-656
  Dolf Trieschnigg; Wessel Kraaij
Hierarchical topic detection is a new task in the TDT 2004 evaluation program, which aims to organize an unstructured news collection in a directed acyclic graph (DAG) structure, reflecting the topics discussed. We present a scalable architecture for HTD and compare several alternative choices for agglomerative clustering and DAG optimization in order to minimize the HTD cost metric.
Noun sense induction using web search results BIBAFull-Text 657-658
  Goldee Udani; Shachi Dave; Anthony Davis; Tim Sibley
This paper presents an algorithm for unsupervised noun sense induction, based on clustering of Web search results. The algorithm does not utilize labeled training instances or any other external knowledge source. Preliminary results on a small dataset show that this technique provides two advantages over other techniques in the literature: it detects real-world senses not found in dictionaries or other lexical resources, and it does not require that the number of word senses be specified in advance.
Self-organizing distributed collaborative filtering BIBAFull-Text 659-660
  Jun Wang; Marcel J. T. Reinders; Reginald L. Lagendijk; Johan Pouwelse
We propose a fully decentralized collaborative filtering approach that is self-organizing and operates in a distributed way. The relevances between downloading files (items) are stored locally at these items in so called item-based buddy tables and are updated each time that the items are downloaded. We then propose to use the language model to build recommendations for the different users based on the buddy tables of those items a user has downloaded previously. We have tested and compared our distributed collaborative filtering approach to centralized collaborative filtering and showed that it has similar performance. It is therefore a promising technique to facilitate recommendations in peer-to-peer networks.
Dirichlet PageRank BIBAFull-Text 661-662
  Xuanhui Wang; Azadeh Shakery; Tao Tao
PageRank has been known to be a successful algorithm in ranking web sources. In order to avoid the rank sink problem, PageRank assumes that a surfer, being in a page, jumps to a random page with a certain probability. In the standard PageRank algorithm, the jumping probabilities are assumed to be the same for all the pages, regardless of the page properties. This is not the case in the real world, since presumably a surfer would more likely follow the out-links of a high-quality hub page than follow the links of a low-quality one. In this poster, we propose a novel algorithm "Dirichlet PageRank" to address this problem by adapting exible jumping probabilities based on the number of out-links in a page. Empirical results on TREC data show that our method outperforms the standard PageRank algorithm.
A retrospective study of probabilistic context-based retrieval BIBAFull-Text 663-664
  H. C. Wu; R. W. P. Luk; K. F. Wong; K. L. Kwok; W. J. Li
We propose a novel probabilistic retrieval model which weights terms according to their contexts in documents. The term weighting function of our model is similar to the language model and the binary independence model. The retrospective experiments (i.e., relevance information is present) illustrate the potential of our probabilistic context-based retrieval where the precision at the top 30 documents is about 43% for TREC-6 data and 52% for TREC-7 data.
Indexing emails and email threads for retrieval BIBAFull-Text 665-666
  Yejun Wu; Douglas W. Oard
Electronic mail poses a number of unusual challenges for the design of information retrieval systems and test collections, including informal expression, conversational structure, variable document granularity (e.g., messages, threads, or longer-term interactions), a naturally occuring integration between free text and structural metadata, and incompletely characterized user needs. This paper reports on initial experiments with a large collection of public mailing lists from the World Wide Web consortium that will be used for the TREC 2005 Enterprise Search Track. Automatic subject-line threading and removal of duplicated text were found to have little effect in a small pilot study. Those observations motivated development of a question typology and more detailed analysis of collection characteristics; preliminary results for both are reported.
Intelligent fusion of structural and citation-based evidence for text classification BIBAFull-Text 667-668
  Baoping Zhang; Yuxin Chen; Weiguo Fan; Edward A. Fox; Marcos Andre Goncalves; Marco Cristo; Pavel Calado
This paper shows how different measures of similarity derived from the citation information and the structural content (e.g., title, abstract) of the collection can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. Effectiveness of the similarity functions discovered through simple majority voting is better than that of content-based as well as combination-based Support Vector Machine classifiers. Experiments also were conducted to compare the performance between GP techniques and other fusion techniques such as Genetic Algorithms (GA) and linear fusion. Empirical results show that GP was able to discover better similarity functions than other fusion techniques.
Mining translations of OOV terms from the web through cross-lingual query expansion BIBAFull-Text 669-670
  Ying Zhang; Fei Huang; Stephan Vogel
Translating out-of-vocabulary (OOV) terms is a great challenge for the Cross-lingual Information Retrieval and Data-driven Machine Translation systems. Several approaches have been proposed to mine translations for OOV terms from the web, especially from pages containing mixed languages. In this paper, we propose a novel approach to automatically translate OOV terms on the fly through cross-lingual query expansion. The proposed approach does not require any web crawling and has achieved an inclusion rate of 95% and overall translation accuracy of 90%, outperforming state-of-the-art OOV translation techniques.
On redundancy of training corpus for text categorization: a perspective of geometry BIBKFull-Text 671-672
  Shuigeng Zhou; Jihong Guan
Keywords: kNN text categorization, redundancy, training corpus


An industrial-strength content-based music recommendation system BIBAFull-Text 673
  Pedro Cano; Markus Koppenberger; Nicolas Wack
We present a metadata free system for the interaction with massive collections of music, the MusicSurfer. MusicSurfer automatically extracts descriptions related to instrumentation, rhythm and harmony from music audio signals. Together with efficient similarity metrics, the descriptions allow navigation of multimillion track music collections in a flexible and efficient way without the need of metadata or human ratings.
SPIN: searching personal information networks BIBKFull-Text 674
  Soumen Chakrabarti; Jeetendra Mirchandani; Arnab Nandi
Keywords: graphical models for search, information extraction and integration, personal information management
A CLIR interface to a web search engine BIBKFull-Text 675
  Philipp Daumke; Stefan Schulz; Kornel Marko
Keywords: cross language information retrieval, medical information retrieval
Music-to-knowledge (M2K): a prototyping and evaluation environment for music information retrieval research BIBFull-Text 676
  J. Stephen Downie; Andreas F. Ehmann; David Tcheng
A wireless natural language search engine BIBAFull-Text 677
  Jochen L. Leidner
Web search using stationary (desktop) computers has become a pervasive activity. The mobile user in need of information, however, faces several problems in his or her quest to satisfy an information need. Mobile devices have small displays, and mobile user interfaces are often less then usable, because they impose the desktop Web search paradigm on the mobile user. We present a wireless search engine based on natural language queries transmitted via popular Small Message Service (SMS) text messages. Besides traditional keyword based queries, the system can accept questions or phrases and returns responses that contain likely answers (Figure 1) instead of traditional lists of hyperlinks. The additional precision gained from performing a linguistic analysis of the query helps extracting answers from Web pages directly, which requires no navigation. The system is implemented using a NLIR system residing on a server, which can translate questions or phrases into search engine queries or queries to SOAP Web services, where a gateway mediates between the mobile network and the Internet (Figure 2). Whereas on the desktop keyboard-based search still prevails, we find that in a mobile context question answering techniques can help overcome the output constraints.
The recap system for identifying information flow BIBKFull-Text 678
  Donald Metzler; Yaniv Bernstein; W. Bruce Croft; Alistair Moffat; Justin Zobel
Keywords: information flow, statistical translation, text reuse
Hierarchical text summarization for WAP-enabled mobile devices BIBAFull-Text 679
  Dragomir Radev; Omer Kareem; Jahna Otterbacher
We present WAP MEAD, a WAP-enabled text summarization system. It incorporates a state-of-the art text summarizer enhanced to produce hierarchical summaries that are appropriate for various types of mobile devices, including cellular phones.
Manjal: a text mining system for MEDLINE BIBKFull-Text 680
  Aditya Kumar Sehgal; Padmini Srinivasan
Keywords: closed discovery, open discovery, text mining, topic profile
UCAIR: a personalized search toolbar BIBKFull-Text 681
  Xuehua Shen; Bin Tan; ChengXiang Zhai
Keywords: contextual search, implicit feedback, personalization
A web mining research platform BIBAFull-Text 682
  David Sherfesee; Niall O'Driscoll
We demonstrate the Alexa Web Mining Platform, a data mining and web service publication platform designed to enable analysis of Alexa's massive web data store. The system provides researchers and developers high speed access to our web crawl, crawl metadata, long term storage, and data publication utilities. We demonstrate the system's capabilities and user interface.
Multi-faceted information retrieval system for large scale email archives BIBKFull-Text 683
  Ville H. Tuulos; Jukka Perkio; Henry Tirri
Keywords: content-based ranking, email