
Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Kalervo Järvelin; James Allan; Peter Bruza; Mark Sanderson
Location: Sheffield, United Kingdom
Dates: 2004-Jul-25 to 2004-Jul-29
Standard No: ISBN 1-58113-881-4; ACM Order Number: 534042
  1. Keynote
  2. Opening session
  3. Test collections
  4. Formal models-1
  5. XML retrieval
  6. Dimensionality reduction
  7. Formal models-2
  8. Cross-language information retrieval
  9. Language models
  10. Clustering
  11. Text classification
  12. Disambiguation
  13. Recognising and using named entities
  14. Efficiency and scaling
  15. Content-based filtering & collaborative filtering
  16. Image retrieval, users and usability
  17. Keynote
  18. Machine learning for IR
  19. Natural language processing
  20. Web structure
  21. Posters
  22. Doctoral consortium


Challenges in using lifetime personal information stores BIBAFull-Text 1
  Gordon Bell; Jim Gemmell; Roger Lueder
Within five years, our personal computers with terabyte disk drives will be able to store everything we read, write, and hear, and many of the images we see, including video. Vannevar Bush outlined such a system in his famous 1945 Memex article [1]. For the last four years we have worked on MyLifeBits (http://www.MyLifeBits.com), a system to digitally store everything from one's life, including books, articles, personal financial records, memorabilia, email, written correspondence, photos (with time and location taken), telephone calls, video, television programs, and web pages visited. We recently added content from personal devices that automatically record photos and audio.
   The project started with the capture of Bell's content [2], followed by an effort to explore the use of a SQL database for storage and retrieval. Work has continued along these lines to extend content capture to every useful source, e.g. a meeting capture system. The second phase of the project includes the design of tools and links for annotation, collections, cluster analysis, facets for characterizing the content, creation of timelines and stories, and other inherent database-related capabilities, e.g. the ability to pivot on an event, photo, or person to retrieve linked information [3]. Ideally we would like a system that would read every document, extract meta-data (e.g. Dublin Core), and classify it using multiple ontologies, faceted classifications, or the like.
   While such a system has implications for future computing devices and their users, these systems will only exist if we can effectively utilize the vast personal stores. Although our system is exploratory, the Stuff I've Seen system [4] demonstrates the utility and necessity of easy search and access to one's own data. Other research efforts with similar goals relating to personal information include Haystack [5], LifeStreams [6], and the UK "Memories for Life" Grand Challenge.
   There are serious research issues beyond the problem of making the information useful through rapid and easy retrieval.
   The "Dear Appy" problem ("Dear Appy, My application, or platform, or media left me unreadable. Signed, Lost Data") is unsettling to archivists and computer professionals -- and must be solved.
   Just navigating the stored life of an individual would at first glance appear to take almost a lifetime to sift through. While we are making progress in capturing less traditionally archived content (e.g. meetings, phone calls, and video), automatic interpretation and indexing of voice remain elusive. MyLifeBits is currently focused on retrieval, including the (hopefully automatic) addition of meta-data, e.g. document type identification and higher-level knowledge. While such data is essential for the archivist, it is unclear how useful such meta-data is for access to one's own information; without such higher-level knowledge and concepts, the vast amount of raw bits may be completely unusable.
   The most cited problem of personal archives is control of the content, including personal security, together with joint ownership of content by other individuals and organizations. In many corporations, periodic expunging of documents is standard. Similarly, the aspects of a person's life not available in public documents are owned by the organization, and all documents may have to be tagged in such a way that they can be expunged, if necessary, when an individual is no longer part of the organization. The HIPAA law in the US and even more stringent privacy laws in other countries have major implications for personal stores.

Opening session

Evaluating high accuracy retrieval techniques BIBAFull-Text 2-9
  Chirag Shah; W. Bruce Croft
Although information retrieval research has always been concerned with improving the effectiveness of search, in some applications, such as information analysis, a more specific requirement exists for high accuracy retrieval. This means that achieving high precision in the top document ranks is paramount. In this paper we present work aimed at achieving high accuracy in ad-hoc document retrieval by incorporating approaches from question answering (QA). We focus on getting the first relevant result as high as possible in the ranked list and argue that traditional precision and recall are not appropriate measures for evaluating this task. We instead use the mean reciprocal rank (MRR) of the first relevant result. We evaluate three different methods for modifying queries to achieve high accuracy. The experiments done on TREC data provide support for the approach of using MRR and incorporating QA techniques for getting high accuracy in ad-hoc retrieval task.
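As an illustrative aside (our own sketch, not code from the paper), the MRR measure used here takes the reciprocal rank of the first relevant result for each query and averages over queries:

```python
def mean_reciprocal_rank(runs):
    """Mean reciprocal rank of the first relevant result.

    `runs` is a list of rankings, one per query; each ranking is a list
    of booleans marking whether the document at that rank is relevant.
    """
    total = 0.0
    for ranking in runs:
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(runs)

# First relevant hit at rank 2 for query 1 and rank 1 for query 2:
# (1/2 + 1/1) / 2 = 0.75
print(mean_reciprocal_rank([[False, True, False], [True, False]]))
```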
Scaling IR-system evaluation using term relevance sets BIBAFull-Text 10-17
  Einat Amitay; David Carmel; Ronny Lempel; Aya Soffer
This paper describes an evaluation method based on Term Relevance Sets (Trels) that measures an IR system's quality by examining the content of the retrieved results rather than by looking for pre-specified relevant pages. Trels consist of a list of terms believed to be relevant for a particular query as well as a list of irrelevant terms. The proposed method does not involve any document relevance judgments, and as such is not adversely affected by changes to the underlying collection. Therefore, it can better scale to very large, dynamic collections such as the Web. Moreover, this method can evaluate a system's effectiveness on an updatable "live" collection, or on collections derived from different data sources. Our experiments show that the proposed method correlates very highly with official TREC measures.
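A minimal sketch of the idea (ours; the paper's actual scoring formula may differ): judge each retrieved result purely from its own text, rewarding terms believed relevant to the query and penalizing terms believed irrelevant, with no per-document relevance judgments.

```python
def trels_score(result_text, relevant_terms, irrelevant_terms):
    """Score one retrieved result by its content alone: count distinct
    relevant terms it contains, minus distinct irrelevant terms."""
    words = set(result_text.lower().split())
    hits = len(words & {t.lower() for t in relevant_terms})
    misses = len(words & {t.lower() for t in irrelevant_terms})
    return hits - misses

# For a query about the animal "jaguar": a wildlife page scores 2,
# while a car-dealership page would score -2.
score = trels_score("jaguar big cat habitat", ["cat", "habitat"], ["car", "dealer"])
```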
Using temporal profiles of queries for precision prediction BIBAFull-Text 18-24
  Fernando Diaz; Rosie Jones
A key missing component in information retrieval systems is self-diagnostic tests to establish whether the system can provide reasonable results for a given query on a document collection. If we can measure properties of a retrieved set of documents which allow us to predict average precision, we can automate the decision of whether to elicit relevance feedback, or modify the retrieval system in other ways. We use meta-data attached to documents in the form of time stamps to measure the distribution of documents retrieved in response to a query, over the time domain, to create a temporal profile for a query. We define some useful features over this temporal profile. We find that using these temporal features, together with the content of the documents retrieved, we can improve the prediction of average precision for a query.

Test collections

Retrieval evaluation with incomplete information BIBAFull-Text 25-32
  Chris Buckley; Ellen M. Voorhees
This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures are not robust to substantially incomplete relevance judgments. A new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets. This finding suggests that substantially larger or dynamic test collections built using current pooling practices should be viable laboratory tools, despite the fact that the relevance information will be incomplete and imperfect.
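The robust measure introduced in this paper is bpref, which for each retrieved relevant document penalizes the judged non-relevant documents ranked above it, and simply ignores unjudged documents. A simplified sketch (ours; the official definition normalizes by min(R, N) rather than R):

```python
def bpref(ranking, num_rel):
    """Simplified bpref. `ranking` holds 'R' (judged relevant),
    'N' (judged non-relevant), or 'U' (unjudged); unjudged documents
    contribute nothing, which is what makes the measure robust to
    incomplete relevance judgments."""
    score = 0.0
    nonrel_above = 0
    for judgment in ranking:
        if judgment == 'N':
            nonrel_above += 1
        elif judgment == 'R':
            score += 1.0 - min(nonrel_above, num_rel) / num_rel
    return score / num_rel

# All relevant documents ranked first; the unjudged document is ignored.
print(bpref(['R', 'U', 'R', 'N', 'N'], num_rel=2))  # 1.0
```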
Forming test collections with no system pooling BIBAFull-Text 33-40
  Mark Sanderson; Hideo Joho
Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test collections where no system pooling is used. First, a collection formation technique combining manual feedback and multiple systems is adapted to work with a single retrieval system. Second, an existing method based on pooling the output of multiple manual searches is re-examined, testing a wider range of searchers and retrieval systems than has been examined before. Third, a new approach is explored where the ranked output of a single automatic search on a single retrieval system is assessed for relevance: no pooling whatsoever. Using established techniques for evaluating the quality of relevance judgments, in all three cases, test collections are formed that are as good as those of TREC.
Building an information retrieval test collection for spontaneous conversational speech BIBAFull-Text 41-48
  Douglas W. Oard; Dagobert Soergel; David Doermann; Xiaoli Huang; G. Craig Murray; Jianqiang Wang; Bhuvana Ramabhadran; Martin Franz; Samuel Gustman; James Mayfield; Liliya Kharevych; Stephanie Strassel
Test collections model use cases in ways that facilitate evaluation of information retrieval systems. This paper describes the use of search-guided relevance assessment to create a test collection for retrieval of spontaneous conversational speech. Approximately 10,000 thematically coherent segments were manually identified in 625 hours of oral history interviews with 246 individuals. Automatic speech recognition results, manually prepared summaries, controlled vocabulary indexing, and name authority control are available for every segment. Those features were leveraged by a team of four relevance assessors to identify topically relevant segments for 28 topics developed from actual user requests. Search-guided assessment yielded sufficient inter-annotator agreement to support formative evaluation during system development. Baseline results for ranked retrieval are presented to illustrate use of the collection.

Formal models-1

A formal study of information retrieval heuristics BIBAFull-Text 49-56
  Hui Fang; Tao Tao; ChengXiang Zhai
Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. One basic research question is thus what exactly are these "necessary" heuristics that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, its performance tends to be poor when the parameter is out of the range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.
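One constraint of this kind (our paraphrase of a TF constraint: with all else fixed, a higher query-term frequency should never lower a document's score) can be checked mechanically against a BM25-style weighting; a sketch with parameter values of our own choosing:

```python
import math

def bm25_weight(tf, df, num_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25-style term weight, used only to probe the constraint below."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# With document length and collection statistics held fixed, the weight
# must be monotonically increasing in term frequency.
scores = [bm25_weight(tf, df=10, num_docs=1000, doc_len=100, avg_len=100)
          for tf in range(1, 6)]
assert all(x < y for x, y in zip(scores, scores[1:]))
```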
Probabilistic model for contextual retrieval BIBAFull-Text 57-63
  Ji-Rong Wen; Ni Lao; Wei-Ying Ma
Contextual retrieval is a critical technique for facilitating many important applications such as mobile search, personalized search, PC troubleshooting, etc. Despite its importance, there is no comprehensive retrieval model to describe the contextual retrieval process. We observe that incompatible context, noisy context, and incomplete queries are important issues common to contextual retrieval applications, yet these issues have not been previously explored and discussed. In this paper, we propose probabilistic models to address these problems. Our study clearly shows that query logs are the key to building effective contextual retrieval models. We also conduct a case study in the PC troubleshooting domain to test the performance of the proposed models, and experimental results show that the models can achieve very good retrieval precision.
Discriminative models for information retrieval BIBAFull-Text 64-71
  Ramesh Nallapati
Discriminative models have been preferred over generative models in many machine learning problems in the recent past owing to some of their attractive theoretical properties. In this paper, we explore the applicability of discriminative classifiers for IR. We have compared the performance of two popular discriminative models, namely the maximum entropy model and support vector machines with that of language modeling, the state-of-the-art generative model for IR. Our experiments on ad-hoc retrieval indicate that although maximum entropy is significantly worse than language models, support vector machines are on par with language models. We argue that the main reason to prefer SVMs over language models is their ability to learn arbitrary features automatically as demonstrated by our experiments on the home-page finding task of TREC-10.

XML retrieval

The overlap problem in content-oriented XML retrieval evaluation BIBAFull-Text 72-79
  Gabriella Kazai; Mounia Lalmas; Arjen P. de Vries
Within the INitiative for the Evaluation of XML Retrieval (INEX) a number of metrics to evaluate the effectiveness of content-oriented XML retrieval approaches were developed. Although these metrics provide a solution towards addressing the problem of overlapping result elements, they do not consider the problem of overlapping reference components within the recall-base, thus leading to skewed effectiveness scores. We propose alternative metrics that aim to provide a solution to both overlap issues.
Length normalization in XML retrieval BIBAFull-Text 80-87
  Jaap Kamps; Maarten de Rijke; Borkur Sigurbjornsson
XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
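A hedged sketch (ours, with invented smoothing and parameter values) of how an explicit length prior enters a language-model score for an element: score = beta * log length(e) + sum of log P(q_i|e), where a larger beta gives the stronger ("extreme") length bias the paper argues for:

```python
import math

def element_score(query_terms, elem_tf, elem_len, coll_prob, lam=0.8, beta=1.0):
    """Language-model score for one XML element with a length prior
    P(e) proportional to length(e)**beta, applied in log space."""
    score = beta * math.log(elem_len)  # length prior
    for t in query_terms:
        # Jelinek-Mercer smoothing of the element model with the collection
        p = lam * elem_tf.get(t, 0) / elem_len + (1 - lam) * coll_prob.get(t, 1e-6)
        score += math.log(p)
    return score

# Same term density, ten times the length: the prior favors the longer element.
long_e = element_score(["xml"], {"xml": 10}, 100, {"xml": 0.01})
short_e = element_score(["xml"], {"xml": 1}, 10, {"xml": 0.01})
```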
Configurable indexing and ranking for XML information retrieval BIBAFull-Text 88-95
  Shaorong Liu; Qinghua Zou; Wesley W. Chu
Indexing and ranking are two key factors for efficient and effective XML information retrieval. Inappropriate indexing may result in false negatives and false positives, and improper ranking may lead to low precisions. In this paper, we propose a configurable XML information retrieval system, in which users can configure appropriate index types for XML tags and text contents. Based on users' index configurations, the system transforms XML structures into a compact tree representation, Ctree, and indexes XML text contents. To support XML ranking, we propose the concepts of "weighted term frequency" and "inverted element frequency," where the weight of a term depends on its frequency and location within an XML element as well as its popularity among similar elements in an XML dataset. We evaluate the effectiveness of our system through extensive experiments on the INEX 03 dataset and 30 content and structure (CAS) topics. The experimental results reveal that our system has significantly high precision at low recall regions and achieves the highest average precision (0.3309) as compared with 38 official INEX 03 submissions using the strict evaluation metric.

Dimensionality reduction

Locality preserving indexing for document representation BIBAFull-Text 96-103
  Xiaofei He; Deng Cai; Haifeng Liu; Wei-Ying Ma
Document representation and indexing is a key problem for document analysis and processing, such as clustering, classification and retrieval. Conventionally, Latent Semantic Indexing (LSI) is considered effective in deriving such an indexing. LSI essentially detects the most representative features for document representation rather than the most discriminative features. Therefore, LSI might not be optimal in discriminating documents with different semantics. In this paper, a novel algorithm called Locality Preserving Indexing (LPI) is proposed for document indexing. Each document is represented by a vector with low dimensionality. In contrast to LSI which discovers the global structure of the document space, LPI discovers the local structure and obtains a compact document representation subspace that best detects the essential semantic structure. We compare the proposed LPI approach with LSI on two standard databases. Experimental results show that LPI provides better representation in the sense of semantic structure.
Polynomial filtering in latent semantic indexing for information retrieval BIBAFull-Text 104-111
  E. Kokiopoulou; Y. Saad
Latent Semantic Indexing (LSI) is a well established and effective framework for conceptual information retrieval. In traditional implementations of LSI the semantic structure of the collection is projected into the k-dimensional space derived from a rank-k approximation of the original term-by-document matrix. This paper discusses a new way to implement the LSI methodology, based on polynomial filtering. The new framework does not rely on any matrix decomposition and therefore its computational cost and storage requirements are low relative to traditional implementations of LSI. Additionally, it can be used as an effective information filtering technique when updating LSI models based on user feedback.
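For orientation, the traditional rank-k LSI implementation that this paper's polynomial-filtering framework avoids can be sketched as follows (our sketch, via an explicit SVD):

```python
import numpy as np

def lsi_scores(term_doc, query, k=2):
    """Classical rank-k LSI: project documents and the query into the
    k-dimensional latent space from a truncated SVD of the term-by-
    document matrix, then rank documents by cosine similarity there."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rank-k factors
    docs_k = Vk * sk                            # documents in latent space
    q_k = query @ Uk                            # fold the query in
    sims = docs_k @ q_k
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return sims / (norms + 1e-12)

# Documents 0 and 1 share the query's terms; document 2 is orthogonal.
A = np.array([[1., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
scores = lsi_scores(A, np.array([1., 1., 0.]), k=2)
```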
On scaling latent semantic indexing for large peer-to-peer systems BIBAFull-Text 112-121
  Chunqiang Tang; Sandhya Dwarkadas; Zhichen Xu
The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this direction is pSearch [32, 33]. pSearch places documents onto a peer-to-peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.
   This paper addresses the above limitations of LSI and makes the following contributions. (1) To reduce the cost of SVD, we reduce the size of its input matrix through document clustering and term selection. Our method retains the retrieval quality of LSI but is several orders of magnitude more efficient. (2) Through extensive experimentation, we found that proper normalization of semantic vectors for terms and documents improves recall by 76%. (3) To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and document selection.

Formal models-2

GaP: a factor model for discrete data BIBAFull-Text 122-129
  John Canny
We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called "GaP" for Gamma-Poisson, the distributions of its first and last random variables. GaP is a factor model; that is, it gives an approximate factorization of the document-term matrix into a product of matrices A and X. These factors have strictly non-negative terms. GaP is a generative probabilistic model that assigns finite probabilities to documents in a corpus. It can be computed with an efficient and simple EM recurrence. For a suitable choice of parameters, the GaP factorization maximizes independence between the factors, so it can be used as an independent-component algorithm adapted to document data. The form of the GaP model is empirically as well as analytically motivated. It gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme.
Belief revision for adaptive information retrieval BIBAFull-Text 130-137
  Raymond Y. K. Lau; Peter D. Bruza; Dawei Song
Applying belief revision logic to model adaptive information retrieval is appealing since it provides a rigorous theoretical foundation for modelling the partiality and uncertainty inherent in any information retrieval (IR) process. In particular, a retrieval context can be formalised as a belief set, and the formalised context is used to disambiguate vague user queries. Belief revision logic also provides a robust computational mechanism to revise an IR system's beliefs about the users' changing information needs. In addition, information flow is proposed as a text mining method to automatically acquire the initial IR contexts. The advantage of a belief-based IR system is that its IR behaviour is more predictable and explanatory. However, computational efficiency is often a concern when belief revision formalisms are applied to large real-life applications. This paper describes our belief-based adaptive IR system, which is underpinned by an efficient belief revision mechanism. Our initial experiments show that the belief-based symbolic IR model is more effective than a classical quantitative IR model. To the best of our knowledge, this is the first successful empirical evaluation of a logic-based IR model on large IR benchmark collections.
Tuning before feedback: combining ranking discovery and blind feedback for robust retrieval BIBAFull-Text 138-145
  Weiguo Fan; Ming Luo; Li Wang; Wensi Xi; Edward A. Fox
Both ranking functions and user queries are very important factors affecting a search engine's performance. Prior research has looked at how to improve ad-hoc retrieval performance for existing queries by tuning the ranking function, or at how to modify and expand user queries under a fixed ranking scheme using blind feedback. However, almost no research has looked at combining ranking function tuning and blind feedback to improve ad-hoc retrieval performance. In this paper, we look at the performance improvement for ad-hoc retrieval from a more integrated point of view by combining the merits of both techniques. In particular, we argue that the ranking function should be tuned first, using user-provided queries, before applying the blind feedback technique. The intuition is that a highly-tuned ranking function puts more high-quality documents at the top of the hit list and thus offers a stronger baseline for blind feedback. We verify this integrated model on a large-scale heterogeneous collection, and the experimental results show that combining ranking function tuning and blind feedback can improve search performance by almost 30% over the baseline Okapi system.

Cross-language information retrieval

Translating unknown queries with web corpora for cross-language information retrieval BIBAFull-Text 146-153
  Pu-Jen Cheng; Jei-Wen Teng; Ruei-Cheng Chen; Jenq-Haur Wang; Wen-Hsiang Lu; Lee-Feng Chien
It is crucial for cross-language information retrieval (CLIR) systems to deal with the translation of unknown queries, because real queries are often short. The purpose of this paper is to investigate the feasibility of exploiting the Web as the corpus source to translate unknown queries for CLIR. We propose an online translation approach that determines effective translations for unknown query terms by mining bilingual search-result pages obtained from Web search engines. This approach can alleviate the problem of the lack of large bilingual corpora, translate many unknown query terms, provide flexible query specifications, and extract semantically-close translations to benefit CLIR tasks -- especially cross-language Web search.
Resource selection for domain-specific cross-lingual IR BIBAFull-Text 154-161
  Monica Rogati; Yiming Yang
An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods -- with different training corpora -- on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN), versus a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We start exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. By using cosine similarity between training and target corpora as resource weights we obtained an average of 5.6% improvement over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
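The resource-weighting criterion, cosine similarity between a training resource and the target corpus, can be sketched as follows (our simplification: raw term-frequency vectors with no further term weighting):

```python
import math
from collections import Counter

def corpus_cosine(training_texts, target_texts):
    """Cosine similarity between the term-frequency vectors of a
    training resource and a target corpus; usable as the resource's
    weight when combining translation resources."""
    a = Counter(w for t in training_texts for w in t.lower().split())
    b = Counter(w for t in target_texts for w in t.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two of three terms shared, all counts 1: similarity 2/3.
weight = corpus_cosine(["medical patient dosage"], ["patient dosage trial"])
```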
Using the web for automated translation extraction in cross-language information retrieval BIBAFull-Text 162-169
  Ying Zhang; Phil Vines
There have been significant advances in Cross-Language Information Retrieval (CLIR) in recent years. One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out of vocabulary (OOV) terms. Previous work has either relied on manual intervention or has only been partially successful in solving this problem. We use a method that extends earlier work in this area by augmenting this with statistical analysis, and corpus-based translation disambiguation to dynamically discover translations of OOV terms. The method can be applied to both Chinese-English and English-Chinese CLIR, correctly extracting translations of OOV terms from the Web automatically, and thus is a significant improvement on earlier work.

Language models

Dependence language model for information retrieval BIBAFull-Text 170-177
  Jianfeng Gao; Jian-Yun Nie; Guangyuan Wu; Guihong Cao
This paper presents a new dependence language modeling approach to information retrieval. The approach extends the basic language modeling approach based on unigram by relaxing the independence assumption. We integrate the linkage of a query as a hidden variable, which expresses the term dependencies within the query as an acyclic, planar, undirected graph. We then assume that a query is generated from a document in two stages: the linkage is generated first, and then each term is generated in turn depending on other related terms according to the linkage. We also present a smoothing method for model parameter estimation and an approach to learning the linkage of a sentence in an unsupervised manner. The new approach is compared to the classical probabilistic retrieval model and the previously proposed language models with and without taking into account term dependencies. Results show that our model achieves substantial and significant improvements on TREC collections.
Parsimonious language models for information retrieval BIBAFull-Text 178-185
  Djoerd Hiemstra; Stephen Robertson; Hugo Zaragoza
We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, they need fewer (non-zero) parameters to describe the data. We apply parsimonious models at three stages of the retrieval process: 1) at indexing time; 2) at search time; 3) at feedback time. Experimental results show that we are able to build models that are significantly smaller than standard models, but that still perform at least as well as the standard approaches.
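A toy sketch of the E/M steps (ours; the mixing weight, iteration count, and pruning threshold are invented): terms that the background model already explains well have their document-model probability driven toward zero and are pruned, leaving a smaller model.

```python
def parsimonious_lm(doc_tf, corpus_prob, lam=0.5, iters=50, threshold=0.1):
    """EM for a parsimonious document model P(t|D) against a fixed
    background model P(t|C). The M-step drops terms whose probability
    falls below `threshold`, shrinking the model."""
    total = sum(doc_tf.values())
    p_doc = {t: tf / total for t, tf in doc_tf.items()}  # MLE start
    for _ in range(iters):
        # E-step: expected term counts credited to the document model
        e = {}
        for t, tf in doc_tf.items():
            pd = p_doc.get(t, 0.0)
            denom = lam * pd + (1 - lam) * corpus_prob.get(t, 0.0)
            e[t] = tf * lam * pd / denom if denom > 0 else 0.0
        # M-step: renormalize and prune negligible terms
        z = sum(e.values())
        p_doc = {t: v / z for t, v in e.items() if v / z > threshold}
    return p_doc

# "the" is well explained by the background model, so it gets pruned.
model = parsimonious_lm({"the": 10, "xml": 10}, {"the": 0.9, "xml": 0.001})
```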
Cluster-based retrieval using language models BIBAFull-Text 186-193
  Xiaoyong Liu; W. Bruce Croft
Previous research on cluster-based retrieval has been inconclusive as to whether it brings improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.
Corpus structure, language models, and ad hoc information retrieval BIBAFull-Text 194-201
  Oren Kurland; Lillian Lee
Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.
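The interpolation idea, mixing a document's own language model with that of a cluster of similar documents and with the collection model, can be sketched as follows (our weights and smoothing, not the paper's exact estimates):

```python
import math

def interpolated_score(query_terms, doc_tf, doc_len, cluster_prob, coll_prob,
                       lam=0.7, mu=0.15):
    """Query log-likelihood under an interpolated model:
    P(t) = lam*P(t|d) + mu*P(t|cluster) + (1-lam-mu)*P(t|collection)."""
    score = 0.0
    for t in query_terms:
        p = (lam * doc_tf.get(t, 0) / doc_len
             + mu * cluster_prob.get(t, 0.0)
             + (1 - lam - mu) * coll_prob.get(t, 1e-6))
        score += math.log(p)
    return score

# A document that never mentions "retrieval" still benefits from sitting
# in a cluster whose model assigns the term real probability mass.
with_cluster = interpolated_score(["retrieval"], {}, 100, {"retrieval": 0.05},
                                  {"retrieval": 0.001})
without = interpolated_score(["retrieval"], {}, 100, {}, {"retrieval": 0.001})
```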

Clustering

Document clustering by concept factorization BIBAFull-Text 202-209
  Wei Xu; Yihong Gong
In this paper, we propose a new data clustering method called concept factorization that models each concept as a linear combination of the data points, and each data point as a linear combination of the concepts. With this model, the data clustering task is accomplished by computing the two sets of linear coefficients, and this computation is carried out by finding the non-negative solution that minimizes the reconstruction error of the data points. The cluster label of each data point can be easily derived from the obtained linear coefficients. This method differs from clustering based on non-negative matrix factorization (NMF) [Xu03] in that it can be applied to data containing negative values and can be implemented in the kernel space. Our experimental results show that the proposed data clustering method and its variations perform best among the 11 algorithms and their variations that we have evaluated on both the TDT2 and Reuters-21578 corpora. In addition to its good performance, the new method also has the merit of easy and reliable derivation of the clustering results.
Learning to cluster web search results BIBAFull-Text 210-217
  Hua-Jun Zeng; Qi-Cai He; Zheng Chen; Wei-Ying Ma; Jinwen Ma
Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques are inadequate since they don't generate clusters with highly readable names. In this paper, we reformalize the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of documents (typically a list of titles and snippets) returned by a certain Web search engine, our method first extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human labeled training data. The documents are assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters. Experimental results verify our method's feasibility and effectiveness.
Document clustering via adaptive subspace iteration BIBAFull-Text 218-225
  Tao Li; Sheng Ma; Mitsunori Ogihara
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI, which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by this optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
Restrictive clustering and metaclustering for self-organizing document collections BIBAFull-Text 226-233
  Stefan Siersdorfer; Sergej Sizov
This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These techniques result in higher cluster purity, better overall accuracy, and make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.

Text classification

Feature selection using linear classifier weights: interaction with classification models BIBAFull-Text 234-241
  Dunja Mladenic; Janez Brank; Marko Grobelnik; Natasa Milic-Frayling
This paper explores feature scoring and selection based on weights from linear classification models. It investigates how these methods combine with various learning models. Our comparative analysis includes three learning algorithms: Naive Bayes, Perceptron, and Support Vector Machines (SVM) in combination with three feature weighting methods: Odds Ratio, Information Gain, and weights from linear models, the linear SVM and Perceptron. Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms. The results support the conjecture that it is the sophistication of the feature weighting method rather than its apparent compatibility with the learning algorithm that improves classification performance.
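Feature selection from linear-model weights, as studied here, reduces to ranking features by the magnitude of a trained weight vector. A minimal sketch with a simple perceptron (illustrative; the paper also uses linear SVM weights, and the data below is hypothetical):

```python
import numpy as np

def perceptron_weights(X, y, epochs=20):
    """Train a basic perceptron (labels in {-1, +1}) and return its weights."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:  # misclassified: nudge toward the example
                w += yi * xi
    return w

def select_features(w, k):
    """Keep the k features with the largest absolute linear-model weight."""
    return np.argsort(-np.abs(w))[:k]
```

After training, features with near-zero weight contribute little to the decision and can be dropped, which is the scoring principle the paper evaluates across learning algorithms.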
Web-page classification through summarization BIBAFull-Text 242-249
  Dou Shen; Zheng Chen; Qiang Yang; Hua-Jun Zeng; Benyu Zhang; Yuchang Lu; Wei-Ying Ma
Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement compared to a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12.9% improvement over pure-text-based methods.
Parameterized generation of labeled datasets for text categorization based on a hierarchical directory BIBAFull-Text 250-257
  Dmitry Davidov; Evgeniy Gabrilovich; Shaul Markovitch
Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to users' requirements. A large collection of automatically generated datasets is made available for other researchers to use.

Disambiguation

Information retrieval using word senses: root sense tagging approach BIBAFull-Text 258-265
  Sang-Bum Kim; Hee-Cheol Seo; Hae-Chang Rim
Information retrieval using word senses is emerging as a promising research direction in semantic information retrieval. In this paper, we propose a new method for using word senses in information retrieval: the root sense tagging method. This method assigns coarse-grained word senses defined in WordNet to query terms and document terms in an unsupervised way, using automatically constructed co-occurrence information. Our sense tagger is crude, but it performs consistent disambiguation by considering only the single most informative word as evidence to disambiguate the target word. We also allow multiple-sense assignment to alleviate the problem caused by incorrect disambiguation.
   Experimental results on a large-scale TREC collection show that our approach successfully improves retrieval effectiveness, while most previous work failed to improve performance even on small text collections. Our method also shows promising results when combined with pseudo relevance feedback and a state-of-the-art retrieval function such as BM25.
An effective approach to document retrieval via utilizing WordNet and recognizing phrases BIBAFull-Text 266-272
  Shuang Liu; Fang Liu; Clement Yu; Weiyi Meng
Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document has a phrase if all content words in the phrase are within a window of a certain size. The window sizes for different types of phrases are different and are determined using a decision tree. Phrases are more important than individual terms. Consequently, documents in response to a query are ranked with matching phrases given a higher priority. We utilize WordNet to disambiguate word senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. Experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
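The windowed phrase test described in this abstract can be sketched directly. The window sizes below are illustrative; the paper determines them per phrase type with a decision tree:

```python
def has_phrase(doc_tokens, phrase_words, window):
    """True if every content word of the phrase occurs inside some run of
    `window` consecutive document tokens (word order inside the window
    does not matter)."""
    phrase = set(phrase_words)
    for i in range(len(doc_tokens)):
        if set(doc_tokens[i:i + window]) >= phrase:
            return True
    return False
```

Tight windows approximate dictionary phrases and proper names; looser windows admit the scattered co-occurrence typical of complex phrases.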
Web-a-where: geotagging web content BIBAFull-Text 273-280
  Einat Amitay; Nadav Har'El; Ron Sivan; Aya Soffer
We describe Web-a-Where, a system for associating geography with Web pages. Web-a-Where locates mentions of places and determines the place each name refers to. In addition, it assigns to each page a geographic focus -- a locality that the page discusses as a whole. The tagging process is simple and fast, aimed to be applied to large collections of Web pages and to facilitate a variety of location-based applications and data analyses.
   Geotagging involves arbitrating two types of ambiguities: geo/non-geo and geo/geo. A geo/non-geo ambiguity occurs when a place name also has a non-geographic meaning, such as a person name (e.g., Berlin) or a common word (Turkey). Geo/geo ambiguity arises when distinct places have the same name, as in London, England vs. London, Ontario.
   An implementation of the tagger within the framework of the WebFountain data mining system is described, and evaluated on several corpora of real Web pages. Precision of up to 82% on individual geotags is achieved. We also evaluate the relative contribution of various heuristics the tagger employs, and evaluate the focus-finding algorithm using a corpus pretagged with localities, showing that as many as 91% of the foci reported are correct up to the country level.

Recognising and using named entities

Focused named entity recognition using machine learning BIBAFull-Text 281-288
  Li Zhang; Yue Pan; Tong Zhang
In this paper we study the problem of finding most topical named entities among all entities in a document, which we refer to as focused named entity recognition. We show that these focused named entities are useful for many natural language processing applications, such as document summarization, search result ranking, and entity detection and tracking. We propose a statistical model for focused named entity recognition by converting it into a classification problem. We then study the impact of various linguistic features and compare a number of classification algorithms. From experiments on an annotated Chinese news corpus, we demonstrate that the proposed method can achieve near human-level accuracy.
Learning phonetic similarity for matching named entity translations and mining new translations BIBAFull-Text 289-296
  Wai Lam; Ruizhang Huang; Pik-Shan Cheung
We propose a novel named entity matching model which considers both semantic and phonetic clues. The matching is formulated as an optimization problem. One major component is a phonetic matching model which exploits similarity at the phoneme level. We investigate three learning algorithms for obtaining the similarity information of basic phoneme units based on training examples. By applying this proposed named entity matching model, we also develop a mining framework for discovering new, unseen named entity translations from online daily Web news. This framework harvests comparable news in different languages using an existing bilingual dictionary. It is able to discover new name translations not found in the dictionary.
Text classification and named entities for new event detection BIBAFull-Text 297-304
  Giridhar Kumaran; James Allan
New Event Detection is a challenging task that still offers scope for great improvement after years of effort. In this paper we show how performance on New Event Detection (NED) can be improved by the use of text classification techniques as well as by using named entities in a new way. We explore modifications to the document representation in a vector space-based NED system. We also show that addressing named entities preferentially is useful only in certain situations. A combination of all the above results in a multi-stage NED system that performs much better than baseline single-stage NED systems.

Efficiency and scaling

Assigning identifiers to documents to enhance the clustering property of fulltext indexes BIBAFull-Text 305-312
  Fabrizio Silvestri; Salvatore Orlando; Raffaele Perego
Web Search Engines provide a large-scale text document retrieval service by processing huge Inverted File indexes. Inverted File indexes allow fast query resolution and good memory utilization since their d-gaps representation can be effectively and efficiently compressed by using variable-length encoding methods. This paper proposes and evaluates several algorithms aimed at finding an assignment of document identifiers that minimizes the average value of the d-gaps, thus enhancing the effectiveness of traditional compression methods. We ran several tests over the Google contest collection in order to validate the proposed techniques. The experiments demonstrated the scalability and effectiveness of our algorithms. Using the proposed algorithms, we were able to improve the compression ratios of several encoding schemes by up to 20.81%.
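The d-gap compression this work builds on can be illustrated with a standard variable-byte scheme (a generic sketch of the representation being optimized, not the authors' reassignment algorithms):

```python
def vbyte_encode(n):
    """Variable-byte encode one positive integer: 7 data bits per byte,
    high bit set on the final byte."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out.reverse()
    out[-1] |= 0x80  # mark the last byte
    return bytes(out)

def encode_postings(doc_ids):
    """Compress a sorted posting list by encoding the gaps between ids."""
    prev, out = 0, b""
    for d in doc_ids:
        out += vbyte_encode(d - prev)
        prev = d
    return out

def decode_postings(data):
    """Recover the original sorted doc ids from the encoded gaps."""
    ids, n, prev = [], 0, 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:  # final byte of this gap
            prev += n
            ids.append(prev)
            n = 0
    return ids
```

Small gaps fit in one byte, so any identifier assignment that clusters the postings of a term into nearby ids shrinks the encoded list, which is exactly the property the paper's reassignment algorithms optimize.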
Filtering algorithms for information retrieval models with named attributes and proximity operators BIBAFull-Text 313-320
  Christos Tryfonopoulos; Manolis Koubarakis; Yannis Drougas
In the selective dissemination of information (or publish/subscribe) paradigm, clients subscribe to a server with continuous queries (or profiles) that express their information needs. Clients can also publish documents to servers. Whenever a document is published, the continuous queries satisfying this document are found and notifications are sent to appropriate clients. This paper deals with the filtering problem that needs to be solved efficiently by each server: given a database of continuous queries db and a document d, find all queries q ∈ db that match d. We present data structures and indexing algorithms that enable us to solve the filtering problem efficiently for large databases of queries expressed in the model AWP, which is based on named attributes with values of type text and word proximity operators.
Hourly analysis of a very large topically categorized web query log BIBAFull-Text 321-328
  Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David Grossman; Ophir Frieder
We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours of the day. We examine query traffic on an hourly basis by matching it against lists of queries that have been topically pre-categorized by human editors. This represents 13% of the query traffic. We show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. This analysis provides valuable insight for improving retrieval effectiveness and efficiency. It is also relevant to the development of enhanced query disambiguation, routing, and caching algorithms.

Content-based filtering & collaborative filtering

A collaborative filtering algorithm and evaluation metric that accurately model the user experience BIBAFull-Text 329-336
  Matthew R. McLaughlin; Jonathan L. Herlocker
Collaborative Filtering (CF) systems have been researched for over a decade as a tool to deal with information overload. At the heart of these systems are the algorithms which generate the predictions and recommendations.
   In this article we empirically demonstrate that two of the most acclaimed CF recommendation algorithms have flaws that result in a dramatically unacceptable user experience.
   In response, we introduce a new Belief Distribution Algorithm that overcomes these flaws and provides substantially richer user modeling. The Belief Distribution Algorithm retains the qualities of nearest-neighbor algorithms which have performed well in the past, yet produces predictions of belief distributions across rating values rather than a point rating value.
   In addition, we illustrate how the exclusive use of the mean absolute error metric has concealed these flaws for so long, and we propose the use of a modified Precision metric for more accurately evaluating the user experience.
An automatic weighting scheme for collaborative filtering BIBAFull-Text 337-344
  Rong Jin; Joyce Y. Chai; Luo Si
Collaborative filtering identifies the information interest of a particular user based on the information provided by other similar users. The memory-based approaches for collaborative filtering (e.g., the Pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. In these approaches, different items are weighted either equally or by some predefined functions. The impact of rating discrepancies among different users has not been taken into consideration. For example, an item that is highly favored by most users should have a smaller impact on the user-similarity than an item for which different types of users tend to give different ratings. Even though simple weighting methods such as variance weighting try to address this problem, empirical studies have shown that they are ineffective in improving the performance of collaborative filtering. In this paper, we present an optimization algorithm to automatically compute the weights for different items based on their ratings from training users. More specifically, the new weighting scheme creates a clustered distribution for user vectors in the item space by bringing users of similar interests closer together and pushing users of different interests further apart. Empirical studies over two datasets have shown that our new weighting scheme substantially improves the performance of the Pearson correlation coefficient method for collaborative filtering.
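The quantity being tuned here is the per-item weight inside a Pearson-style user similarity. A sketch with hand-supplied weights (the paper learns them from training users; the uniform default below is an assumption):

```python
import math

def pearson(u, v, weights=None):
    """Weighted Pearson correlation between two users over co-rated items.

    `u` and `v` map item -> rating. `weights` (item -> importance) lets
    discriminative items count more; omitted items default to weight 1.
    """
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    w = {i: (weights or {}).get(i, 1.0) for i in common}
    sw = sum(w.values())
    mu = sum(w[i] * u[i] for i in common) / sw
    mv = sum(w[i] * v[i] for i in common) / sw
    cov = sum(w[i] * (u[i] - mu) * (v[i] - mv) for i in common)
    du = math.sqrt(sum(w[i] * (u[i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum(w[i] * (v[i] - mv) ** 2 for i in common))
    if du == 0 or dv == 0:
        return 0.0
    return cov / (du * dv)
```

With all weights equal this is the classical memory-based similarity; learned weights change which neighbors look similar, which is the lever the paper's optimization pulls.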
Using bayesian priors to combine classifiers for adaptive filtering BIBAFull-Text 345-352
  Yi Zhang
An adaptive information filtering system monitors a document stream to identify the documents that match information needs specified by user profiles. As the system filters, it also refines its knowledge about the user's information needs based on long-term observations of the document stream and periodic feedback (training data) from the user. Low variance profile learning algorithms, such as Rocchio, work well at the early stage of filtering when the system has very few training data. Low bias profile learning algorithms, such as Logistic Regression, work well at the later stage of filtering when the system has accumulated enough training data.
   However, an empirical system needs to work well consistently at all stages of the filtering process. This paper addresses this problem by proposing a new technique to combine different text classification algorithms via a constrained maximum likelihood Bayesian prior. This technique provides a trade-off between bias and variance, and the combined classifier can achieve consistently good performance at different stages of filtering. We implemented the proposed technique to combine two complementary classification algorithms: Rocchio and logistic regression. The new algorithm is shown to compare favorably with Rocchio, Logistic Regression, and the best methods in the TREC-9 and TREC-11 adaptive filtering tracks.
A nonparametric hierarchical bayesian framework for information filtering BIBAFull-Text 353-360
  Kai Yu; Volker Tresp; Shipeng Yu
Information filtering has made considerable progress in recent years. The predominant approaches are content-based methods and collaborative methods. Researchers have largely concentrated on either of the two approaches since a principled unifying framework is still lacking. This paper suggests that both approaches can be combined under a hierarchical Bayesian framework. Individual content-based user profiles are generated and collaboration between various user models is achieved via a common learned prior distribution. However, it turns out that a parametric distribution (e.g. Gaussian) is too restrictive to describe such a common learned prior distribution. We thus introduce a nonparametric common prior, which is a sample generated from a Dirichlet process which assumes the role of a hyper prior. We describe effective means to learn this nonparametric distribution, and apply it to learn users' information needs. The resultant algorithm is simple and understandable, and offers a principled solution to combine content-based filtering and collaborative filtering. Within our framework, we are now able to interpret various existing techniques from a unifying point of view. Finally we demonstrate the empirical success of the proposed information filtering methods.

Image retrieval, users and usability

Automatic image annotation by using concept-sensitive salient objects for image content representation BIBAFull-Text 361-368
  Jianping Fan; Yuli Gao; Hangzai Luo; Guangyou Xu
Multi-level annotation of images is a promising solution to enable more effective semantic image retrieval by using various keywords at different semantic levels. In this paper, we propose a multi-level approach to annotate the semantics of natural scenes by using both the dominant image components and the relevant semantic concepts. In contrast to the well-known image-based and region-based approaches, we use the salient objects as the dominant image components to achieve automatic image annotation at the content level. By using the salient objects for image content representation, a novel image classification technique is developed to achieve automatic image annotation at the concept level. To detect the salient objects automatically, a set of detection functions are learned from the labeled image regions by using Support Vector Machine (SVM) classifiers with an automatic scheme for searching the optimal model parameters. To generate the semantic concepts, finite mixture models are used to approximate the class distributions of the relevant salient objects. An adaptive EM algorithm has been proposed to determine the optimal model structure and model parameters simultaneously. We have also demonstrated that our algorithms are very effective in enabling multi-level annotation of natural scenes in a large-scale dataset.
A search engine for historical manuscript images BIBAFull-Text 369-376
  Toni M. Rath; R. Manmatha; Victor Lavrenko
Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.
Display time as implicit feedback: understanding task effects BIBAFull-Text 377-384
  Diane Kelly; Nicholas J. Belkin
Recent research has had some success using the length of time a user displays a document in their web browser as implicit feedback for document preference. However, most studies have been confined to specific search domains, such as news, and have not considered the effects of task on display time, and the potential impact of this relationship on the effectiveness of display time as implicit feedback. We describe the results of an intensive naturalistic study of the online information-seeking behaviors of seven subjects during a fourteen-week period. Throughout the study, subjects' online information-seeking activities were monitored with various pieces of logging and evaluation software. Subjects were asked to identify the tasks with which they were working, classify the documents that they viewed according to these tasks, and evaluate the usefulness of the documents. Results of a user-centered analysis demonstrate no general, direct relationship between display time and usefulness, and that display times differ significantly according to specific task, and according to specific user.
Human versus machine in the topic distillation task BIBAFull-Text 385-392
  Mingfang Wu; Gheorghe Muresan; Alistair McLean; Muh-Chyun (Morris) Tang; Ross Wilkinson; Yuelin Li; Hyuk-Jin Lee; Nicholas J. Belkin
This paper reports on and discusses a set of user experiments using the TREC 2003 Web interactive track protocol. The focus is on comparing humans and machine algorithms in terms of performance in a topic distillation task. We also investigated the effect of the search results layout in supporting the users' effort.
   We have demonstrated that machines can perform nearly as well as people on the topic distillation task. Given a system tailored to the task, there is a significant performance improvement; and given a presentation that supports the task, there is strong user satisfaction.

Keynote

Chemoinformatics: an application domain for information retrieval techniques BIBAFull-Text 393
  Peter Willett
Chemoinformatics is the generic name for the techniques used to represent, store and process information about the two-dimensional (2D) and three-dimensional (3D) structures of chemical molecules [1, 2]. Chemoinformatics has attracted much recent prominence as a result of developments in the methods that are used to synthesize new molecules and then to test them for biological activity. These developments have resulted in a massive increase in the amounts of structural and biological information that is available to support discovery programmes in the pharmaceutical and agrochemical industries.
   Chemoinformatics may appear to be far removed from information retrieval (IR), and there are indeed many significant differences, most notably in the use of graph representations to encode chemical molecules, rather than the strings that are used to encode text; however, there are also many similarities between the two fields, and this paper will exemplify some of these relationships. The most obvious area of similarity is in the principal types of database search that are carried out, with both application domains making extensive use of exact match, partial match and best match searching procedures: in the IR context these are known-item searching, Boolean searching and ranked-output searching; in the chemical context, these are structure searching, substructure searching and similarity searching. In IR, there is a natural distinction between an initial ranked-output search and one in which relevance feedback can be employed, where the keywords in the query statement are assigned weights based on their differential occurrences in known-relevant and known-nonrelevant documents. In the chemoinformatics technique called substructural analysis, substructural fragments are assigned weights based on their occurrence in molecules that do possess, and molecules that do not possess, some desired biological activity [3]. The analogy between relevance and biological activity has also resulted in the development of measures to quantify the effectiveness of chemical searching procedures that are based on the standard IR concepts of recall and precision [4]. Analogies such as these have provided the basis for some of the chemoinformatics research carried out in Sheffield. The starting point was the recognition that techniques applicable to documents represented by keywords might also be applicable to molecules represented by substructural fragments. 
This led directly to the introduction of similarity searching, something that is now a standard tool in chemoinformatics software systems; in particular, its use for virtual screening, i.e., the ranking of a database in order of decreasing probability of activity so as to maximize the cost-effectiveness of biological testing [5]. Measures of inter-molecular structural similarity also lie at the heart of systems for clustering chemical databases: just as IR has the Cluster Hypothesis (similar documents tend to be relevant to the same requests) as a basis for document clustering, so the Similar Property Principle (similar molecules tend to have similar properties) has led to clustering becoming a well-established tool for the organization of large chemical databases [6]. More recently, we have applied another IR technique, the use of data fusion to combine different rankings of a database, to chemoinformatics and again found that it is equally applicable in this new domain [7]. The many similarities between IR and chemoinformatics that have already been identified suggest that chemoinformatics is a domain of which IR researchers should be aware when considering the applicability of new techniques that they have developed.
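The data-fusion technique mentioned above, combining different rankings of the same database, can be sketched with a standard CombSUM-style rule (a generic illustration with min-max normalization; the item names and scores below are hypothetical, and this is not the chemoinformatics system itself):

```python
def comb_sum(rankings):
    """CombSUM data fusion: min-max normalize each ranked list's scores,
    sum them per item, and re-rank by the fused score."""
    fused = {}
    for scores in rankings:
        hi, lo = max(scores.values()), min(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on flat lists
        for item, s in scores.items():
            fused[item] = fused.get(item, 0.0) + (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)
```

Molecules (or documents) ranked highly by several independent similarity measures rise in the fused list, which is why the technique transfers so readily between the two fields.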

Machine learning for IR

Learning effective ranking functions for newsgroup search BIBAFull-Text 394-401
  Wensi Xi; Jesper Lind; Eric Brill
Web communities are virtual discussion spaces on the Web where people can freely discuss anything. While such communities function as discussion boards, they have even greater value as large repositories of archived information. In order to unlock the value of this resource, we need an effective means for searching archived discussion threads. Unfortunately the techniques that have proven successful for searching document collections and the Web are not ideally suited to the task of searching archived community discussions. In this paper, we explore the problem of creating an effective ranking function to predict the most relevant messages to queries in community search. We extract a set of predictive features from the thread trees of newsgroup messages as well as features of message authors and lexical distribution within a message thread. Our final results indicate that when using linear regression with this feature set, our search system achieved a 28.5% performance improvement compared to our baseline system.
Language-specific models in multilingual topic tracking BIBAFull-Text 402-409
  Leah S. Larkey; Fangfang Feng; Margaret Connell; Victor Lavrenko
Topic tracking is complicated when the stories in the stream occur in multiple languages. Typically, researchers have trained only English topic models because the training stories have been provided in English. In tracking, non-English test stories are then machine translated into English to compare them with the topic models. We propose a native language hypothesis stating that comparisons would be more effective in the original language of the story. We first test and support the hypothesis for story link detection. For topic tracking the hypothesis implies that it should be preferable to build separate language-specific topic models for each language in the stream. We compare different methods of incrementally building such native language topic models.
Web taxonomy integration through co-bootstrapping BIBAFull-Text 410-417
  Dell Zhang; Wee Sun Lee
We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to learn a classifier that can classify objects from the source taxonomy into categories of the master taxonomy. The key insight is that the availability of the source taxonomy data could be helpful to build better classifiers for the master taxonomy if their categorizations have some semantic overlap. In this paper, we propose a new approach, co-bootstrapping, to enhance the classification by exploiting such implicit knowledge. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.

Natural language processing

Evaluation of an extraction-based approach to answering definitional questions BIBAFull-Text 418-424
  Jinxi Xu; Ralph Weischedel; Ana Licuanan
This paper evaluates an extraction-based approach to answering definitional questions. Our system extracted useful linguistic constructs called linguistic features from raw text using information extraction tools, and formulated answers based on such features. The features employed include appositives, copulas, structured patterns, relations, propositions and raw sentences. The features were ranked based on feature type and similarity to a question profile. Redundant features were detected using a simple heuristic-based strategy. The approach achieved state-of-the-art performance at the TREC 2003 QA evaluation. Component analysis of the system was carried out using an automatic scoring function called Rouge (Lin and Hovy, 2003). Major findings include: 1) answers using linguistic features are significantly better than those using raw sentences; 2) the most useful features are appositives and copulas; 3) question profiles, as a means of modeling user interests, can significantly improve system performance; 4) the Rouge scores are closely correlated with subjective evaluation results, indicating the suitability of Rouge for evaluating definitional QA systems.
Query based event extraction along a timeline BIBAFull-Text 425-432
  Hai Leong Chieu; Yoong Keok Lee
In this paper, we present a framework and a system that extracts events relevant to a query from a collection C of documents and places those events along a timeline. Each event is represented by a sentence extracted from C, based on the assumption that "important" events are widely cited in many documents for a period of time within which these events are of interest. In our experiments, we used queries that are event types (e.g. "earthquake") and person names (e.g. "George Bush"). Evaluation was performed using G8 leader names as queries: a comparison made by human evaluators between manually and system-generated timelines showed that although manually generated timelines are on average more preferable, system-generated timelines are sometimes judged to be better than manually constructed ones.
Sentence completion BIBAFull-Text 433-439
  Korinna Grabski; Tobias Scheffer
We discuss a retrieval model in which the task is to complete a sentence, given an initial fragment, and given an application specific document collection. This model is motivated by administrative and call center environments, in which users have to write documents with a certain repetitiveness. We formulate the problem setting and discuss appropriate performance metrics. We present an index-based retrieval algorithm and a cluster-based approach, and evaluate our algorithms using collections of emails that have been written by two distinct service centers.

Web structure

Block-level link analysis BIBAFull-Text 440-447
  Deng Cai; Xiaofei He; Ji-Rong Wen; Wei-Ying Ma
Link analysis has shown great potential in improving the performance of web search; PageRank and HITS are two of the most popular algorithms. Most existing link analysis algorithms treat a web page as a single node in the web graph. However, in most cases a web page contains multiple semantics, and hence should not be treated as an atomic node. In this paper, the web page is partitioned into blocks using a vision-based page segmentation algorithm. By extracting the page-to-block and block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the WWW in which each node represents a single semantic topic. This graph can better describe the semantic structure of the web. Based on block-level link analysis, we propose two new algorithms, Block Level PageRank and Block Level HITS, whose performance we study extensively using web data.
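The PageRank recursion that Block Level PageRank adapts to the block graph can be pictured as a plain power iteration. The sketch below is generic PageRank over an arbitrary adjacency dict, not the authors' block-level variant; the damping factor and iteration count are illustrative defaults.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every node receives a (1 - d)/n "teleport" share each step.
        new = {u: (1.0 - d) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling node: spread its rank uniformly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```

On a symmetric two-page graph the ranks converge to 0.5 each, and they always sum to 1.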
Usefulness of hyperlink structure for query-biased topic distillation BIBAFull-Text 448-455
  Vassilis Plachouras; Iadh Ounis
In this paper, we introduce an information theoretic method for estimating the usefulness of the hyperlink structure induced from the set of retrieved documents. We evaluate the effectiveness of this method in the context of an optimal Bayesian decision mechanism, which selects the most appropriate retrieval approaches on a per-query basis for two TREC tasks. The estimation of the hyperlink structure's usefulness is stable when we use different weighting schemes, or when we employ sampling of documents to reduce the computational overhead. Next, we evaluate the effectiveness of the hyperlink structure's usefulness in a realistic setting, by setting the thresholds of a decision mechanism automatically. Our results show that improvements over the baselines are obtained.
Block-based web search BIBAFull-Text 456-463
  Deng Cai; Shipeng Yu; Ji-Rong Wen; Wei-Ying Ma

Posters

A hybrid statistical/linguistic model for generating news story gists BIBAFull-Text 464-465
  William P. Doran; Nicola Stokes; Eamonn Newman; John Dunnion; Joe Carthy
In this paper, we describe a News Story Gisting system that generates a 10-word short summary of a news story. This system uses a machine learning technique to combine linguistic, statistical and positional information in order to generate an appropriate summary. We also present the results of an automatic evaluation of this system with respect to the performance of other baseline summarisers using the new ROUGE evaluation metric.
Image based gisting in CLIR BIBAFull-Text 466-467
  Mark Sanderson; Robert Pasley
In this paper, we describe research that could lead to a novel approach to gaining an overview of a document in a foreign language. The research explores how much of the meaning of a document can be represented using images, by testing the ability of subjects to derive the search term that might have been used to return a set of images from an image library. The Google image search engine was used to retrieve the images for this experiment, which uses English throughout. The results were analysed with respect to a previous paper [1] exploring the ability to recognise concrete objects in hierarchies. We found a tendency to use one particular level of categorization.
Classifying racist texts using a support vector machine BIBAFull-Text 468-469
  Edel Greevy; Alan F. Smeaton
In this poster we present an overview of the techniques we used to develop and evaluate a text categorisation system that automatically classifies racist texts. Detecting racism is difficult because, unlike in some other text classification tasks, the presence of indicator words is insufficient to identify racist texts. Support Vector Machines (SVMs) are used to automatically categorise web pages according to whether or not they are racist. Different interpretations of what constitutes a term are possible, and in this poster we examine three representations of a web page within an SVM: bag-of-words, bigrams and part-of-speech tags.
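The first two page representations compared in the poster, bag-of-words and bigrams, can be sketched as simple feature extractors; the part-of-speech representation would additionally require a tagger, and the SVM itself a learning library, so both are omitted from this illustrative sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Unigram term counts -- the bag-of-words representation."""
    return Counter(text.lower().split())

def bigram_features(text):
    """Counts of adjacent word pairs -- the bigram representation."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))
```

Either Counter can then be turned into a feature vector over a fixed vocabulary and fed to a linear classifier.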
Discovery of aggregate usage profiles based on clustering information needs BIBAFull-Text 470-471
  Azreen Azman; Iadh Ounis
We present an alternative technique for discovering aggregate usage profiles from Web access logs. The technique is based on clustering information needs inferred from users' browsing paths, which are extracted from the access logs. An information need is inferred from each browsing path using the Ostensive Model [1]. The technique is evaluated in a document recommendation application, where we compare its performance against the well-established transaction-based technique proposed in [2]. Based on an initial evaluation, the results are encouraging.
Merging retrieval results in hierarchical peer-to-peer networks BIBKFull-Text 472-473
  Jie Lu; Jamie Callan
Keywords: hierarchical, peer-to-peer, result merging, retrieval
The effect of back-formulating questions in question answering evaluation BIBKFull-Text 474-475
  Tetsuya Sakai; Yoshimi Saito; Yumi Ichimura; Tomoharu Kokubu; Makoto Koyama
Keywords: evaluation, question answering, test collection
Effect of varying number of documents in blind feedback: analysis of the 2003 NRRC RIA workshop "bf_numdocs" experiment suite BIBKFull-Text 476-477
  Jesse Montgomery; Luo Si; Jamie Callan; David A. Evans
Keywords: information retrieval, optimal number of documents for feedback, pseudo-relevance feedback, query expansion
Eye-tracking analysis of user behavior in WWW search BIBAFull-Text 478-479
  Laura A. Granka; Thorsten Joachims; Geri Gay
We investigate how users interact with the results page of a WWW search engine using eye-tracking. The goal is to gain insight into how users browse the presented abstracts and how they select links for further exploration. Such understanding is valuable for improved interface design, as well as for more accurate interpretation of implicit feedback (e.g. clickthrough) for machine learning. The following presents initial results, focusing on the amount of time spent viewing the presented abstracts and the total number of abstracts viewed, as well as measures of how thoroughly searchers evaluate their result set.
Subwebs for specialized search BIBAFull-Text 480-481
  Raman Chandrasekar; Harr Chen; Simon Corston-Oliver; Eric Brill
We describe a method to define and use subwebs, user-defined neighborhoods of the Internet. Subwebs help improve search performance by inducing a topic-specific page relevance bias over a collection of documents. Subwebs may be automatically identified using a simple algorithm we describe, and used to provide highly-relevant topic-specific information retrieval. Using subwebs in a Help and Support topic, we see marked improvements in precision compared to generic search engine results.
Comparison of using passages and documents for blind relevance feedback in information retrieval BIBAFull-Text 482-483
  Zhenmei Gu; Ming Luo
This paper compares document blind feedback and passage blind feedback in Information Retrieval (IR), based on the work during the NRRC 2003 Reliable Information Access Summer workshop. The analysis of our experimental results shows overall consistency on the performance impact of using passages and documents for blind feedback. However, it is observed that the behavior of passage blind feedback, compared to document blind feedback, is both system dependent and topic dependent. The relationships between the performance impact of passage blind feedback and the number of feedback terms and the topic's average relevant document length, respectively, are examined to illustrate these dependencies.
Measuring pseudo relevance feedback & CLIR BIBAFull-Text 484-485
  Paul Clough; Mark Sanderson
In this poster, we report on the effects of pseudo relevance feedback (PRF) for a cross-language image retrieval task using a test collection. PRF has typically been shown to improve retrieval performance in previous CLIR experiments, based on average precision at a fixed rank. However, our experiments show that the number of queries for which no relevant documents are returned also increases. Because query reformulation is likely to be harder in cross-language than in monolingual searching, a great deal of user dissatisfaction would be associated with this scenario. We propose that an additional effectiveness measure based on failed queries may better reflect user satisfaction than average precision alone.
A two-stage mixture model for pseudo feedback BIBAFull-Text 486-487
  Tao Tao; ChengXiang Zhai
Pseudo feedback is a commonly used technique to improve information retrieval performance. It assumes a few top-ranked documents to be relevant, and learns from them to improve the retrieval accuracy. A serious problem is that the performance is often very sensitive to the number of pseudo feedback documents. In this poster, we address this problem in a language modeling framework. We propose a novel two-stage mixture model, which is less sensitive to the number of pseudo feedback documents than an effective existing feedback model. The new model can tolerate a more flexible setting of the number of pseudo feedback documents without the danger of losing much retrieval accuracy.
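The pseudo-feedback loop whose sensitivity to the number of feedback documents the poster addresses can be sketched generically. The Rocchio-style expansion below is an illustration of pseudo feedback itself, not the two-stage mixture model the authors propose; k is the number of top-ranked documents assumed relevant, and alpha/beta are illustrative weights.

```python
from collections import Counter

def pseudo_feedback(query_terms, ranked_docs, k=3, n_terms=5, alpha=1.0, beta=0.5):
    """Assume the top-k ranked docs (lists of terms) are relevant and
    add their most frequent terms to the query, Rocchio-style."""
    expanded = Counter({t: alpha for t in query_terms})
    centroid = Counter()
    for doc in ranked_docs[:k]:
        centroid.update(doc)
    for term, freq in centroid.most_common(n_terms):
        # Weight expansion terms by their average frequency in the top-k docs.
        expanded[term] += beta * freq / k
    return dict(expanded)
```

Varying k changes which terms enter the expanded query, which is exactly the sensitivity the poster aims to reduce.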
Natural language processing for browse help BIBAFull-Text 488-489
  Eric Crestan; Claude de Loupy
In this paper, we present three "browsing" systems that should save users' time. The first uses named entities and offers a way to reduce the search space. Using an information visualization system, the user can more easily comprehend the content of a corpus or a document. Named entities are highlighted for quick reading, and temporal and geographic representations give a global view of the result of a query. All of these browse and search aids seem to be very useful. Nevertheless, an evaluation would give more practical results.
Triangulation without translation BIBAFull-Text 490-491
  James Mayfield; Paul McNamee
Transitive retrieval and triangulation have been proposed as ways to improve cross-language retrieval quality when translation resources have poor lexical coverage. We demonstrate that cross-language retrieval is viable for European languages with no translation resources at all; that transitive retrieval without translation does not suffer the drop-off in retrieval quality sometimes reported for transitive retrieval with translation; and that triangulation that combines multiple transitive runs with no translation can boost performance over direct translation-free retrieval.
A session-based search engine BIBAFull-Text 492-493
  Smitha Sriram; Xuehua Shen; Chengxiang Zhai
In this poster, we describe a novel session-based search engine, which puts the search in context. The search engine has a number of session-based features including expansion of the current query with user query history and clickthrough data (title and summary of clicked web pages) in the same search session and the session boundary recognition through temporal closeness and probabilistic similarity between query terms. In addition, the search engine visualizes the rank change of web pages as different queries are submitted in the same search session to help the user reformulate the query.
Evaluation of filtering current news search results BIBAFull-Text 494-495
  Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David Grossman; Ophir Frieder
We describe an evaluation of result set filtering techniques for providing ultra-high precision in the task of presenting related news for general web queries. In this task, the negative user experience generated by retrieving non-relevant documents has a much worse impact than not retrieving relevant ones. We adapt cost-based metrics from the document filtering domain to this result filtering problem in order to explicitly examine the tradeoff between missing relevant documents and retrieving non-relevant ones. A large manual evaluation of three simple threshold filters shows that the basic approach of counting matching title terms outperforms also incorporating selected abstract terms based on part-of-speech or higher-level linguistic structures. Simultaneously, leveraging these cost-based metrics allows us to explicitly determine what other tasks would benefit from these alternative techniques.
The document as an ergodic markov chain BIBAFull-Text 496-497
  Eduard Hoenkamp; Dawei Song
In recent years, statistical language models have been proposed as an alternative to the vector space model. Viewing documents as language samples introduces the issue of defining a joint probability distribution over the terms.
   The present paper models a document as the result of a Markov process. It argues that this process is ergodic, which is theoretically plausible and easy to verify in practice.
   The theoretical result is that the joint distribution can be easily obtained. This can also be applied at search resolutions other than the document level. We verified this in an experiment on query expansion, demonstrating both the validity and the practicability of the method. This holds promise for general language models.
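In its simplest maximum-likelihood form, the Markov view of a document amounts to estimating term-to-term transition probabilities from adjacent token pairs. The sketch below is only this estimation step, for illustration; the paper's contribution is the ergodicity argument and the resulting joint distribution, not this estimator.

```python
from collections import Counter, defaultdict

def transition_probs(tokens):
    """MLE of P(next term | current term) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    # Normalise each row of counts into a conditional distribution.
    return {cur: {nxt: c / sum(follows.values()) for nxt, c in follows.items()}
            for cur, follows in counts.items()}
```

If the resulting chain is ergodic, its stationary distribution exists and can be read off by power iteration, which is what makes the joint distribution easy to obtain.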
Expertise community detection BIBAFull-Text 498-499
  Raymond D'Amore
Providing knowledge workers with access to experts and communities-of-practice is central to sharing expertise and crucial to organizational performance, adaptation, and even survival. This paper covers ongoing research to develop an Expert Locator prototype, a model-based system for detecting experts and broader communities-of-practice. The underlying expertise model is extensible and supports aggregation of evidence across diverse sources. The prototype is being used to locate critical expertise in key project areas, and current evaluation indicates its potential effectiveness.
Learning patterns to answer open domain questions on the web BIBAFull-Text 500-501
  Dmitri Roussinov; Jose Robles
While successful in providing keyword-based access to web pages, commercial search portals still lack the ability to answer questions expressed in natural language. We present a probabilistic approach to automated question answering on the Web, based on trainable patterns, answer triangulation and semantic filtering. In contrast to other "shallow" approaches, our approach is entirely self-learning: it does not require any manually created scoring and filtering rules, while still performing comparably. It also performs better than other fully trainable approaches.
Email is a stage: discovering people roles from email archives BIBFull-Text 502-503
  Anton Leuski
Searching databases for semantically-related schemas BIBAFull-Text 504-505
  Gauri Shah; Tanveer Syeda-Mahmood
In this paper, we address the problem of searching schema databases for semantically related schemas. We first give a method for computing the semantic similarity between pairs of schemas based on tokenization, part-of-speech tagging, word expansion, and ontology matching. We then address the problem of indexing the schema database through a semantic hash table. Matching schemas in the database are found by hashing the query attributes and locating peaks in the histogram of schema hits. Results indicated a 90% improvement in search performance while maintaining high precision and recall.
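The query step can be pictured as an inverted index from normalised attribute tokens to schemas, with peaks in the hit histogram as candidate matches. This is a sketch under assumed data structures; the paper's semantic hashing also folds in part-of-speech tagging, word expansion and ontology matching, all omitted here.

```python
from collections import Counter, defaultdict

def build_index(schemas):
    """Map each attribute token to the set of schemas containing it."""
    index = defaultdict(set)
    for name, attrs in schemas.items():
        for attr in attrs:
            index[attr.lower()].add(name)
    return index

def schema_hits(query_attrs, index):
    """Histogram of schemas hit by the query attributes, peaks first."""
    hits = Counter()
    for attr in query_attrs:
        hits.update(index.get(attr.lower(), ()))
    return hits.most_common()
```

A schema sharing many attributes with the query accumulates many hits and rises to the top of the histogram.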
Topic prediction based on comparative retrieval rankings BIBAFull-Text 506-507
  Chris Buckley
A new measure, AnchorMap, is introduced to evaluate how close two document retrieval rankings are to each other. It is shown that AnchorMap scores, when run on a set of initial ranked document lists from 8 different systems, are very highly correlated with categorization of topics as easy or hard, and separately, are highly correlated with those topics on which blind feedback works. In another experiment, AnchorMap is used to compare the initial ranked document list from a single system against the ranked document list from that system after blind feedback. Again, high AnchorMap values are highly correlated with both topic difficulty and successful application of blind feedback. Both experiments are examples of using properties of a topic which are independent of relevance information to predict the actual performance of IR systems on the topic. Initial experiments to attempt to improve retrieval performance based upon AnchorMap failed; the causes for failure are discussed.
Context-based question-answering evaluation BIBAFull-Text 508-509
  Elizabeth D. Liddy; Anne R. Diekema; Ozgur Yilmazel
In this poster, we will present the results of efforts we have undertaken to evaluate a QA system in a real-world environment, and to understand the dimensions on which users evaluate QA systems when given free rein to comment on whatever dimensions they deem important.
Design of an e-book user interface and visualizations to support reading for comprehension BIBAFull-Text 510-511
  Yixing Sun; David J. Harper; Stuart N. K. Watt
Current e-Book browsers provide minimal support for comprehending the organization, narrative structure, and themes of large, complex books. In order to build an understanding of such books, readers should be provided with user interfaces that present, and relate, the organizational, narrative and thematic structures. We propose adapting information retrieval techniques for the purpose of discovering these structures, and sketch three distinctive visualizations for presenting them to the e-Book reader. These visualizations are presented within an initial design for an e-Book browser.
Toward better weighting of anchors BIBAFull-Text 512-513
  David Hawking; Trystan Upstill; Nick Craswell
Okapi BM25 scoring of anchor-text surrogate documents has been shown to facilitate effective ranking in navigational search tasks over web data. We hypothesize that even better ranking can be achieved in certain important cases, particularly when anchor scores must be fused with content scores, by avoiding length normalisation and by reducing the attenuation of scores associated with high tf. Preliminary results are presented.
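The length-normalisation switch argued for here corresponds to the b parameter in the standard BM25 term score: setting b = 0 removes the document-length factor entirely. The sketch below is textbook BM25 for a single term, not the authors' exact fusion scheme; k1 and b carry their usual default values.

```python
import math

def bm25_term(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score contribution of one term.

    With b = 0 the norm reduces to k1, so document length no longer
    affects the score -- the setting suggested for anchor-text
    surrogate documents."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

With b = 0 a long anchor-text surrogate scores the same as a short one; with the default b = 0.75 the longer surrogate is penalised.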
Aggregated feature retrieval for MPEG-7 via clustering BIBAFull-Text 514-515
  Jiamin Ye; Alan F. Smeaton
In this paper, we describe an approach to combining text and visual features from MPEG-7 descriptions of video. A video retrieval process is aligned to a text retrieval process based on the TF*IDF vector space model via clustering of low-level visual features. Our assumption is that shots within the same cluster are not only similar visually but also semantically, to a certain extent. Our experiments on the TRECVID2002 and TRECVID2003 collections show that adding extra meaning to a shot based on the shots from the same cluster is useful when each video in a collection contains a high proportion of similar shots, for example in documentaries.
Answer models for question answering passage retrieval BIBAFull-Text 516-517
  Andres Corrada-Emmanuel; W. Bruce Croft
Answer patterns have been shown to improve the performance of open-domain factoid QA systems. Their use, however, requires either constructing the patterns manually or developing algorithms for learning them automatically. We present here a simpler approach that extends the techniques of language modeling to create answer models. These are language models trained on the correct answers to training questions. We show how they fit naturally into a probabilistic model for answer passage retrieval and demonstrate their effectiveness on the TREC 2002 QA Corpus.
Collaborative filing in a document repository BIBAFull-Text 518-519
  Harris Wu; Michael D. Gordon
We introduce an emergent, collaborative filing system in which an individual can organize a subset of documents in a repository into a personal hierarchy and share that hierarchy with others. The system generates a "consensus" hierarchy from all users' personal hierarchies, which provides a full, common, and emergent view of all documents. We believe that collaborative filing helps translate personal, tacit knowledge into sharable structures, which help the user as well as the community of which he or she is a part. Our filing system is suitable for any documents, from text to multimedia files. Initial results on an experimental website show promise: for a knowledge task involving extensive document retrieval, hierarchies are not only used frequently but are also effective in identifying high-quality documents. One surprising finding is how often subjects use others' personal hierarchies; upon closer examination, social networks play a key role as well.
A study of topic similarity measures BIBAFull-Text 520-521
  Ryen W. White; Joemon M. Jose
In this poster we describe an investigation of topic similarity measures. We elicit assessments on the similarity of 10 pairs of topics from 76 subjects and use these as a benchmark to assess how well each measure performs. The measures have the potential to form the basis of a predictive technique for adaptive search systems. The results of our evaluation show that measures based on the level of correlation between topics accord most closely with subjects' general perceptions of search topic similarity.
Effectiveness of web page classification on finding list answers BIBAFull-Text 522-523
  Hui Yang; Tat-Seng Chua
List question answering (QA) offers a unique challenge in effectively and efficiently locating a complete set of distinct answers from huge corpora or the Web. In TREC-12, the median average F1 performance of list QA systems was only 6.9%. This paper exploits the wealth of freely available text and link structures on the Web to seek complete answers to list questions. We employ natural language parsing, web page classification and clustering to find reliable list answers. We also study the effectiveness of web page classification on both the recall and uniqueness of answers for web-based list QA.
Detection and translation of OOV terms prior to query time BIBAFull-Text 524-525
  Ying Zhang; Phil Vines
Accurate cross-language information retrieval requires that query terms be correctly translated. Several new techniques have been developed to improve the translation of out-of-vocabulary terms in English-Chinese cross-language information retrieval. However, these require queries and a document collection to enable translation disambiguation. Although effective, they involve much processing and searching of the Web at query time, and may not be practical in a production web search engine. In this work, we consider which tasks may be carried out beforehand, the goal being to reduce the processing required at query time. We have successfully developed new techniques to extract and translate out-of-vocabulary terms using the Web and to add them to a translation dictionary prior to query time.
Evaluation of the real and perceived value of automatic and interactive query expansion BIBAFull-Text 526-527
  Yael Nemeth; Bracha Shapira; Meirav Taeib-Maimon
The paper describes a user study examining methods for improving users' queries, specifically interactive and automatic query expansion and advanced search options. The user study includes subjective and objective evaluation of the effect of these methods, and a comparison between their real and perceived effects.
The NRRC reliable information access (RIA) workshop BIBKFull-Text 528-529
  Donna Harman; Chris Buckley
Keywords: failure analysis, relevance feedback
On evaluating web search with very few relevant documents BIBAFull-Text 530-531
  Ian Soboroff
Many common web searches by their nature have a very small number of relevant documents. Homepage and "namedpage" searching are known-item searches where there is only a single relevant document. Topic distillation is a special kind of topical relevance search where the user wishes to find a few key web sites rather than every relevant web page. Because these types of searches are so common, web search evaluations have come to focus on tasks where there are very few relevant documents. Evaluations with few relevant documents pose special challenges for current metrics. In particular, the TREC 2003 topic distillation evaluation is unable to distinguish most submitted runs from each other.
A music recommender based on audio features BIBAFull-Text 532-533
  Qing Li; Byeong Man Kim; Dong Hai Guan; Duk whan Oh
Many collaborative music recommender systems (CMRS) have succeeded in capturing the similarity among users or items based on ratings; however, they have rarely considered the information available from the multimedia itself, such as genres, let alone audio features from the media stream. Such information is valuable and can be used to solve several problems in recommender systems. In this paper, we design a CMRS based on audio features of the multimedia stream. In the CMRS, we provide a recommendation service by our proposed method, in which a clustering technique is used to integrate the audio features of music into the collaborative filtering (CF) framework in the hope of achieving better performance. Experiments are carried out to demonstrate that our approach is feasible.
Information extraction using two-phase pattern discovery BIBAFull-Text 534-535
  Liping Ma; John Shepherd
This paper presents a new two-phase pattern (2PP) discovery technique for information extraction. 2PP consists of orthographic pattern discovery (OPD) and semantic pattern discovery (SPD), where OPD determines the structural features of an identified region of a document and SPD discovers a dominant semantic pattern for the region via inference, apposition and analogy. The discovered pattern is then applied back to the region to extract the required data items through pattern matching. We evaluated 2PP using 6500 data items and obtained effective results.
A search engine for imaged documents in PDF files BIBAFull-Text 536-537
  Yue Lu; Li Zhang; Chew Lim Tan
Large quantities of documents on the Internet and in digital libraries are simply scanned and archived in image format, many of them packed into PDF files. The word-search tool provided by Adobe Reader/Acrobat does not work for these imaged documents. In this paper, we present a search engine that addresses this issue for imaged documents in PDF files. The experimental results show an encouraging performance.
Context sensitive vocabulary and its application in protein secondary structure prediction BIBAFull-Text 538-539
  Yan Liu; Jaime Carbonell; Judith Klein-Seetharaman; Vanathi Gopalakrishnan
Protein secondary structure prediction is an important step towards understanding the relation between protein sequence and structure. However, most current prediction methods use features difficult for biologists to interpret. In this paper, we present a new method that applies information retrieval techniques to solve the problem: we extract a context sensitive biological vocabulary for protein sequences and apply text classification methods to predict protein secondary structure. Experimental results show that our method performs comparably to state-of-the-art methods. Furthermore, the context sensitive vocabularies can serve as a useful tool to discover meaningful regular expression patterns for protein structures.
Formal multiple-bernoulli models for language modeling BIBKFull-Text 540-541
  Donald Metzler; Victor Lavrenko; W. Bruce Croft
Keywords: information retrieval, language modeling
User biased document language modelling BIBAFull-Text 542-543
  L. Azzopardi; M. Girolami; C. J. van Rijsbergen
Capitalizing on the intuitive underlying assumptions of Language Modelling for Ad-Hoc Retrieval we present a novel approach that is capable of injecting the user's context of the document collection into the retrieval process. The preliminary findings from the evaluation undertaken suggest that improved IR performance is possible under certain circumstances. This motivates further investigation to determine the extent and significance of this improved performance.
Information retrieval for language tutoring: an overview of the REAP project BIBKFull-Text 544-545
  Kevyn Collins-Thompson; Jamie Callan
Keywords: computer-assisted learning, information retrieval
A unified model of literal mining and link analysis for ranking web resources BIBAFull-Text 546-547
  Yinghui Xu; Kyoji Umemura
Web link analysis has been shown to provide significant enhancement to the precision of Web search in practice. The PageRank algorithm, which is used in the Google search engine, plays an important role in improving the quality of its results by exploiting the explicit hyperlink structure among Web pages. The prestige of Web pages defined by PageRank is derived purely from a surfer's random walk on the Web graph, without consideration of textual content. In the practical sense, however, user surfing behavior is far from random jumping. In this paper, we present a unified model for more accurate page ranking, in which the user's surfing is guided by a probabilistic model based on literal matching between connected pages. The results show that our proposed ranking algorithms do perform better than the original PageRank.
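The biased random walk this abstract describes can be sketched as a power iteration in which the surfer follows out-links in proportion to textual similarity between connected pages rather than uniformly at random. The function name, the simple row-normalization, and the uniform fallback for dangling pages below are our illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def literal_biased_pagerank(adj, sim, d=0.85, tol=1e-10, max_iter=100):
    """PageRank variant: transition probabilities along existing links
    are proportional to the textual similarity of the linked pages."""
    n = adj.shape[0]
    # Weight each link by inter-page similarity, then renormalize per row.
    w = adj * sim
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows with no outgoing weight fall back to uniform jumps.
    t = np.where(row_sums > 0, w / np.where(row_sums == 0, 1, row_sums), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (r @ t)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```

With uniform similarities this reduces to the original PageRank; skewed similarities shift rank mass toward textually related neighbors.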
Automatic recognition of reading levels from user queries BIBKFull-Text 548-549
  Xiaoyong Liu; W. Bruce Croft; Paul Oh; David Hart
Keywords: answer level, context, personalization, query classification, query inference, readability, reading level
A joint framework for collaborative and content filtering BIBAFull-Text 550-551
  Justin Basilico; Thomas Hofmann
This paper proposes a novel, unified, and systematic approach to combine collaborative and content-based filtering for ranking and user preference prediction. The framework incorporates all available information by coupling together multiple learning problems and using a suitable kernel or similarity function between user-item pairs. We propose and evaluate an on-line algorithm (JRank) that generalizes perceptron learning using this framework and shows significant improvement over other approaches.
Refining term weights of documents using term dependencies BIBAFull-Text 552-553
  Hee-soo Kim; Ikkyu Choi; Minkoo Kim
When processing raw documents in an Information Retrieval (IR) system, a term-weighting scheme is used to calculate the importance of each term occurring in a document. However, most term-weighting schemes assume that a term is independent of the other terms. Term dependency is an indispensable consequence of language use [1], so this assumption can cause information in a document to be lost. In this paper, we propose a new approach that refines the term weights of documents using term dependencies discovered from a set of documents. We then evaluate our method with two experiments based on the vector space model [2] and the language model [3].
Multiple sources of evidence for XML retrieval BIBAFull-Text 554-555
  Borkur Sigurbjornsson; Jaap Kamps; Maarten de Rijke
Document-centric XML collections contain text-rich documents, marked up with XML tags. The tags add lightweight semantics to the text. Querying such collections calls for a hybrid query language: the text-rich nature of the documents suggest a content-oriented (IR) approach, while the mark-up allows users to add structural constraints to their IR queries. We will show how evidence for relevancy from different sources helps to answer such hybrid queries. We evaluate our methods using the INEX 2003 test set, and show that structural hints in hybrid queries help to improve retrieval effectiveness.
Verifying a Chinese collection for text categorization BIBAFull-Text 556-557
  Yuen-Hsien Tseng; William John Teahan
This article describes the development of a free test collection for Chinese text categorization. A novel retrieval-based approach was developed to detect duplicates and labeling inconsistencies in this corpus, and in Reuters-21578 for comparison. The method was able to detect certain types of similar and/or duplicated documents that were overlooked by an alternative repetition-based method [1]. Experiments showed that categorization effectiveness was not affected by the confusing documents.
Query-related data extraction of hidden web documents BIBAFull-Text 558-559
  Y. L. Hedley; M. Younas; A. James; M. Sanderson
A large amount of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is dynamically generated through querying databases -- which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
The patent retrieval task in the fourth NTCIR workshop BIBAFull-Text 560-561
  Atsushi Fujii; Makoto Iwayama; Noriko Kando
This paper describes the Patent Retrieval Task in the Fourth NTCIR Workshop, and the test collections produced in this task. We perform the invalidity search task, in which each participant group searches a patent collection for the patents that can invalidate the demand in an existing claim. We also perform the automatic patent map generation task, in which the patents associated with a specific topic are organized in a multi-dimensional matrix.
Measuring ineffectiveness BIBAFull-Text 562-563
  Ellen M. Voorhees
An evaluation methodology that targets ineffective topics is needed to support research on obtaining more consistent retrieval across topics. Using average values of traditional evaluation measures is not an appropriate methodology because it emphasizes effective topics: poorly performing topics' scores are by definition small, and they are therefore difficult to distinguish from the noise inherent in retrieval evaluation. We examine two new measures that emphasize a system's worst topics. While these measures focus on different aspects of retrieval behavior than traditional measures, the measures are less stable than traditional measures and the margin of error associated with the new measures is large relative to the observed differences in scores.
Information retrieval using hierarchical dirichlet processes BIBAFull-Text 564-565
  Philip J. Cowans
An information retrieval method is proposed using a hierarchical Dirichlet process as a prior on the parameters of a set of multinomial distributions. The resulting method naturally includes a number of features found in other popular methods. Specifically, tf.idf-like term weighting and document length normalisation are recovered. The new method is compared with Okapi BM-25 [3] and the Twenty-One model [1] on TREC data and is shown to give better performance.
Broken plural detection for Arabic information retrieval BIBAFull-Text 566-567
  Abduelbaset Goweder; Massimo Poesio; Anne De Roeck
A study of methods for normalizing user ratings in collaborative filtering BIBAFull-Text 568-569
  Rong Jin; Luo Si
The goal of collaborative filtering is to make recommendations for a test user by utilizing the rating information of users who share interests similar to the test user's. Because ratings are determined not only by user interests but also by the rating habits of users, it is important to normalize the ratings of different users to the same scale. In this paper, we compare two different normalization strategies for user ratings, namely the Gaussian normalization method and the decoupling normalization method. In particular, we incorporated these two rating normalization methods into two collaborative filtering algorithms and evaluated their effectiveness on the EachMovie dataset. The experimental results show that the decoupling method for rating normalization is more effective than the Gaussian normalization method in improving the performance of collaborative filtering algorithms.
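The Gaussian normalization the abstract refers to is ordinarily a per-user z-score, mapping each user's ratings onto a common zero-mean, unit-variance scale before similarity computation. A minimal sketch (the function name is ours, and the paper's exact formulation may differ):

```python
import numpy as np

def gaussian_normalize(ratings):
    """Z-score ('Gaussian') normalization of one user's ratings:
    subtract the user's mean rating and divide by the standard
    deviation, removing per-user shifts and spreads in rating habits."""
    ratings = np.asarray(ratings, dtype=float)
    mu = ratings.mean()
    sigma = ratings.std()
    if sigma == 0:  # user rated everything identically
        return np.zeros_like(ratings)
    return (ratings - mu) / sigma
```

After normalization, a generous rater and a harsh rater who rank the same movies in the same order produce identical vectors.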
A review of relevance feedback experiments at the 2003 reliable information access (RIA) workshop. BIBAFull-Text 570-571
  Robert H. Warren; Ting Liu
We review here the results of one of the experiments performed at the 2003 Reliable Information Access (RIA) Workshop, hosted by Mitre Corporation and the Northeast Regional Research Center (NRRC). The experiment concentrates on query expansion using relevance feedback and explores the behaviour of several information retrieval systems using variable numbers of relevant documents.
Supporting federated information sharing communities BIBAFull-Text 572-573
  Bicheng Liu; David J. Harper; Stuart Watt
In this paper we describe the concept of Federated Information Sharing Communities (FISC), and associated architecture, which provide a way for organisations, distributed workgroups and individuals to build up a federated community based on their common interests over the World Wide Web. To support communities, we develop capabilities that go beyond the generic retrieval of documents to include the ability to retrieve people, their interests and inter-relationships. We focus on providing social awareness "in the large" to help users understand the members within a community and the relationships between them. Within the FISC framework, we provide viewpoint retrieval to enable a user to construct visual contextual views of the community from the perspective of any community member. To evaluate these ideas we develop test beds to compare individual component technologies such as user and group profile construction and similarity matching, and we develop prototypes to explore the broader architecture and usage issues.
The effect of document retrieval quality on factoid question answering performance BIBKFull-Text 574-575
  Kevyn Collins-Thompson; Jamie Callan; Egidio Terra; Charles L. A. Clarke
Keywords: information retrieval, question answering
Exploiting hyperlink recommendation evidence in navigational web search BIBFull-Text 576-577
  Trystan Upstill; Stephen Robertson
Context-based methods for text categorisation BIBAFull-Text 578-579
  D. S. Hunnisett; W. J. Teahan
We propose several context-based methods for text categorization. One method, a small modification to the PPM compression-based model which is known to significantly degrade compression performance, counter-intuitively has the opposite effect on categorization performance. Another method, called C-measure, simply counts the presence of higher order character contexts, and outperforms all other approaches investigated.
eMailSift: mining-based approaches to email classification BIBKFull-Text 580-581
  Manu Aery; Sharma Chakravarthy
Keywords: email classification, frequent itemsets, graph based data mining
Constructing a text corpus for inexact duplicate detection BIBAFull-Text 582-583
  Jack G. Conrad; Cindy P. Schriber
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.
Why current IR engines fail BIBAFull-Text 584-585
  Chris Buckley
Observations from a unique investigation of failure analysis of Information Retrieval (IR) research engines are presented. The Reliable Information Access (RIA) Workshop invited seven leading IR research groups to supply both their systems and their experts to an effort to analyze why their systems fail on some topics and whether the failures are due to system flaws, approach flaws, or the topic itself. There were surprising results from this cross-system failure analysis. One is that despite systems retrieving very different documents, the major cause of failure for any particular topic was almost always the same across all systems. Another is that relationships between aspects of a topic are not especially important for state-of-the-art systems; the systems are failing at a much more basic level where the top-retrieved documents are not reflecting some aspect at all.
Automatic sense disambiguation for acronyms BIBAFull-Text 586-587
  Manuel Zahariev
A machine learning methodology for the disambiguation of acronym senses is presented, which starts from an acronym sense dictionary. Training data is automatically extracted from downloaded documents identified from the results of search engine queries. Leave-one-out cross-validation on 9,963 documents with 47 acronym forms achieves an accuracy of 92.58% and Fβ=1 = 91.52%.
Filtering for personal web information agents BIBKFull-Text 588-589
  Gabriel L. Somlo; Adele E. Howe
Keywords: WebIR, adaptive filtering, continuous queries
Evaluating content-based filters for image and video retrieval BIBAFull-Text 590-591
  Michael G. Christel; Neema Moraveji; Chang Huang
This paper investigates the level of metadata accuracy required for image filters to be valuable to users. Access to large digital image and video collections is hampered by ambiguous and incomplete metadata attributed to imagery. Though improvements are constantly made in the automatic derivation of semantic feature concepts such as indoor, outdoor, face, and cityscape, it is unclear how good these improvements should be and under what circumstances they are effective. This paper explores the relationship between metadata accuracy and effectiveness of retrieval using an amateur photo collection, documentary video, and news video. The accuracy of the feature classification is varied from performance typical of automated classifications today to ideal performance taken from manually generated truth data. Results establish an accuracy threshold at which semantic features can be useful, and empirically quantify the collection size when filtering first shows its effectiveness.
Semantic video classification by integrating unlabeled samples for classifier training BIBAFull-Text 592-593
  Jianping Fan; Hangzai Luo
Semantic video classification has become an active research topic to enable more effective video retrieval and knowledge discovery from large-scale video databases. However, most existing techniques for classifier training require a large number of hand-labeled samples to learn correctly. To address this problem, we have proposed a semi-supervised framework to achieve incremental classifier training by integrating a limited number of labeled samples with a large number of unlabeled samples. Specifically, this semi-supervised framework includes: (a) modeling the semantic video concepts by using finite mixture models to approximate the class distributions of the relevant salient objects; and (b) developing an adaptive EM algorithm to integrate the unlabeled samples, achieving parameter estimation and model selection simultaneously. Experimental results in the domain of medical videos are also provided.
Implicit queries (IQ) for contextualized search BIBAFull-Text 594
  Susan Dumais; Edward Cutrell; Raman Sarin; Eric Horvitz
The Implicit Query (IQ) prototype is a system which automatically generates context-sensitive searches based on a user's current computing activities. In the demo, we show IQ running when users are reading or composing email. Queries are automatically generated by analyzing the email message, and results are presented in a small pane adjacent to the current window to provide peripheral awareness of related information.
An implicit system for predicting interests BIBAFull-Text 595
  Ryen W. White; Joemon M. Jose
We demonstrate an adaptive search system that works proactively to help searchers find relevant information. The system observes searcher interaction, uses what it sees to model information needs and chooses additional query terms. The system watches for changes in the topic of the search and selects retrieval strategies that reflect the extent to which the topic is seen to change.
Geotemporal querying of multilingual documents BIBAFull-Text 596
  Fredric C. Gey; Aitao Chen; Ray Larson; Kim Carl
This demonstration utilizes a geographic information system interface to display multilingual news documents in time and space by extracting place names from text and matching them to a multilingual multi-script gazetteer which identifies the latitude and longitude of the location.
ACES: a contextual engine for search BIBKFull-Text 597
  Xuehua Shen; Smitha Sriram; Chengxiang Zhai
Keywords: context, query expansion, query history
Armadillo: harvesting information for the semantic web BIBKFull-Text 598
  Sam Chapman; Alexiei Dingli; Fabio Ciravegna
Keywords: armadillo, information extraction, information integration, information retrieval, web intelligence
UKSearch: search with automatically acquired domain knowledge BIBKFull-Text 599
  Udo Kruschwitz; Hala Al-Bakour
Keywords: WWW, concept hierarchies
Geographic information retrieval (GIR): searching where and what BIBFull-Text 600
  Ray R. Larson; Patricia Frontiera

Doctorial consortium

Sharing knowledge online (abstract only): a dream or reality? BIBAFull-Text 602
  Melanie Gnasa
The Web provides a global platform for knowledge sharing. However, several shortcomings still arise from the absence of personalization and collaboration in Web searches. More effective retrieval techniques could be provided by means of transforming explicit knowledge into implicit knowledge. The approach presented in this paper is based on a peer-to-peer architecture and aims at complementing classical Web searches in terms of personalized ranking lists. These local rankings can be accumulated and evaluated in order to supplement the process of knowledge generation by building Virtual Knowledge Communities. Furthermore, the aggregation of ranking lists can be used to identify topics as well as communities of interest. Together with social aspects for community support, a framework for congenial Web search is defined.
Supporting federated information sharing communities (abstract only) BIBAFull-Text 602
  Bicheng Liu
Increasingly, the World Wide Web is being viewed as a means of creating web communities rather than simply as a means of publishing and delivering documents and services. In this research we develop the concept of Federated Information Sharing Communities (FISC), and associated architecture, that enables community-centred information systems to be constructed. Such systems provide a way for organisations, distributed workgroups and individuals to build up a federated community based on their common interests over the World Wide Web. To support communities, we develop capabilities that go beyond the generic retrieval of documents to include the ability to retrieve people, their interests and inter-relationships. We focus on providing social awareness "in the large" to help users understand the members within a community and the relationships between them: who is working on what topic, and who is working with whom. Within the FISC framework, we provide a viewpoint retrieval service to enable a user to construct visual contextual views of the community from the perspective of any community member. To evaluate these ideas we develop test beds to compare individual component technologies such as user and group profile construction and similarity matching, and we develop prototypes (Web Network and "CiteSeer Community") to explore the broader architecture and usage issues.
Improving document representation by accumulating relevance feedback (abstract only): the relevance feedback accumulation algorithm BIBAFull-Text 602
  Razvan Stefan Bot
This paper presents a document representation improvement technique named the Relevance Feedback Accumulation (RFA) algorithm. Using prior relevance feedback assessments and a data mining measure called support this algorithm improves document representations and generates higher quality indexes. At the same time, the algorithm is efficient and scalable, suited for retrieval systems managing large document collections. The results of the preliminary evaluation reveal that the RFA algorithm is able to reduce the index dimensionality while improving retrieval effectiveness.
Toponym resolution in text (abstract only): "which Sheffield is it?" BIBAFull-Text 602
  Jochen L. Leidner
Named entity tagging comprises the sub-tasks of identifying a text span and classifying it, but this view ignores the relationship between the entities and the world. Spatial and temporal entities ground events in space-time, and this relationship is vital for applications such as question answering and event tracking. There is much recent work regarding the temporal dimension (Setzer and Gaizauskas 2002, Mani and Wilson 2000), but no detailed study of the spatial dimension.
   I propose to investigate how spatial named entities (which are often referentially ambiguous) can be automatically resolved with respect to an extensional coordinate model (toponym resolution). To this end, various information sources including linguistic cue patterns, co-occurrence information, discourse/positional information, world knowledge (such as size and population) as well as minimality heuristics (Leidner et al. 2003) will be combined in a supervised machine learning regime.
   The major contributions of this research project will be a corpus of text manually annotated for spatial named entities with their model correlates as a training and evaluation resource, a novel method to spatially ground toponyms in text and a component-based evaluation based on this new reference corpus.
Understanding combination of evidence using generative probabilistic models for information retrieval (abstract only) BIBAFull-Text 603
  Paul Ogilvie
Structured documents, rich information needs, and detailed information about users are becoming more pervasive within everyday computing usage. Applications such as Question Answering, reading tutors, and XML retrieval demand more robust retrieval on richly annotated documents. In order to effectively serve these applications, the community will need a better understanding of the combination of evidence. In this work, I propose that the use of simple generative probabilistic models will be an effective framework for these problems. Statistical language models, which are a special case of generative probabilistic models, have been used extensively within recent Information Retrieval research. Their flexibility has been very effective in adapting to numerous tasks and problems. I propose to extend the statistical language modeling framework to handle rich information needs and documents with structural and linguistic annotations. Much of the prior work on combination of evidence has had few well-studied theoretical contributions, so I also propose to develop a sounder theoretical basis which gives more predictable results.
Discovering and representing the contextual and narrative structure of e-books to support reading and comprehension (abstract only) BIBAFull-Text 603
  Yixing Sun
A person reading a book needs to build an understanding based on the available textual materials. As a result of a survey of users' reading behaviours and of existing e-Book user interfaces, we found that most of these interfaces provide poor support for the actual processes of reading and comprehension. In particular, there is generally minimal support for understanding the overall structure (or contextual structure) and the narrative structure of a book. We propose adapting topic tracking and detection techniques to discover the narrative threads within a book, and hence its narrative structure. The contextual and narrative structures will be presented to the user through purpose-designed visualisations, which will be integrated and linked within a newly developed e-Book browser. We have chosen to use the Bible as our test corpus, as it has a rich narrative structure, and relatively complex contextual structure. Evaluation of the interface, and its components, will be done through field studies involving actual readers of the Bible, to assess the effectiveness of the user interface in enhancing a user's experience.
Reliability and verification of natural language text on the world wide web (abstract only) BIBAFull-Text 603
  Melanie J. Martin
The hypothesis that information on the Web can be verified automatically, with minimal user interaction, will be tested by building and evaluating an interactive system. In this paper, verification is defined as a reasonable determination of the truth or correctness of a statement by examination, research, or comparison with similar text. The system will contain modules for reliability ranking, query processing, document retrieval, and document clustering based on agreement. The query processing and document retrieval components will use standard IR techniques. The reliability module will estimate the likelihood that a statement on the Web can be trusted using standards developed by information scientists, as well as linguistic aspects of the page and the link structure of associated web pages. The clustering module will cluster relevant documents based on whether or not they agree or disagree with the statement to be verified. Relevant references are discussed.
An artificial intelligence approach to information retrieval (abstract only) BIBAFull-Text 603
  Andrew Trotman
Current approaches to information retrieval rely on the creativity of individuals to develop new algorithms. In this investigation the use of genetic algorithms (GA) and genetic programming (GP) to learn IR algorithms is examined.
   Document structure weighting is a technique whereby different parts of a document (title, abstract, etc.) contribute unevenly to the overall document weight during ranking. Near optimal weights can be learned with a GA. Doing so shows a statistically significant 5% relative improvement in MAP for vector space inner product and Croft's probabilistic ranking, but no improvement for BM25. Two applications of this approach are suggested: offline learning, and relevance feedback.
   In a second set of experiments, a new ranking function was learned using GP. This new function yields a statistically significant 11% relative improvement on unseen queries tested on the training documents. Portability tests to different collections (not used in training) demonstrate the performance of the new function exceeds vector space and probability, and slightly exceeds BM25. Learning weights for this new function is proposed.
   The application of genetic learning to stemming and thesaurus construction is discussed. Stemming rules such as those of the Porter algorithm are candidates for GP learning whereas synonym sets are candidates for GA learning.
Supporting multiple information-seeking strategies in a single system framework (abstract only) BIBAFull-Text 604
  Xiaojun Yuan
This research explores the relationship between information-seeking strategies (ISSs) and information retrieval (IR) system design. When people seek information they engage in a variety of ISSs in order to search for specific items, learn about the contents of the database, evaluate retrieved information, and so on.
   The theoretical foundations of the work are based on the information-seeking episode model developed by Belkin (1996), and the multi-facet classification scheme of information behaviors proposed by Cool & Belkin (2002). The goal of this research is to construct and evaluate an interactive retrieval system which uses different combinations of IR techniques to support different ISSs. Example IR techniques include comparison using exact and probabilistic matching algorithms; summarization of information objects using titles, snippets or abstracts; visualization techniques such as lists or classified results; and navigation techniques such as scrolling or following links. By designing a retrieval system with diverse strategies in mind, we can adaptively support multiple ISSs, permitting a user to move seamlessly from one strategy to another, choosing instantiations of each support technique tailored to the specific ISS. The research will be conducted in a series of four steps. (1) Develop an object-oriented framework for representing basic IR techniques. (2) Design, implement and evaluate systems which support individual ISSs such as browsing and searching. (3) Specify an interaction structure for guiding and controlling sequences of different supporting techniques. (4) Design, implement, and evaluate a dynamically adaptive system supporting multiple ISSs in comparison to a non-adaptive baseline system.