
Proceedings of ECIR'06, the 2006 European Conference on Information Retrieval

Fullname: ECIR 2006: Advances in Information Retrieval: 28th European Conference on IR Research
Editors: Mounia Lalmas; Andy MacFarlane; Stefan Rüger; Anastasios Tombros; Theodora Tsikrika; Alexei Yavlinsky
Location: London, United Kingdom
Dates: 2006-Apr-10 to 2006-Apr-12
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 3936
Standard No: DOI: 10.1007/11735106; hcibib: ECIR06; ISBN: 978-3-540-33347-0 (print), 978-3-540-33348-7 (online)
Links: Online Proceedings | Conference Home Page
  1. Progress in Information Retrieval
  2. Formal Models
  3. Document and Query Representation and Text Understanding
  4. Design and Evaluation
  5. Topic Identification and News Retrieval
  6. Clustering and Classification
  7. Refinement and Feedback
  8. Performance and Peer-to-Peer Networks
  9. Web Search
  10. Structure/XML
  11. Multimedia
  12. Cross-Language Retrieval
  13. Genomic IR
  14. Posters

Progress in Information Retrieval

Progress in Information Retrieval (pp. 1-11)
  Mounia Lalmas; Stefan Rüger; Theodora Tsikrika; Alexei Yavlinsky
This paper summarises the scientific work presented at the 28th European Conference on Information Retrieval and demonstrates that the field has not only significantly progressed over the last year but has also continued to make inroads into areas such as Genomics, Multimedia, Peer-to-Peer and XML retrieval.
Enterprise Search -- The New Frontier? (p. 12)
  David Hawking
The advent of the current generation of Web search engines around 1998 challenged the relevance of academic information retrieval research -- established evaluation methodologies neither scaled nor reflected the diverse purposes to which search engines are now put. Academic ranking algorithms of the time almost completely ignored the features which underpin modern web search: query-independent evidence and evidence external to the document. Unlike their commercial counterparts, academic researchers have for years been unable to access Web-scale collections and their corresponding link graphs and search logs.

Formal Models

Frequentist and Bayesian Approach to Information Retrieval (pp. 13-24)
  Giambattista Amati
We introduce the hypergeometric models KL, DLH and DLLH using the DFR approach, and we compare these models to other relevant models of IR. The hypergeometric models are based on the probability of observing two probabilities: the relative within-document term frequency and the entire-collection term frequency. Hypergeometric models are parameter-free models of IR. Experiments show that these models perform excellently on both small and very large collections. We provide their foundations from the same IR probability space as language modelling (LM). We finally discuss the difference between DFR and LM. Briefly, DFR is a frequentist (Type I), or combinatorial, approach, whilst language models use a Bayesian (Type II) approach for mixing the two probabilities and are thus inherently parametric in nature.
Using Proportional Transportation Distances for Measuring Document Similarity (pp. 25-36)
  Xiaojun Wan; Jianwu Yang
A novel document similarity measure based on the Proportional Transportation Distance (PTD) is proposed in this paper. The proposed measure improves on a previously proposed similarity measure based on optimal matching by allowing many-to-many matching between the subtopics of documents. After documents are decomposed into sets of subtopics, the Proportional Transportation Distance is employed to evaluate the similarity between the subtopic sets of two documents by solving a transportation problem. Experiments on TDT-3 data demonstrate its good ability to measure document similarity and its high robustness: unlike the optimal-matching-based measure, it does not depend strongly on the underlying document decomposition algorithm.
A User-Item Relevance Model for Log-Based Collaborative Filtering (pp. 37-48)
  Jun Wang; Arjen P. de Vries; Marcel J. T. Reinders
Implicit acquisition of user preferences makes log-based collaborative filtering attractive for producing recommendations in practice. In this paper, we follow a formal approach from text retrieval to re-formulate the problem. Based on the classic probability ranking principle, we propose a probabilistic user-item relevance model. Under this formal model, we show that user-based and item-based approaches are merely two different factorizations with different independence assumptions. Moreover, we show that smoothing is an important aspect of estimating the parameters of the models, due to data sparsity. By adding linear interpolation smoothing, the proposed model gives a probabilistic justification for using TF×IDF-like item ranking in collaborative filtering. Besides offering insight into the problem of collaborative filtering, we show experiments in which the proposed method provides better recommendation performance on a music play-list data set.
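The item-based variant with linear interpolation smoothing can be illustrated with a minimal sketch (an illustration of the general idea only, not the authors' model; the interpolation weight and the log-likelihood scoring are assumptions):

```python
import math
from collections import Counter, defaultdict

def rank_items(user_log, all_logs, lam=0.5):
    """Item-based ranking sketch: score each unseen item i by
    sum_j log( lam * P(i | j) + (1 - lam) * P(i) ) over items j in the
    user's log. Linear interpolation with the background model P(i)
    smooths the sparse co-occurrence estimates."""
    item_count = Counter()
    pair_count = defaultdict(Counter)      # pair_count[j][i] = co-occurrences
    for log in all_logs:
        items = set(log)
        item_count.update(items)
        for j in items:
            for i in items:
                if i != j:
                    pair_count[j][i] += 1
    seen = set(user_log)
    scores = {}
    for i in item_count:
        if i in seen:
            continue                       # recommend only unplayed items
        p_bg = item_count[i] / len(all_logs)   # background popularity
        s = 0.0
        for j in seen:
            p_cond = pair_count[j][i] / item_count[j]
            s += math.log(lam * p_cond + (1 - lam) * p_bg)
        scores[i] = s
    return sorted(scores, key=scores.get, reverse=True)
```

Smoothing keeps every conditional probability strictly positive, which is what allows the log-likelihood sum over sparse play-list data.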

Document and Query Representation and Text Understanding

Generating Search Term Variants for Text Collections with Historic Spellings (pp. 49-60)
  Andrea Ernst-Gerlach; Norbert Fuhr
In this paper, we describe a new approach for retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in ancient orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms for which the contemporary spellings have to be assigned manually. Then our algorithm produces a set of probabilistic rules. These probabilities can be considered for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
Efficient Phrase Querying with Common Phrase Index (pp. 61-71)
  Matthew Chang; Chung Keung Poon
In this paper, we propose the common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency with negligible extra storage cost. In our experimental evaluation, the common phrase index yields 5% and 20% improvements in query time over an auxiliary nextword index for overall and large queries (queries of long phrases) respectively, while using only 1% extra storage. Compared with an inverted index, the improvements are 40% and 72% for overall and large queries respectively.
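For context, the auxiliary nextword index used as the baseline can be sketched roughly as follows (a simplified illustration, not the paper's common phrase index; the tokenised documents and helper names are invented):

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a positional inverted index and a nextword index
    (word -> following word -> postings) over tokenised documents."""
    inverted = defaultdict(list)                  # word -> [(doc_id, pos), ...]
    nextword = defaultdict(lambda: defaultdict(list))
    for doc_id, words in enumerate(docs):
        for pos, w in enumerate(words):
            inverted[w].append((doc_id, pos))
            if pos + 1 < len(words):
                nextword[w][words[pos + 1]].append((doc_id, pos))
    return inverted, nextword

def phrase_query(phrase, nextword):
    """Find documents containing the phrase by intersecting nextword
    postings of consecutive word pairs."""
    words = phrase.split()
    if len(words) < 2:
        raise ValueError("needs at least two words")
    # candidate (doc, start) pairs from the first word pair
    result = set(nextword[words[0]][words[1]])
    for k in range(1, len(words) - 1):
        step = {(d, p - k) for d, p in nextword[words[k]][words[k + 1]]}
        result &= step
    return sorted({d for d, _ in result})
```

Pairing each word with its successor is what lets phrase queries avoid the expensive position-list intersections of a plain inverted index.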
Document Length Normalization Using Effective Level of Term Frequency in Large Collections (pp. 72-83)
  Soheila Karbasi; Mohand Boughanem
The effectiveness of information retrieval systems is largely dependent on term-weighting. Most current term-weighting approaches involve some form of term frequency normalization. We develop here a method to assess the potential role of the term frequency-inverse document frequency measures that are commonly used in text retrieval systems. Since automatic information retrieval systems have to deal with documents of varying sizes and terms of varying frequencies, we carried out preliminary tests to evaluate the effect of term-weighting factors on retrieval performance. Based on these preliminary tests, we identify a novel factor, the effective level of term frequency (EL), which represents the document content based on its length and maximum term frequency. This factor is used to find the main terms within documents and an appropriate subset of documents containing the query terms. We show that not all document terms need to be considered for ranking a document with respect to a query. According to the experimental results, the effective level of term frequency is a significant factor in retrieving relevant documents, especially in large collections. Experiments were undertaken on TREC collections to evaluate the effectiveness of our proposal.

Design and Evaluation

Beyond the Web: Retrieval in Social Information Spaces (pp. 84-95)
  Sebastian Marius Kirsch; Melanie Gnasa; Armin B. Cremers
We research whether the inclusion of information about an information user's social environment and his position in the social network of his peers leads to an improvement in search effectiveness.
   Traditional information retrieval methods fail to address the fact that information production and consumption are social activities. We ameliorate this problem by extending the domain model of information retrieval to include social networks.
   We describe a technique for information retrieval in such an environment and evaluate it in comparison to vector space retrieval.
Evaluating Web Search Result Summaries (pp. 96-106)
  Shao Fen Liang; Siobhan Devlin; John Tait
The aim of our research is to produce and assess short summaries to aid users' relevance judgements, for example on a search engine result page. In this paper we present our new metric for measuring summary quality, based on representativeness and judgeability, and compare the summary quality of our system to that of Google. We discuss the basis for constructing our evaluation methodology in contrast to previous relevant open evaluations, arguing that the elements which make up an evaluation methodology (the tasks, data and metrics) are interdependent and that the way in which they are combined is critical to the effectiveness of the methodology. The paper discusses the relationship between these three factors as implemented in our own work, as well as in SUMMAC/MUC/DUC.
Measuring the Complexity of a Collection of Documents (pp. 107-118)
  Vishwa Vinay; Ingemar J. Cox; Natasa Milic-Frayling; Ken Wood
Some text collections are more difficult to search or more complex to organize into topics than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters. We compute this quantity for document collections that are represented as a set of term vectors. We consider applications of the Cox-Lewis statistic in three scenarios: comparing clusterability of different text collections using the same representation, comparing different representations of the same text collection, and predicting the query performance based on the clusterability of the query results set. Our experimental results show a correlation between the observed effectiveness and this statistic, thereby demonstrating the utility of such data analysis in text retrieval.

Topic Identification and News Retrieval

Sentence Retrieval with LSI and Topic Identification (pp. 119-130)
  David Parapar; Álvaro Barreiro
This paper presents two sentence retrieval methods. We adopt the task definition of the TREC Novelty Track: sentence retrieval consists of extracting the sentences relevant to a query from a set of documents relevant to that query. We compared the performance of the Latent Semantic Indexing (LSI) retrieval model against that of a topic identification method, also based on Singular Value Decomposition (SVD) but with a different sentence selection method. We used the TREC Novelty Track collections from 2002 and 2003 for the evaluation. The results of our experiments show that these techniques, particularly sentence retrieval based on topic identification, are valid alternatives to other more ad-hoc methods devised for this task.
Ranking Web News Via Homepage Visual Layout and Cross-Site Voting (pp. 131-142)
  Jinyi Yao; Jue Wang; Zhiwei Li; Mingjing Li; Wei-Ying Ma
Reading news is one of the most popular activities when people surf the internet. As many news sources provide independent news information, each with its own preferences, detecting unbiased important news can be very useful for users to keep up to date with what is happening in the world. In this paper we present a novel method to identify important news in a web environment consisting of diversified online news sites. We observe that a piece of important news generally occupies a visually significant place on the homepage of a news site, and that an important news event will be reported by many news sites. To exploit these two properties, we model the relationship between homepages, news and latent events by a tripartite graph, and present an algorithm to identify important news in this model. Based on this algorithm, we implement a system, TOPSTORY, to dynamically generate homepages for users to browse important news reports. Our experimental study indicates the effectiveness of the proposed approach.
Clustering-Based Searching and Navigation in an Online News Source (pp. 143-154)
  Simón C. Smith; M. Andrea Rodríguez
The growing amount of online news posted on the WWW demands new algorithms that support topic detection, search, and navigation of news documents. This work presents an algorithm for topic detection that considers the temporal evolution of news and the structure of web documents. Then, it uses the results of the topic detection algorithm for searching and navigating in an online news source. An experimental evaluation with a collection of online news in Spanish indicates the advantages of incorporating the temporal aspect and structure of documents in the topic detection of news. In addition, topic-based clusters are well suited for guiding the search and navigation of news.

Clustering and Classification

Mobile Clustering Engine (pp. 155-166)
  Claudio Carpineto; Andrea Della Pietra; Stefano Mizzaro; Giovanni Romano
Although mobile information retrieval is seen as the next frontier of the search market, the rendering of results on mobile devices is still unsatisfactory. We present Credino, a clustering engine for PDAs based on the theory of concept lattices that can help overcome some specific challenges posed by small-screen, narrow-band devices. Credino is probably the first clustering engine for mobile devices freely available for testing on the Web. An experimental evaluation, besides confirming that finding information is more difficult on a PDA than on a desktop computer, suggests that a mobile clustering engine is more effective than a plain mobile search engine.
Improving Quality of Search Results Clustering with Approximate Matrix Factorisations (pp. 167-178)
  Stanislaw Osinski
In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering algorithms: Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). For our experiments we use the standard merge-then-cluster approach based on the Open Directory Project web catalogue as a source of human-clustered document summaries.
Adapting the Naive Bayes Classifier to Rank Procedural Texts (pp. 179-190)
  Ling Yin; Richard Power
This paper presents a machine-learning approach for ranking web documents according to the proportion of procedural text they contain. By 'procedural text' we refer to ordered lists of steps, which are very common in some instructional genres such as online manuals. Our initial training corpus is built up by applying some simple heuristics to select documents from a large collection and contains only a few documents with a large proportion of procedural texts. We adapt the Naive Bayes classifier to better fit this less than ideal training corpus. This adapted model is compared with several other classifiers in ranking procedural texts using different sets of features and is shown to perform well when only highly distinctive features are used.
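A generic multinomial Naive Bayes ranker illustrates the starting point of such an approach (the standard classifier, not the paper's adapted variant; the smoothing constant and toy labels are assumptions):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model with add-alpha smoothing.
    docs: list of token lists; labels: parallel list of 0/1 (1 = procedural)."""
    counts = {0: Counter(), 1: Counter()}
    class_docs = Counter(labels)
    for words, y in zip(docs, labels):
        counts[y].update(words)
    vocab = set(counts[0]) | set(counts[1])
    model = {"vocab": vocab, "prior": {}, "cond": {0: {}, 1: {}}}
    for y in (0, 1):
        model["prior"][y] = class_docs[y] / len(docs)
        total = sum(counts[y].values())
        for w in vocab:
            model["cond"][y][w] = (counts[y][w] + alpha) / (total + alpha * len(vocab))
    return model

def procedural_score(words, model):
    """Log-odds of the 'procedural' class; documents can be ranked by
    this score rather than hard-classified."""
    s = math.log(model["prior"][1]) - math.log(model["prior"][0])
    for w in words:
        if w in model["vocab"]:
            s += math.log(model["cond"][1][w]) - math.log(model["cond"][0][w])
    return s
```

Ranking by log-odds rather than thresholding is what turns the classifier into a ranker of documents by their proportion of procedural text.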

Refinement and Feedback

The Effects of Relevance Feedback Quality and Quantity in Interactive Relevance Feedback: A Simulation Based on User Modeling (pp. 191-204)
  Heikki Keskustalo; Kalervo Järvelin; Ari Pirkola
Experiments on the effectiveness of relevance feedback with real users are time-consuming and expensive. This makes simulation for rapid testing desirable. We define a user model which helps to quantify some of the interaction decisions involved in simulated relevance feedback. First, the relevance criterion defines the relevance threshold of the user for accepting documents as relevant to his/her needs. Second, the browsing effort refers to the patience of the user to browse through the initial list of retrieved documents in order to give feedback. Third, the feedback effort refers to the effort and ability of the user to collect feedback documents. We use the model to construct several simulated relevance feedback scenarios in a laboratory setting. Using TREC data with graded relevance assessments, we study the effect of the quality and quantity of the feedback documents on the effectiveness of relevance feedback and compare this to pseudo-relevance feedback. Our results indicate that small amounts of highly relevant feedback can compensate for large amounts of relevant but lower-quality feedback.
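The three components of the user model might be simulated along these lines (a hedged sketch; the parameter names and default values are invented, not taken from the paper):

```python
def simulate_feedback(ranked, graded_relevance, threshold=2,
                      browse_depth=10, max_feedback=3):
    """Sketch of a three-part simulated user: a relevance criterion
    (graded-relevance threshold), browsing effort (how deep the user
    scans the initial ranking) and feedback effort (how many relevant
    documents the user collects as feedback)."""
    feedback = []
    for doc in ranked[:browse_depth]:          # browsing effort
        grade = graded_relevance.get(doc, 0)
        if grade >= threshold:                 # relevance criterion
            feedback.append(doc)
        if len(feedback) == max_feedback:      # feedback effort
            break
    return feedback
```

Varying `threshold` simulates feedback quality, while `browse_depth` and `max_feedback` simulate its quantity, which is exactly the trade-off the study measures.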
Using Query Profiles for Clarification (pp. 205-216)
  Henning Rode; Djoerd Hiemstra
This paper proposes a new kind of relevance feedback. It shows how so-called query profiles can be employed for disambiguation and clarification.
   Query profiles provide useful summarized previews on the retrieved answers to a given query. They outline ambiguity in the query and when combined with appropriate means of interactivity allow the user to easily adapt the final ranking. Statistical analysis of the profiles even enables the retrieval system to automatically suggest search restrictions or preferences. The paper shows a preliminary experimental study of the proposed feedback methods within the setting of TREC's interactive HARD track.
Lexical Entailment for Information Retrieval (pp. 217-228)
  Stéphane Clinchant; Cyril Goutte; Eric Gaussier
Textual Entailment has recently been proposed as an application-independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task in many NLP applications. In this contribution, we investigate the use of various lexical entailment models in Information Retrieval, using the language modelling framework. We show that lexical entailment potentially provides a significant boost in performance, similar to pseudo-relevance feedback, but at a lower computational cost. In addition, we show that the performance is relatively stable with respect to the corpus on which the lexical entailment measure is estimated.
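One way to plug lexical entailment into the language modelling framework is a translation-style query likelihood (a rough illustration under assumed inputs; the entailment table `entail`, the smoothing weight and the floor constant are all hypothetical, not the paper's models):

```python
import math

def entailment_lm_score(query, doc_lm, entail, collection_lm, lam=0.8):
    """Score a document by expanding P(w | d) through lexical
    entailment probabilities P(w | t) over document terms t, then
    smoothing with the collection model.
    entail[t][w] is an (assumed) precomputed entailment table;
    doc_lm and collection_lm map terms to probabilities."""
    score = 0.0
    for w in query:
        # probability mass for w reached via terms the document contains
        p_trans = sum(entail.get(t, {}).get(w, 0.0) * p_t
                      for t, p_t in doc_lm.items())
        # linear interpolation with the collection model avoids log(0)
        p = lam * p_trans + (1 - lam) * collection_lm.get(w, 1e-9)
        score += math.log(p)
    return score
```

In this form a document about "car" can still score well for the query term "automobile", which is the retrieval benefit the abstract attributes to lexical entailment.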

Performance and Peer-to-Peer Networks

A Hybrid Approach to Index Maintenance in Dynamic Text Retrieval Systems (pp. 229-240)
  Stefan Büttcher; Charles L. A. Clarke
In-place and merge-based index maintenance are the two main competing strategies for on-line index construction in dynamic information retrieval systems based on inverted lists. Motivated by recent results for both strategies, we investigate possible combinations of in-place and merge-based index maintenance. We present a hybrid approach in which long posting lists are updated in-place, while short lists are updated using a merge strategy. Our experimental results show that this hybrid approach achieves better indexing performance than either method (in-place, merge-based) alone.
Efficient Parallel Computation of PageRank (pp. 241-252)
  Christian Kohlschütter; Paul-Alexandru Chirita; Wolfgang Nejdl
PageRank is inherently massively parallelizable and distributable as a result of the web's strict host-based link locality. We show that the Gauß-Seidel iterative method can be applied in such a parallel ranking scenario in order to improve convergence. By introducing a two-dimensional web model and adapting PageRank to this environment, we present efficient methods to compute the exact rank vector even for large-scale web graphs in only a few minutes and iteration steps, with intrinsic support for incremental web crawling, and without the need for page sorting/reordering or for sharing global rank information.
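For intuition, a single-machine Gauss-Seidel PageRank sweep looks roughly like this (a sketch of the iterative method only, not the paper's two-dimensional distributed scheme; it assumes every page has at least one outlink):

```python
def pagerank_gauss_seidel(links, d=0.85, tol=1e-10, max_iter=200):
    """PageRank via Gauss-Seidel iteration: each update of x[i] already
    uses the values updated earlier in the same sweep, which typically
    converges in fewer iterations than the plain power method.
    links[i] lists the pages that page i points to (no dangling pages)."""
    n = len(links)
    incoming = [[] for _ in range(n)]
    out_deg = [len(t) for t in links]
    for i, targets in enumerate(links):
        for j in targets:
            incoming[j].append(i)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            new = (1 - d) / n + d * sum(x[j] / out_deg[j] for j in incoming[i])
            delta += abs(new - x[i])
            x[i] = new                      # in-place: later updates see it
        if delta < tol:
            break
    return x
```

Because each sweep reuses fresh values in place, no second rank vector is needed, which is also what makes the method attractive for memory-constrained distributed settings.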
Comparing Different Architectures for Query Routing in Peer-to-Peer Networks (pp. 253-264)
  Henrik Nottelmann; Norbert Fuhr
Efficient and effective routing of content-based queries is an emerging problem in peer-to-peer networks, and can be seen as an extension of the traditional "resource selection" problem. Although some approaches have been proposed, finding the best architecture (defined by the network topology, the underlying selection method, and its integration into peer-to-peer networks) is still an open problem. This paper investigates different building blocks of such architectures, among them the decision-theoretic framework, CORI, hierarchical networks, distributed hash tables and HyperCubes. The evaluation on a large test-bed shows that the decision-theoretic framework can be applied effectively and cost-efficiently onto peer-to-peer networks.
Automatic Document Organization in a P2P Environment (pp. 265-276)
  Stefan Siersdorfer; Sergej Sizov
This paper describes an efficient method for constructing reliable machine learning applications in peer-to-peer (P2P) networks by building ensemble-based meta methods. We consider this problem in the context of distributed Web exploration applications such as focused crawling. Typical applications are user-specific classification of retrieved Web contents into personalized topic hierarchies as well as automatic refinement of such taxonomies using unsupervised machine learning methods (e.g. clustering). Our approach is to combine models from multiple peers and to construct an advanced decision model that takes the generalization performance of the individual 'local' peer models into account. In addition, meta algorithms can be applied in a restrictive manner, i.e. by leaving out some 'uncertain' documents. The results of our systematic evaluation show the viability of the proposed approach.

Web Search

Exploring URL Hit Priors for Web Search (pp. 277-288)
  Ruihua Song; Guomao Xin; Shuming Shi; Ji-Rong Wen; Wei-Ying Ma
A URL usually contains meaningful information for measuring the relevance of a Web page to a query in Web search. Some existing work utilizes URL depth priors (i.e. the probability of a page being a good page given the length and depth of its URL) to improve some types of Web search tasks. This paper suggests using the locations at which query terms occur in a URL to measure how well a web page matches a user's information need. First, we define and estimate URL hit-type priors, i.e. the prior probability of a page being a good answer given the type of query-term hits in its URL. The main advantage of URL hit priors over depth priors is that they achieve stable improvement for both informational and navigational queries. Second, an obstacle to exploiting such priors is that shortening and concatenation are frequently used in URLs. Our investigation shows that only 30% of URL hits are recognized by an ordinary word-breaking approach, so we combine three methods to improve matching. Finally, the priors are integrated into the probabilistic model to enhance web document retrieval. Our experiments, conducted on 7 query sets from TREC 2002, TREC 2003 and TREC 2004, show that the proposed approach is stable and improves retrieval effectiveness by 4%~11% for navigational queries and 10% for informational queries.
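The notion of URL hit types can be illustrated with a toy classifier (the type names and matching rules here are invented for illustration, not the paper's taxonomy or its word-breaking methods):

```python
from urllib.parse import urlparse

def url_hit_type(url, query_terms):
    """Classify where each query term 'hits' a URL: in the host name,
    in a directory of the path, or in the final file component."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    parts = [p for p in parsed.path.lower().split("/") if p]
    file_part = parts[-1] if parts else ""
    dirs = parts[:-1]
    hits = []
    for t in (t.lower() for t in query_terms):
        if t in host:
            hits.append("host")
        elif any(t in d for d in dirs):
            hits.append("dir")
        elif t in file_part:
            hits.append("file")
        else:
            hits.append("none")
    return hits
```

A prior P(good page | hit type) estimated from training queries could then be attached to each type; substring matching is used here precisely because, as the abstract notes, URLs concatenate and shorten words.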
A Study of Blog Search (pp. 289-301)
  Gilad Mishne; Maarten de Rijke
We present an analysis of a large blog search engine query log, exploring a number of angles such as query intent, query topics, and user sessions. Our results show that blog searches have different intents than general web searches, suggesting that the primary targets of blog searchers are tracking references to named entities, and locating blogs by theme. In terms of interest areas, blog searchers are, on average, more engaged in technology, entertainment, and politics than web searchers, with a particular interest in current events. The user behavior observed is similar to that in general web search: short sessions with an interest in the first few results only.
A Comparative Study of the Effectiveness of Search Result Presentation on the Web (pp. 302-313)
  Hideo Joho; Joemon M. Jose
Presentation of search results in Web-based information retrieval (IR) systems has been dominated by textual information such as the title, snippet, URL, and/or file type of the retrieved documents. On the other hand, a document's visual aspects, such as its layout, colour scheme, or presence of images, have been studied only in limited contexts with regard to their effectiveness in search result presentation. This paper presents a comparative evaluation of textual and visual forms of document summaries as an additional document surrogate in search result presentation. In our study, a sentence-based summarisation technique was used to create the textual document summary, and a thumbnail image of the web page was used as the visual summary. The experimental results suggest that for both forms there are cases where the additional element had a positive effect not only on users' relevance assessments but also on query re/formulation. The results also suggest that the two forms of document summary are likely to have different contexts in which they facilitate the user's search experience. Our study therefore calls for further research on adaptive models of IR systems that exploit their respective advantages in appropriate contexts.


Structure/XML

Bricks: The Building Blocks to Tackle Query Formulation in Structured Document Retrieval (pp. 314-325)
  Roelof van Zwol; Jeroen Baas; Herre van Oostendorp; Frans Wiering
Structured document retrieval focusses on the retrieval of relevant document fragments for a given information need that contains both structural and textual aspects.
   We focus here on the theory behind Bricks, a visual query formulation technique for structured document retrieval that aims at reducing the complexity of the query formulation process and required knowledge of the underlying document structure for the user, while maintaining full expression power, as offered by the NEXI query language for XML retrieval.
   In addition, we present the outcomes of a large-scale usability experiment which compared Bricks to a keyword-based and a NEXI-based interface. The results show that participants were more successful at completing search assignments using Bricks. Furthermore, we observed that participants were able to complete complex search assignments significantly faster when using the Bricks interface.
Structural Feedback for Keyword-Based XML Retrieval (pp. 326-337)
  Ralf Schenkel; Martin Theobald
Keyword-based queries are an important means of retrieving information from XML collections with unknown or complex schemas. Relevance feedback integrates relevance information provided by a user to enhance retrieval quality. For keyword-based XML queries, feedback engines usually generate an expanded keyword query from the content of elements marked as relevant or nonrelevant. This approach, which is inspired by text-based IR, completely ignores the semistructured nature of XML. This paper makes the important step from purely content-based to structural feedback. It presents a framework that expands a keyword query into a full-fledged content-and-structure query. Extensive experiments with the established INEX benchmark and our TopX search engine show the feasibility of our approach.
Machine Learning Ranking for Structured Information Retrieval (pp. 338-349)
  Jean-Noël Vittaut; Patrick Gallinari
We consider the Structured Information Retrieval task, which consists in ranking nested textual units according to their relevance for a given query in a collection of structured documents. We propose to improve the performance of a baseline Information Retrieval system by using a learning ranking algorithm which operates on scores computed from document elements and from their local structural context. This model is trained to optimize a ranking loss criterion using a training set of annotated examples composed of queries and relevance judgments on a subset of the document elements. The model can produce a ranked list of document elements which fulfils a given information need expressed in the query. We analyze the performance of our algorithm on the INEX collection and compare it to a baseline model which is an adaptation of Okapi to Structured Information Retrieval.
Generating and Retrieving Text Segments for Focused Access to Scientific Documents (pp. 350-361)
  Caterina Caracciolo; Maarten de Rijke
When presented with a retrieved document, users of a search engine are usually left with the task of pinning down the relevant information inside the document. Often this is done by a time-consuming combination of skimming, scrolling and Ctrl+F. In the setting of a digital library for scientific literature the issue is especially urgent when dealing with reference works, such as surveys and handbooks, as these typically contain long documents. Our aim is to develop methods for providing a "go-read-here" type of retrieval functionality, which points the user to a segment where she can best start reading to find out about her topic of interest. We examine multiple query-independent ways of segmenting texts into coherent chunks that can be returned in response to a query. Most (experienced) authors use paragraph breaks to indicate topic shifts, thus providing us with one way of segmenting documents. We compare this structural method with semantic text segmentation methods, both with respect to topical focus and relevancy. Our experimental evidence is based on manually segmented scientific documents and a set of queries against this corpus. Structural segmentation based on contiguous blocks of relevant paragraphs is shown to be a viable solution for our intended application of providing "go-read-here" functionality.


Multimedia

Browsing Personal Images Using Episodic Memory (Time + Location) (pp. 362-372)
  Chufeng Chen; Michael Oakes; John Tait
In this paper we consider episodic memory in system design for image retrieval. Time and location are the main factors in episodic memory, and these two types of data were combined for image event clustering. We conducted a user study comparing five image browsing systems, using search time and user satisfaction as the criteria for success. Our results showed that the browser which clusters images based on combined time and location data was significantly better than four other more standard browsers. This suggests that episodic memory is potentially useful for improving personal image management.
An Information Retrieval System for Motion Capture Data (pp. 373-384)
  Bastian Demuth; Tido Röder; Meinard Müller; Bernhard Eberhardt
Motion capturing has become an important tool in fields such as sports sciences, biometrics, and particularly in computer animation, where large collections of motion material are accumulated in the production process. In order to fully exploit motion databases for reuse and for the synthesis of new motions, one needs efficient retrieval and browsing methods to identify similar motions. So far, only ad-hoc methods for content-based motion retrieval have been proposed, which lack efficiency and rely on quantitative, numerical similarity measures, making it difficult to identify logically related motions. We propose an efficient motion retrieval system based on the query-by-example paradigm, which employs qualitative, geometric similarity measures. This allows for intuitive and interactive browsing in a purely content-based fashion without relying on textual annotations. We have incorporated this technology in a novel user interface facilitating query formulation as well as visualization and ranking of search results.
Can a Workspace Help to Overcome the Query Formulation Problem in Image Retrieval? BIBAFull-Text 385-396
  Jana Urban; Joemon M. Jose
We have proposed a novel image retrieval system that incorporates a workspace where users can organise their search results. A task-oriented and user-centred experiment has been devised involving design professionals and several types of realistic search tasks. We study the workspace's effect on two aspects: task conceptualisation and query formulation. A traditional relevance feedback system serves as baseline. The results of this study show that the workspace is more useful with respect to both of the above aspects. The proposed approach leads to a more effective and enjoyable search experience.

Cross-Language Retrieval

A Fingerprinting Technique for Evaluating Semantics Based Indexing BIBAFull-Text 397-406
  Eduard Hoenkamp; Sander van Dijk
The quality of search engines depends usually on the content of the returned documents rather than on the text used to express this content. So ideally, search techniques should be directed more toward the semantic dependencies underlying documents than toward the texts themselves. The most visible examples in this direction are Latent Semantic Analysis (LSA), and the Hyperspace Analog to Language (HAL). If these techniques are really based on semantic dependencies, as they contend, then they should be applicable across languages.
   To investigate this contention we used electronic versions of two kinds of material with their translations: a novel, and a popular treatise about cosmology. We used the analogy of fingerprinting as employed in forensics to establish whether individuals are related. Genetic fingerprinting uses enzymes to split the DNA and then compares the resulting band patterns. Likewise, in our research we used queries to split a document into fragments. If a search technique really isolates fragments semantically related to the query, then a document and its translation should have similar band patterns.
   In this paper we (1) present the fingerprinting technique, (2) introduce the material used, and (3) report results of an evaluation for two semantic indexing techniques.
A Cross-Language Approach to Historic Document Retrieval BIBAFull-Text 407-419
  Marijn Koolen; Frans Adriaans; Jaap Kamps; Maarten de Rijke
Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows: First, we are able to automatically construct rules for modernizing historic language based on comparing (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences, of historic and modern corpora. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
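The rule-based modernization step described above can be sketched as follows; the helper name `modernize` and the example rewrite rules are illustrative assumptions, not the rules actually learned by the authors:

```python
def modernize(word, rules):
    """Apply ordered historic-to-modern substring rewrite rules to a word."""
    for historic, modern in rules:
        word = word.replace(historic, modern)
    return word

# Hypothetical rules of the kind such a system might derive for 17th-century
# Dutch, where e.g. the historic spelling "oock" corresponds to modern "ook".
RULES = [("ck", "k"), ("gh", "g"), ("ae", "aa")]
print(modernize("oock", RULES))  # ook
```

Once historic query or document terms are rewritten this way, standard retrieval (and stemming) can operate on the modernized forms.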
Automatic Acquisition of Chinese-English Parallel Corpus from the Web BIBAFull-Text 420-431
  Ying Zhang; Ke Wu; Jianfeng Gao; Phil Vines
Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously only small-scale corpora have been available, thus restricting their practical use. This paper describes a system that overcomes this limitation by automatically collecting high-quality parallel bilingual corpora from the web. Previous systems used a single principal feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbour classifier. Our system was evaluated using a data set containing 6500 Chinese-English candidate parallel pairs that have been manually annotated. Experiments show that the use of a k-nearest-neighbour classifier with multiple features achieves substantial improvements over systems that use any one of these features. The system achieved a precision rate of 95% and a recall rate of 97%, and is thus a significant improvement over earlier work.
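The verification step can be sketched as a k-nearest-neighbour vote over candidate page pairs, each represented by several features; the feature values and the helper name below are invented for illustration and are not the paper's actual feature set:

```python
import math

# Each training example: (feature vector, is_parallel). The two features
# might be, e.g., a file-length ratio and a translation-overlap score --
# both hypothetical stand-ins here.
def knn_is_parallel(candidate, labelled, k=3):
    """Majority vote among the k nearest labelled candidate pairs."""
    neighbours = sorted(labelled, key=lambda ex: math.dist(ex[0], candidate))[:k]
    votes = sum(1 for _, is_parallel in neighbours if is_parallel)
    return 2 * votes > k

labelled = [((0.90, 0.80), True), ((0.85, 0.90), True),
            ((0.20, 0.10), False), ((0.10, 0.30), False)]
print(knn_is_parallel((0.88, 0.85), labelled))  # True
```

Combining several weak features in one classifier is what distinguishes this from earlier single-feature verification heuristics.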

Genomic IR

Fast Discovery of Similar Sequences in Large Genomic Collections BIBAFull-Text 432-443
  Yaniv Bernstein; Michael Cameron
Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible -- either in space or in time -- as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and executes in a time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.
Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR BIBAFull-Text 444-455
  Xiaohua Zhou; Xiaodan Zhang; Xiaohua Hu
Genomic IR, characterized by highly specific information needs, severe synonymy and polysemy problems, long term names and a rapidly growing literature, is challenging the IR community. In this paper, we focus on addressing the synonymy and polysemy issue within the language model framework. Unlike translation models and traditional query expansion techniques, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedical domain; a set of synonymous terms shares the same concept ID. Therefore, the new approach makes document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows that significant improvements are obtained by incorporating concept-based indexing into a basic language model. The MAP (mean average precision) rises significantly from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of the official runs submitted to the TREC 2004 Genomics Track and is comparable to that of the best run (40.75%). Most official runs, including the best run, make extensive use of query expansion and pseudo-relevance feedback techniques, while our approach does nothing except incorporate concept-based indexing. This supports the view that semantic smoothing, i.e. the incorporation of synonym and sense information into language models, is a more principled way to achieve the effects that traditional query expansion and pseudo-relevance feedback techniques target.
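The indexing step can be sketched as mapping synonymous surface phrases to a shared concept identifier before building the index; the synonym table and the concept label below are invented for illustration rather than taken from UMLS:

```python
# Toy synonym table; a real system would map phrases to UMLS concept IDs.
SYNONYM_TO_CONCEPT = {
    "heart attack": "CONCEPT_MI",
    "myocardial infarction": "CONCEPT_MI",
}

def concept_tokens(phrases):
    """Replace each phrase by its concept ID when one is known."""
    return [SYNONYM_TO_CONCEPT.get(p.lower(), p.lower()) for p in phrases]

# A document and a query phrased with different synonyms now share a term,
# so a plain language model can match them without query expansion:
doc = concept_tokens(["Myocardial infarction", "risk", "factors"])
query = concept_tokens(["heart attack"])
print(set(doc) & set(query))  # {'CONCEPT_MI'}
```

Because synonyms collapse to one indexing term, the language model itself needs no modification.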

Posters

The Effects of Topic Familiarity on Online Search Behaviour and Use of Relevance Criteria BIBAFull-Text 456-459
  Lei Wen; Ian Ruthven; Pia Borlund
This paper presents an experimental study on the effect of topic familiarity on the assessment behaviour of online searchers. In particular we investigate the effect of topic familiarity on the resources and relevance criteria used by searchers. Our results indicate that searching on an unfamiliar topic leads to the use of more generic and fewer specialised resources, and that searchers employ different relevance criteria when searching on less familiar topics.
PERC: A Personal Email Classifier BIBAFull-Text 460-463
  Shih-Wen Ke; Chris Bowerman; Michael Oakes
Improving the accuracy of assigning new email messages to small folders can reduce the likelihood of users creating duplicate folders for some topics. In this paper we present a hybrid classification model, PERC, and use the Enron Email Corpus to investigate the performance of kNN, SVM and PERC in a simulation of a real-time setting. Our results show that PERC is significantly better at assigning messages to small folders. The effects of different parameter settings for the classifiers are discussed.
Influence Diagrams for Contextual Information Retrieval BIBAKFull-Text 464-467
  Lynda Tamine-Lechani; Mohand Boughanem
The purpose of contextual information retrieval is to explore the design of user-specific search engines that are able to adapt the retrieval model to the variety of users' contexts. In this paper we propose an influence diagram based retrieval model which is able to incorporate contexts, viewed as users' long-term interests, into the retrieval process.
Keywords: personalized information access; influence diagrams; user context
Morphological Variation of Arabic Queries BIBAFull-Text 468-471
  Asaad Alberair; Mark Sanderson
Although test collection based studies have shown that stemming improves retrieval effectiveness in an information retrieval system, the morphological variation of queries on the same topic is less well understood. This work examines the broad morphological variation that searchers of an Arabic retrieval system put into their queries. In this study, 15 native Arabic speakers were asked to generate queries, and the morphological variants of query words were collated across users. Queries composed of either the commonest or rarest variants of each word were submitted to a retrieval system and the effectiveness of the searches was measured. It was found that queries composed of the more popular morphological variants were more likely to retrieve relevant documents than those composed of less popular variants.
Combining Short and Long Term Audio Features for TV Sports Highlight Detection BIBAFull-Text 472-475
  Bin Zhang; Weibei Dou; Liming Chen
As a bearer of high-level semantics, the audio signal is increasingly used in content-based multimedia retrieval. In this paper, we investigate TV tennis game highlight detection based on the use of both short- and long-term audio features and propose two approaches, decision fusion and a hierarchical classifier, to combine these two kinds of audio features. As more information is included in decision making, the overall performance of the system is enhanced.
Object-Based Access to TV Rushes Video BIBAFull-Text 476-479
  Alan F. Smeaton; Gareth J. F. Jones; Hyowon Lee; Noel E. O'Connor; Sorin Sav
Recent years have seen the development of different modalities for video retrieval. The most common of these are (1) to use text from speech recognition or closed captions, (2) to match keyframes using image retrieval techniques like colour and texture [6] and (3) to use semantic features like "indoor", "outdoor" or "persons". Of these, text-based retrieval is the most mature and useful, while image-based retrieval using low-level image features usually depends on matching keyframes rather than whole-shots. Automatic detection of video concepts is receiving much attention and as progress is made in this area we will see consequent impact on the quality of video retrieval. In practice it is the combination of these techniques which realises the most useful, and effective, video retrieval as shown by us repeatedly in TRECVid [5].
An Efficient Computation of the Multiple-Bernoulli Language Model BIBAFull-Text 480-483
  Leif Azzopardi; David E. Losada
The Multiple Bernoulli (MB) language model has generally been considered too computationally expensive for practical purposes and has been superseded by the more efficient multinomial approach. While the model has many attractive properties, little is actually known about its retrieval effectiveness due to its high cost of execution. In this paper, we show how an efficient implementation of this model can be achieved. The resulting method is comparable in terms of efficiency to other standard term matching algorithms (such as the vector space model, BM25 and the multinomial language model).
Title and Snippet Based Result Re-ranking in Collaborative Web Search BIBAFull-Text 484-487
  Oisín Boydell; Barry Smyth
Collaborative Web search is a form of meta-search that manipulates the results of underlying Web search engines in response to the learned preferences of a given community of users. Results that have previously been selected in response to similar queries by community members are promoted in the returned results. However, promotion is limited to these previously-selected results and in this paper we describe and evaluate how relevant results without a selection history can also be promoted by exploiting snippet-text and title similarities.
A Classification of IR Effectiveness Metrics BIBAFull-Text 488-491
  Gianluca Demartini; Stefano Mizzaro
Effectiveness is a primary concern in the information retrieval (IR) field. Various metrics for IR effectiveness have been proposed in the past; we take into account all 44 metrics we are aware of, classifying them into a two-dimensional grid. The classification is based on the notions of relevance, i.e., whether (or how much) a document is relevant, and retrieval, i.e., whether (or how much) a document is retrieved. To our knowledge, no similar classification has been proposed so far.
Experiments on Average Distance Measure BIBAFull-Text 492-495
  Vincenzo Della Mea; Gianluca Demartini; Luca Di Gaspero; Stefano Mizzaro
ADM (Average Distance Measure) is an IR effectiveness metric based on the assumptions of continuous relevance and retrieval. This paper presents some novel experimental results on two different test collections: TREC 8, re-assessed on 4-levels relevance judgments, and TREC 13 TeraByte collection. The results confirm that ADM correlation with standard measures is high, even when using less data, i.e., few documents.
Phrase Clustering Without Document Context BIBAFull-Text 496-500
  Eric SanJuan; Fidelia Ibekwe-SanJuan
We applied different clustering algorithms to the task of clustering multi-word terms in order to reflect a humanly built ontology. Clustering was done without the usual document co-occurrence information. Our clustering algorithm, CPCL (Classification by Preferential Clustered Link) is based on general lexico-syntactic relations which do not require prior domain knowledge or the existence of a training set. Results show that CPCL performs well in terms of cluster homogeneity and shows good adaptability for handling large and sparse matrices.
Rapid Development of Web-Based Monolingual Question Answering Systems BIBAFull-Text 501-504
  Edward W. D. Whittaker; Julien Hamonic; Dong Yang; Tor Klingberg; Sadaoki Furui
In this paper we describe the application of our statistical pattern classification approach to question answering (QA) to the rapid development of monolingual QA systems. We show how the approach has been applied successfully to QA in English, Japanese, Chinese, Russian and Swedish to form the basis of our publicly accessible web-based multilingual QA system at http://asked.jp.
Filtering Obfuscated Email Spam by means of Phonetic String Matching BIBAFull-Text 505-509
  Valerio Freschi; Andrea Seraghiti; Alessandro Bogliolo
Rule-based email filters mainly rely on the occurrence of critical words to classify spam messages. However, perceptive obfuscation techniques can be used to elude exact pattern matching. In this paper we propose a new technique for filtering obfuscated email spam that performs approximate pattern matching both on the original message and on its phonetic transcription.
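The phonetic-matching idea can be sketched with a simplified Soundex-style encoding; the paper works on full phonetic transcriptions of the message, so this toy encoder and the example word are illustrative assumptions only:

```python
def soundex(word: str) -> str:
    """Simplified Soundex: words that sound alike get identical codes."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = [letters[0].upper()], codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out.append(digit)
        prev = digit
    return ("".join(out) + "000")[:4]

# An obfuscated critical word still collides with its phonetic code:
print(soundex("viagra"), soundex("v1agra"))  # V260 V260
```

Exact pattern matching misses the obfuscated variant, while the phonetic codes still match, which is the effect the filter exploits.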
Sprinkling: Supervised Latent Semantic Indexing BIBAFull-Text 510-514
  Sutanu Chakraborti; Robert Lothian; Nirmalie Wiratunga; Stuart Watt
Latent Semantic Indexing (LSI) is an established dimensionality reduction technique for Information Retrieval applications. However, LSI generated dimensions are not optimal in a classification setting, since LSI fails to exploit class labels of training documents. We propose an approach that uses class information to influence LSI dimensions whereby class labels of training documents are encoded as new terms, which are appended to the documents. When LSI is carried out on the augmented term-document matrix, terms pertaining to the same class are pulled closer to each other. Evaluation over experimental data reveals significant improvement in classification accuracy over LSI. The results also compare favourably with naive Support Vector Machines.
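The "sprinkling" step can be sketched as follows; the matrix layout (rows = terms, columns = documents) and the parameter name `n_sprinkled` are assumptions for illustration:

```python
def sprinkle(term_doc_rows, labels, n_sprinkled=2):
    """Append n_sprinkled artificial 'class label' term rows per class to a
    term-document matrix; standard LSI (truncated SVD) is then run on the
    augmented matrix, pulling same-class documents closer together."""
    rows = [list(r) for r in term_doc_rows]
    for c in sorted(set(labels)):
        class_row = [1.0 if lab == c else 0.0 for lab in labels]
        rows.extend([list(class_row) for _ in range(n_sprinkled)])
    return rows

# 3 terms x 4 documents, with class labels per document column:
m = sprinkle([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]],
             ["spam", "spam", "ham", "ham"])
print(len(m))  # 7 rows: 3 original terms + 2 sprinkled rows per class
```

After the SVD, the sprinkled rows are simply discarded; only the class-aware document vectors are kept for classification.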
Web-Based Multiple Choice Question Answering for English and Arabic Questions BIBAFull-Text 515-518
  Rawia Awadallah; Andreas Rauber
Answering multiple-choice questions, where a set of possible answers is provided together with the question, constitutes a simplified but nevertheless challenging area in question answering research. This paper introduces and evaluates two novel techniques for answer selection. It furthermore analyses to what extent performance figures obtained using the English-language Web as a data source can be transferred to less dominant languages on the Web, such as Arabic. Result evaluation is based on questions from both the English and the Arabic versions of the TV show "Who wants to be a Millionaire?" as well as on the TREC-2002 QA data.
Authoritative Re-ranking of Search Results BIBAFull-Text 519-522
  Toine Bogers; Antal van den Bosch
We examine the use of authorship information in information retrieval for closed communities by extracting expert rankings for queries. We demonstrate that these rankings can be used to re-rank baseline search results and improve performance significantly. We also perform experiments in which we base expertise ratings only on first authors or on all except the final authors, and find that these limitations do not further improve our re-ranking method.
Readability Applied to Information Retrieval BIBAFull-Text 523-526
  Lorna Kane; Joe Carthy; John Dunnion
Readability refers to all characteristics of a document that contribute to its 'ease of understanding or comprehension due to the style of writing' [1]. The readability of a text is dependent on a number of factors, including but not limited to: its legibility, syntactic difficulty, semantic difficulty and the organization of the text [2]. As many as 228 variables were found to influence the readability of a text in Gray and Leary's seminal study [2]. These variables were classified as relating to document content, style, format or features of organization.
Automatic Determination of Feature Weights for Multi-feature CBIR BIBAFull-Text 527-530
  Peter Wilkins; Paul Ferguson; Cathal Gurrin; Alan F. Smeaton
Image and video retrieval are both currently dominated by approaches which combine the outputs of several different representations or features. How this combination should be done is an established research problem in content-based image retrieval (CBIR). These approaches vary from image clustering through semantic frameworks and mid-level visual features to ultimately determine sets of relative weights for the non-linear combination of features. Simple approaches to determining these weights revolve around executing a standard set of queries with known relevance judgements on some form of training data, and are iterative in nature. Whilst successful, this requires both training data and human intervention to derive the optimal weights.
Towards Automatic Retrieval of Album Covers BIBAFull-Text 531-534
  Markus Schedl; Peter Knees; Tim Pohle; Gerhard Widmer
We present first steps towards intelligent retrieval of music album covers from the web. The continuous growth of electronic music distribution constantly increases the interest in methods to automatically provide added value like lyrics or album covers. While existing approaches rely on large proprietary databases, we focus on methods that make use of the whole web by using Google's or A9.com's image search. We evaluate the current state of the approach and point out directions for further improvements.
Clustering Sentences for Discovering Events in News Articles BIBAFull-Text 535-538
  Martina Naughton; Nicholas Kushmerick; Joe Carthy
We investigate the use of clustering methods for the task of grouping the text spans in a news article that refer to the same event. We provide evidence that the order in which events are described is structured in a way that can be exploited during clustering. We evaluate our approach on a corpus of news articles describing events that have occurred in the Iraqi War.
Specificity Helps Text Classification BIBAFull-Text 539-542
  Lucas Bouma; Maarten de Rijke
We examine the impact of semantic differences between categories on classification effectiveness. Specifically, we measure the broadness and narrowness of categories in terms of their distance to the root of a hierarchically organized thesaurus. Using categories at four different degrees of broadness, we show that classifying documents into narrow categories gives better scores than classifying them into broad ones, which we attribute to the fact that more specific categories are associated with terms with a higher discriminatory power.
A Declarative DB-Powered Approach to IR BIBAFull-Text 543-547
  Roberto Cornacchia; Arjen P. de Vries
We present a prototype system using array comprehensions to bridge the gap between databases and information retrieval. It allows researchers to express their retrieval models in the General Matrix Framework for Information Retrieval [1], and have these executed on relational database systems with negligible effort.
Judging the Spatial Relevance of Documents for GIR BIBAFull-Text 548-552
  Paul D. Clough; Hideo Joho; Ross Purves
Geographic Information Retrieval (GIR) is concerned with the retrieval of documents based on both thematic and geographic content. An important issue in GIR, as for all IR, is relevance. In this paper we argue that spatial relevance should be considered independently from thematic relevance, and propose an initial scheme. A pilot study to assess this relevance scheme is presented, with initial results suggesting that users can distinguish between these two relevance dimensions, and that furthermore they have different properties. We suggest that spatial relevance requires greater assessor effort and more localised geographic knowledge than judging thematic relevance.
Probabilistic Score Normalization for Rank Aggregation BIBAFull-Text 553-556
  Miriam Fernández; David Vallet; Pablo Castells
Rank aggregation is a pervading operation in IR technology. We hypothesize that the performance of score-based aggregation may be affected by artificial, usually meaningless deviations consistently occurring in the input score distributions, which distort the combined result when the individual biases differ from each other. We propose a score-based rank aggregation model where the source scores are normalized to a common distribution before being combined. Early experiments on available data from several TREC collections are shown to support our proposal.
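A minimal instance of the proposed idea, normalizing each source's scores to a common distribution before summing, is standard-score (z-score) normalization; this sketch illustrates the general mechanism, not the paper's specific distribution model:

```python
import statistics

def zscore(scores):
    """Map one engine's scores onto a common zero-mean, unit-variance scale."""
    vals = list(scores.values())
    mu, sigma = statistics.fmean(vals), statistics.pstdev(vals) or 1.0
    return {doc: (s - mu) / sigma for doc, s in scores.items()}

def aggregate(rankings):
    """Sum normalized scores across engines and rank documents by the total."""
    combined = {}
    for scores in rankings:
        for doc, s in zscore(scores).items():
            combined[doc] = combined.get(doc, 0.0) + s
    return sorted(combined, key=combined.get, reverse=True)

# Two engines with very different raw score ranges agree once normalized:
print(aggregate([{"a": 10, "b": 5, "c": 0}, {"a": 0.9, "b": 0.2, "c": 0.1}]))
# ['a', 'b', 'c']
```

Without normalization, the first engine's large raw scores would dominate the combination regardless of what the second engine reports.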
Learning Links Between a User's Calendar and Information Needs BIBAFull-Text 557-560
  Elena Vildjiounaite; Vesa Kyllönen
Personal information needs depend on long-term interests and on current and future situations (contexts): people are mainly interested in weather forecasts for future destinations, and in toy advertisements when a child's birthday approaches. As computer capabilities for being aware of users' contexts grow, users' willingness to manually set rules for context-based information retrieval will decrease. Thus computers must learn to associate user contexts with information needs in order to collect and present information proactively. This work presents experiments with training an SVM (Support Vector Machine) classifier to learn user information needs from calendar information.
Supporting Relevance Feedback in Video Search BIBAFull-Text 561-564
  Cathal Gurrin; Dag Johansen; Alan F. Smeaton
WWW video search engines have become increasingly commonplace within the last few years, and at the same time video retrieval research has been receiving more attention through the annual TRECVid series of workshops. In this paper we evaluate methods of relevance feedback for video search engines operating over TV news data. We show, for both video shots and TV news stories, that an optimal number of terms can be identified to compose a new query for feedback, and that in most cases the number of documents employed for feedback does not have a great effect on these optimal numbers of terms.
Intrinsic Plagiarism Detection BIBAKFull-Text 565-569
  Sven Meyer zu Eissen; Benno Stein
Current research in the field of automatic plagiarism detection for text documents focuses on algorithms that compare plagiarized documents against potential original documents. Though these approaches perform well in identifying copied or even modified passages, they assume a closed world: a reference collection must be given against which a plagiarized document can be compared.
   This raises the question whether plagiarized passages within a document can be detected automatically if no reference is given, e. g. if the plagiarized passages stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism detection. The paper is devoted to this problem class; it shows that it is possible to identify potentially plagiarized passages by analyzing a single document with respect to variations in writing style.
   Our contributions are fourfold: (i) a taxonomy of plagiarism delicts along with detection methods, (ii) new features for the quantification of style aspects, (iii) a publicly available plagiarism corpus for benchmark comparisons, and (iv) promising results in non-trivial plagiarism detection settings: in our experiments we achieved recall values of 85% with a precision of 75% and better.
Keywords: plagiarism detection; style analysis; classifier; plagiarism corpus
Investigating Biometric Response for Information Retrieval Applications BIBAFull-Text 570-574
  Colum Mooney; Micheál Scully; Gareth J. F. Jones; Alan F. Smeaton
Current information retrieval systems make no measurement of the user's response to the searching process or the information itself. Existing psychological studies show that subjects exhibit measurable physiological responses when carrying out certain tasks, e.g. when viewing images, which generally result in heightened emotional states. We find that users exhibit measurable biometric behaviour in the form of galvanic skin response when watching movies, and engaging in interactive tasks. We examine how this data might be exploited in the indexing of data for search and within the search process itself.
Relevance Feedback Using Weight Propagation BIBAFull-Text 575-578
  Fadi Yamout; Michael Oakes; John Tait
A new Relevance Feedback (RF) technique is developed to improve upon the efficiency and performance of existing techniques. This is based on propagating positive and negative weights from documents judged relevant and not relevant respectively, to other documents, which are deemed similar according to one of a number of criteria. The performance and efficiency improve since the documents are treated as independent vectors rather than being merged into a single vector as is the case with traditional approaches, and only the documents considered in a given neighbourhood are inspected. This is especially important when using large test collections.
Context-Specific Frequencies and Discriminativeness for the Retrieval of Structured Documents BIBAFull-Text 579-582
  Jun Wang; Thomas Roelleke
Structured document retrieval requires the ranking of document elements. Previous approaches either aggregate term weights or retrieval status values, or propose alternatives to idf, for example ief (inverse element frequency). In this paper we propose and investigate a new approach: context-specific idf, which, in contrast to aggregation-based ranking functions, is parameter-free.