
Proceedings of ECIR'05, the 2005 European Conference on Information Retrieval

Fullname: ECIR 2005: Advances in Information Retrieval: 27th European Conference on IR Research
Editors: David E. Losada; Juan M. Fernández-Luna
Location: Santiago de Compostela, Spain
Dates: 2005-Mar-21 to 2005-Mar-23
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 3408
Standard No: DOI: 10.1007/b107096; hcibib: ECIR05; ISBN: 978-3-540-25295-5 (print), 978-3-540-31865-1 (online)
Papers: 53
Pages: 572
Links: Online Proceedings | Conference Home Page
  1. Keynote Papers
  2. Peer-to-Peer
  3. Information Retrieval Models (I)
  4. Text Summarization
  5. Information Retrieval Methods (I)
  6. Information Retrieval Models (II)
  7. Text Classification and Fusion
  8. User Studies and Evaluation
  9. Information Retrieval Methods (II)
  10. Multimedia Retrieval
  11. Web Information Retrieval
  12. Posters

Keynote Papers

A Probabilistic Logic for Information Retrieval BIBAFull-Text 1-6
  C. J. 'Keith' van Rijsbergen
One of the most important models for IR derives from the representation of documents and queries as vectors in a vector space. I will show how logic emerges from the geometry of such a vector space. As a consequence of looking at such a space in terms of states and observables, I will show how an appropriate probability measure can be constructed on this space, which may be the basis for a suitable probabilistic logic for information retrieval.
Applications of Web Query Mining BIBAFull-Text 7-22
  Ricardo Baeza-Yates
Server logs of search engines store traces of queries submitted by users, which include the queries themselves along with the Web pages selected from their answers. The same is true of Web site logs, where queries and subsequent actions are recorded from search engine referrers or from an internal search box. In this paper we present two applications based on analyzing and clustering queries. The first suggests changes to improve the text and structure of a Web site, and the second performs relevance-ranking boosting and query recommendation in search engines.

Peer-to-Peer

BuddyNet: History-Based P2P Search BIBAFull-Text 23-37
  Yilei Shao; Randolph Wang
Peer-to-peer file sharing has become a very popular Internet application. P2P systems such as Gnutella and Kazaa work well when the number of peers is small, but their performance degrades significantly as the number of peers grows. To overcome this scalability problem, numerous research groups have experimented with different approaches. We conduct a novel evaluation study of Kazaa traffic focusing on interest-based locality. Our analysis shows that strong interest-based locality exists in P2P systems and can be exploited to improve performance. Based on our findings, we propose a history-based P2P search algorithm and a topology adaptation mechanism. The resulting system naturally clusters peers with similar interests and greatly improves search efficiency. We test our design through simulations; the results show a significant reduction in total system load and a large speedup in search efficiency compared to random walk and interest-shortcut schemes. In addition, we show that our system is more robust under dynamic conditions.
A Suite of Testbeds for the Realistic Evaluation of Peer-to-Peer Information Retrieval Systems BIBAFull-Text 38-51
  Iraklis A. Klampanos; Victor Poznanski; Joemon M. Jose; Peter Dickman
Peer-to-peer (P2P) networking continues to gain popularity among computing science researchers. The problem of information retrieval (IR) over P2P networks is being addressed by researchers attempting to provide valuable insight as well as solutions for its successful deployment. All published studies have, so far, been evaluated by simulation, using well-known document collections (usually acquired from TREC). Researchers test their systems on partitioned collections whose documents have been distributed in advance over a number of simulated peers. This practice leads to two problems: first, there is little justification for the document distributions used by these studies, and second, since different studies use different experimental testbeds, there is no common ground for comparing the solutions proposed. In this work, we contribute a number of document testbeds for evaluating P2P IR systems. Each has been derived from TREC's WT10g collection and corresponds to a different potential P2P IR application scenario. We analyse each methodology and testbed with respect to the document distributions achieved as well as the location of relevant items within each setting. This work marks the beginning of an effort to provide more realistic evaluation environments for P2P IR systems, as well as to create a common ground for comparing existing and future architectures.
Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks BIBAFull-Text 52-66
  Jie Lu; Jamie Callan
Peer-to-peer architectures are a potentially powerful model for developing large-scale networks of text-based digital libraries, but peer-to-peer networks have so far provided very limited support for text-based federated search of digital libraries using relevance-based ranking. This paper addresses the problems of resource representation, resource ranking and selection, and result merging for federated search of text-based digital libraries in hierarchical peer-to-peer networks. Existing approaches to text-based federated search are adapted and new methods are developed for resource representation and resource selection according to the unique characteristics of hierarchical peer-to-peer networks. Experimental results demonstrate that the proposed approaches offer a better combination of accuracy and efficiency than more common alternatives for federated search in peer-to-peer networks.

Information Retrieval Models (I)

'Beauty' of the World Wide Web -- Cause, Goal, or Principle BIBAFull-Text 67-80
  Sándor Dominich; Júlia Góth; Mária Horváth; Tamás Kiezer
It is known that the degree distribution of the World Wide Web (WWW) obeys a power law whose degree exponent exhibits fairly robust behaviour. The usual method used to construct the power law, linear regression, is not grounded in any intrinsic property of the WWW that the law is assumed to reflect. In the present paper, statistical evidence is given for the conjecture that the Golden Section lies at the heart of this robustness property. Applications of this conjecture are also presented and discussed.
sPLMap: A Probabilistic Approach to Schema Matching BIBAFull-Text 81-95
  Henrik Nottelmann; Umberto Straccia
This paper introduces the first formal framework for learning mappings between heterogeneous schemas that is based on logics and probability theory. This task, also called "schema matching", is a crucial step in integrating heterogeneous collections. As schemas may have different granularities, and as schema attributes do not always match precisely, a general-purpose schema mapping approach requires support for uncertain mappings, and mappings have to be learned automatically. The framework combines different classifiers for finding suitable mapping candidates (together with their weights), and selects the most likely set of mapping rules. Finally, several variants of the framework have been evaluated on two different data sets.
Encoding XML in Vector Spaces BIBAFull-Text 96-111
  Vinay Kakade; Prabhakar Raghavan
We develop a framework for representing XML documents and queries in vector spaces and build indexes for processing text-centric semi-structured queries that support a proximity measure between XML documents. The idea of using vector spaces for XML retrieval is not new. In this paper we (i) unify prior approaches into a single framework; (ii) develop techniques to eliminate the special-purpose auxiliary computations (outside the vector space) used previously; (iii) give experimental evidence on benchmark queries that our approach is competitive in its retrieval quality; and (iv) as an immediate consequence of the framework, are able to classify and cluster XML documents.

Text Summarization

Features Combination for Extracting Gene Functions from MEDLINE BIBAFull-Text 112-126
  Patrick Ruch; Laura Perret; Jacques Savoy
This paper describes and evaluates a summarization system that extracts textual descriptions of gene function (called GeneRIFs) from MedLine records. Inputs for this task include a locus (a gene in the LocusLink database) and a pointer to a MedLine record supporting the GeneRIF. In the suggested approach, we merge two independent phrase extraction strategies. The first strategy (LASt) uses argumentative, positional and structural features to suggest a GeneRIF. The second extraction scheme (LogReg) uses statistical properties to select the most appropriate sentence as the GeneRIF. On the TREC-2003 genomics collection, the basic extraction strategies are already competitive (Dice scores of 52.78% for LASt and 52.28% for LogReg). When the two are combined, the extraction clearly improves, achieving a Dice score of over 55%.
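   For reference, the Dice score used in this evaluation is the standard overlap coefficient between the term set A of the extracted candidate and the term set B of the reference GeneRIF:

      \mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}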
Filtering for Profile-Biased Multi-document Summarization BIBAFull-Text 127-141
  Sana Leila Châar; Olivier Ferret; Christian Fluhr
In this article, we present an information filtering method that selects from a set of documents the excerpts most significant in relation to a user profile. The method relies on both structured profiles and a topical analysis of documents. The topical analysis is also used to expand a profile in relation to a particular document, by selecting the terms of the document that are closely linked to those of the profile. This expansion makes the selection of profile-related excerpts more reliable, and also helps select excerpts that may bring new and interesting information about their topics. The method was implemented in the REDUIT system, which was successfully evaluated for document filtering and passage extraction.
Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms BIBAFull-Text 142-156
  Massih R. Amini; Nicolas Usunier; Patrick Gallinari
This paper investigates a new approach to Single Document Summarization (SDS) based on a machine learning ranking algorithm. The use of machine learning techniques for this task allows one to adapt summaries to user needs and corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches attempt to generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion for training a classifier is not well suited to SDS, and propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use here are either based on the state of the art or built upon word clusters: groups of words that often co-occur with each other, which can serve to expand a query or to enrich the representation of the sentences of a document. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We perform comparisons with several baseline (non-learning) systems and a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, and that the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the dataset; we explain this by the different separability hypotheses the two algorithms make about the data.
Comparing Topiary-Style Approaches to Headline Generation BIBAFull-Text 157-168
  Ruichao Wang; Nicola Stokes; William P. Doran; Eamonn Newman; Joe Carthy; John Dunnion
In this paper we compare a number of Topiary-style headline generation systems. The Topiary system, developed at the University of Maryland with BBN, was the top-performing headline generation system at DUC 2004. Topiary-style headlines consist of a number of general topic labels followed by a compressed version of the lead sentence of a news story. The Topiary system uses a statistical learning approach to find topic labels for headlines, while our approach, the LexTrim system, identifies key summary words by analysing the lexical cohesive structure of a text. The performance of these systems is evaluated using the ROUGE evaluation suite on the DUC 2004 news story collection. The results show that a baseline system that identifies topic descriptors for headlines using term frequency counts outperforms both the LexTrim and Topiary systems. A manual evaluation of the headlines confirms this result.

Information Retrieval Methods (I)

Improving Retrieval Effectiveness by Using Key Terms in Top Retrieved Documents BIBAFull-Text 169-184
  Yang Lingpeng; Ji Donghong; Zhou Guodong; Nie Yu
In this paper, we propose a method to improve the precision of the top retrieved documents in Chinese information retrieval, where the query is a short description, by re-ordering the documents returned by the initial retrieval. To re-order the documents, we first identify terms in the query and their importance using information derived from the top N (N≤30) documents of the initial retrieval; we then re-order the top K (N<<K) retrieved documents according to which of these query terms they contain. That is, we automatically extract key terms from the top N retrieved documents, collect those key terms that occur in the query together with their document frequencies in the top N documents, and finally use the collected terms to re-order the initially retrieved documents. Each collected term is assigned a weight based on its length and its document frequency in the top N retrieved documents, and each document is re-ranked by the sum of the weights of the collected terms it contains. In our experiments on 42 query topics from the NTCIR3 Cross-Lingual Information Retrieval (CLIR) dataset, average improvements of 17.8%-27.5% for the top 10 documents and 6.6%-26.9% for the top 100 documents are achieved under relaxed/rigid relevance judgments and different parameter settings.
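   A minimal sketch of the re-ordering step as described above; the exact weighting function is not given in the abstract, so the product of term length and top-N document frequency used below is an assumption for illustration:

     from collections import Counter

     def rerank(initial_docs, query_terms, top_n=30):
         """Re-order an initial ranking (list of token lists, best first)
         by key terms that occur both in the query and in the top-N docs."""
         top = initial_docs[:top_n]
         # document frequency of each matching query term in the top-N docs
         df = Counter(t for d in top for t in set(d) if t in query_terms)
         # assumed weight: term length times its top-N document frequency
         weight = {t: len(t) * df[t] for t in df}

         def score(doc):
             return sum(weight.get(t, 0) for t in set(doc))

         return sorted(initial_docs, key=score, reverse=True)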
Evaluating Relevance Feedback Algorithms for Searching on Small Displays BIBAFull-Text 185-199
  Vishwa Vinay; Ingemar J. Cox; Natasa Milic-Frayling; Ken Wood
Searching online information resources using mobile devices is hampered by small displays on which only a fraction of the ranked documents can be shown. In this paper, we ask whether the search effort can be reduced, on average, by user feedback indicating the single most relevant document in each display. For small display sizes and limited user actions, we are able to construct a tree representing all possible outcomes. Examining the tree permits us to compute an upper limit on relevance feedback performance. Three standard feedback algorithms are considered: Rocchio, Robertson/Sparck-Jones and a Bayesian algorithm. Two display strategies are considered, one based on maximizing the immediate information gain and the other on displaying the most likely documents. Our results bring out the strengths and weaknesses of the algorithms, and the need for exploratory display strategies with conservative feedback algorithms.
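   Of the three feedback algorithms compared, Rocchio is the most compact to state; a minimal sketch over tf-idf vectors (the parameter values below are common defaults, not necessarily the paper's settings):

     import numpy as np

     def rocchio(query, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
         """Classic Rocchio update: move the query vector toward the
         centroid of relevant docs and away from non-relevant ones."""
         q = alpha * query
         if len(rel):
             q = q + beta * np.mean(rel, axis=0)
         if len(nonrel):
             q = q - gamma * np.mean(nonrel, axis=0)
         return q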
Term Frequency Normalisation Tuning for BM25 and DFR Models BIBAFull-Text 200-214
  Ben He; Iadh Ounis
Tuning the term frequency normalisation parameter is a crucial issue in information retrieval (IR), with an important impact on retrieval performance. The classical pivoted normalisation approach suffers from a collection-dependence problem: it requires relevance assessments for each given collection to obtain the optimal parameter setting. In this paper, we tackle the collection-dependence problem by proposing a new tuning method based on measuring the normalisation effect. The proposed method refines and extends our methodology described in [7]. In our experiments, we evaluate the proposed tuning method on various TREC collections, for both normalisation 2 of the Divergence From Randomness (DFR) models and BM25's normalisation method. Results show that for both normalisation methods, our tuning method significantly outperforms robust, empirically-obtained baselines over diverse TREC collections, at a marginal computational cost.
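   For concreteness, the two normalisations being tuned take the following standard forms, where tf is the raw term frequency, dl the document length and avg_dl the average document length; c and b are the tunable parameters:

     import math

     def dfr_normalisation_2(tf, dl, avg_dl, c=1.0):
         """DFR normalisation 2: tf is inflated for short documents and
         deflated for long ones; c controls the strength of the effect."""
         return tf * math.log2(1.0 + c * avg_dl / dl)

     def bm25_tf(tf, dl, avg_dl, k1=1.2, b=0.75):
         """BM25 term-frequency component; b controls the strength of
         the document length normalisation."""
         return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * dl / avg_dl))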

Information Retrieval Models (II)

Improving the Context-Based Influence Diagram Model for Structured Document Retrieval: Removing Topological Restrictions and Adding New Evaluation Methods BIBAFull-Text 215-229
  Luis M. de Campos; Juan M. Fernández-Luna; Juan F. Huete
In this paper we present the theoretical developments necessary to extend the existing Context-based Influence Diagram model for Structured Documents (CID model), in order to improve its retrieval performance and expressiveness. First, we make the model more flexible and general by removing the original restrictions on the types of structured documents it can represent. This extension requires the design of a new algorithm to compute the posterior probabilities of relevance. A further contribution relates to the evaluation of the influence diagram: the computation of expected utilities in the original CID model was approximated by applying an independence criterion. We present another approximation that does not assume independence, as well as an exact evaluation method.
Knowing-Aboutness: Question-Answering Using a Logic-Based Framework BIBAFull-Text 230-244
  Terence Clifton; William Teahan
We describe the background and motivation for a logic-based framework, based on the theory of "Knowing-Aboutness", and its application to Question Answering. We present the salient features of our system, and outline the benefits of our framework in terms of a more integrated architecture that is more easily evaluated. Favourable results from the TREC 2004 Question-Answering evaluation are presented.
Modified LSI Model for Efficient Search by Metric Access Methods BIBAFull-Text 245-259
  Tomáš Skopal; Pavel Moravec
Text collections represented in the LSI model are hard to search efficiently (i.e. quickly), since no indexing method exists for the LSI matrices. The inverted file, often used in both the boolean and the classic vector model, cannot be effectively utilized because query vectors in the LSI model are dense. A possible route to efficient search in LSI matrices is the use of metric access methods (MAMs). Instead of the cosine measure, MAMs can utilize the deviation metric as an equivalent dissimilarity measure for query processing. However, the intrinsic dimensionality of collections represented by LSI matrices is often large, which degrades MAMs' search performance. In this paper we introduce σ-LSI, a modification of LSI in which we artificially decrease the intrinsic dimensionality of the LSI matrices. This is achieved by adjusting the singular values produced by the SVD. We show that suitable adjustments can dramatically improve search efficiency with MAMs, while precision/recall values are preserved or degrade only slightly.
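   The abstract does not specify the exact adjustment; a minimal sketch of the general idea, assuming a simple power rescaling of the singular values (the exponent is a hypothetical parameter, not the paper's scheme):

     import numpy as np

     def sigma_lsi(term_doc, k=100, power=2.0):
         """Rank-k LSI with adjusted singular values: raising them to a
         power > 1 sharpens their decay, lowering the intrinsic
         dimensionality of the projected document vectors."""
         U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
         s_adj = s[:k] ** power        # hypothetical adjustment
         s_adj *= s[0] / s_adj[0]      # rescale so the top value is kept
         return (np.diag(s_adj) @ Vt[:k]).T   # one document vector per row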
PIRE: An Extensible IR Engine Based on Probabilistic Datalog BIBAFull-Text 260-274
  Henrik Nottelmann
This paper introduces PIRE, a probabilistic IR engine. For both document indexing and retrieval, PIRE makes heavy use of probabilistic Datalog, a probabilistic extension of predicate Horn logic. Such a logical framework, combined with probability theory, allows for defining and using data types (e.g. text, names, numbers), different weighting schemes (e.g. normalised tf, tf.idf or BM25) and retrieval functions (e.g. uncertain inference, language models). Extending the system is thus reduced to adding new rules. Furthermore, this logical framework provides a powerful tool for including additional background knowledge in the retrieval process.

Text Classification and Fusion

Data Fusion with Correlation Weights BIBAFull-Text 275-286
  Shengli Wu; Sally McClean
This paper focuses on the effect of correlation on data fusion over multiple retrieval results. If some of the retrieval results involved in data fusion correlate more strongly than the others, their common opinion will dominate the voting process. This may degrade the effectiveness of data fusion in many cases, especially when very good results are in the minority. To address this problem, we assign each result a weight derived from its correlation coefficient with the other results, and then use the linear combination method for data fusion. An evaluation of the effectiveness of the proposed method on TREC 5 (ad hoc track) results is reported. Furthermore, we explore the relationship between result correlation and data fusion through experiments, and demonstrate that such a relationship does exist.
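   A minimal sketch of the scheme as described; the exact mapping from correlation to weight is an assumption here (lower average correlation with the other results yields a higher weight):

     import numpy as np

     def correlation_weighted_fusion(scores):
         """Linear combination of retrieval results, down-weighting
         results that correlate strongly with the others.
         scores: (n_results, n_docs) array of normalised scores."""
         R = np.corrcoef(scores)                  # pairwise correlations
         n = R.shape[0]
         mean_corr = (R.sum(axis=1) - 1.0) / (n - 1)
         weights = np.clip(1.0 - mean_corr, 0.0, None)   # assumed mapping
         return weights @ scores                  # fused document scores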
Using Restrictive Classification and Meta Classification for Junk Elimination BIBAFull-Text 287-299
  Stefan Siersdorfer; Gerhard Weikum
This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to the topic categories (classes) we are interested in. Such documents typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g. classification of web or intranet content, focused crawling, etc.) they occur quite often, and a classifier has to make a decision about them. We tackle this problem using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assign them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a much smaller fraction of potentially interesting documents.
On Compression-Based Text Classification BIBAFull-Text 300-314
  Yuval Marton; Ning Wu; Lisa Hellerstein
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among them are experiments specifically designed to test whether the ability to capture non-word (including super-word) features is what allows character-based text compression methods to achieve more accurate classification.

User Studies and Evaluation

Ontology as a Search-Tool: A Study of Real Users' Query Formulation With and Without Conceptual Support BIBAFull-Text 315-329
  Sari Suomela; Jaana Kekäläinen
This study examines 16 real users' use of an ontology as a search tool. Queries the users constructed with the help of a Concept-based Information Retrieval Interface (CIRI) were compared to queries they created independently from the same search task description, and the effectiveness of the CIRI queries was compared to that of the users' unaided queries. The simulated search task method was used to make the search situations as close to real as possible. Owing to CIRI's query expansion feature, the number of search terms was markedly higher in ontology queries than in Direct interface queries. The search results were evaluated with generalised precision and generalised relative recall, as well as precision based on personal assessments. The Direct interface queries performed better under all methods of comparison.
An Analysis of Query Similarity in Collaborative Web Search BIBAFull-Text 330-344
  Evelyn Balfe; Barry Smyth
Web search logs provide an invaluable source of information about the search behaviour of users. This information can be reused to aid future searches, especially when the logs contain the search histories of specific communities of users. To date, this information is rarely exploited, as most Web search techniques continue to rely on more traditional term-based IR approaches. In contrast, the I-SPY system attempts to reuse past search behaviour to re-rank result lists according to the implied preferences of like-minded communities of users. It relies on the ability to recognise previous search sessions related to the current target search by looking for similarities between past and current queries. We have previously shown how a simple model of query similarity can significantly improve search performance through this reuse approach. In this paper we build on that work by evaluating alternative query similarity models.
A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation BIBAFull-Text 345-359
  Cyril Goutte; Eric Gaussier
We address the problems of (1) assessing the confidence of the standard point estimates precision, recall and F-score, and (2) comparing results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicators, rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as well as the standard situation where competing results are obtained on the same data.
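   A minimal sketch of the kind of probabilistic treatment described: under uniform (Beta) priors, precision with TP true positives and FP false positives has a Beta(TP+1, FP+1) posterior, analogously for recall, and posteriors for derived quantities such as the F-score can be sampled. The authors' exact priors and model may differ:

     import numpy as np

     rng = np.random.default_rng(0)

     def prf_posterior(tp, fp, fn, n=100_000):
         """Posterior draws of precision, recall and F1 from the counts,
         using independent Beta posteriors under uniform priors."""
         p = rng.beta(tp + 1, fp + 1, n)   # posterior over precision
         r = rng.beta(tp + 1, fn + 1, n)   # posterior over recall
         return p, r, 2 * p * r / (p + r)

     p, r, f = prf_posterior(tp=90, fp=10, fn=30)
     print(np.median(f), np.quantile(f, [0.025, 0.975]))  # F1 interval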
Exploring Cost-Effective Approaches to Human Evaluation of Search Engine Relevance BIBAFull-Text 360-374
  Kamal Ali; Chi-Chao Chang; Yunfang Juan
In this paper, we examine novel and less expensive methods for search engine evaluation that do not rely on document relevance judgments. These methods, described within a proposed framework, are motivated by the increasing focus on search result presentation, by the growing diversity of documents and content sources, and by the need to measure effectiveness relative to other search engines. Correlation analysis of data obtained from actual tests using a subset of the methods in the framework suggests that these methods measure different aspects of a search engine. In practice, we argue that the selection of the test method is a tradeoff between measurement intent and cost.

Information Retrieval Methods (II)

Document Identifier Reassignment Through Dimensionality Reduction BIBAKFull-Text 375-387
  Roi Blanco; Álvaro Barreiro
Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent work demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as this lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total number of bits per document pointer. However, the approaches developed so far require great amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem that reduces the dimensionality of the input data using an SVD transformation. We tested this approach with the Greedy-NN TSP algorithm and a more efficient variant based on dividing the original problem into sub-problems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on the tradeoff between dimensionality reduction and compression, and on time performance.
Keywords: Document identifier reassignment; SVD; indexing; compression
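   A minimal sketch of the overall idea, assuming Euclidean distances over the SVD-reduced document vectors; the paper's similarity measure and its splitting into sub-problems are not reproduced here:

     import numpy as np

     def reassign_ids(doc_term, k=50):
         """Project documents to k dimensions via SVD, then order them
         with Greedy-NN TSP so similar documents get nearby identifiers."""
         U, s, _ = np.linalg.svd(doc_term, full_matrices=False)
         docs = U[:, :k] * s[:k]               # reduced document vectors
         remaining = set(range(1, docs.shape[0]))
         order = [0]
         while remaining:
             cur = docs[order[-1]]
             nxt = min(remaining, key=lambda j: np.linalg.norm(docs[j] - cur))
             order.append(nxt)
             remaining.remove(nxt)
         return order    # document order[i] receives new identifier i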
Scalability Influence on Retrieval Models: An Experimental Methodology BIBAFull-Text 388-402
  Amélie Imafouo; Michel Beigbeder
Few works in Information Retrieval (IR) have tackled the questions of Information Retrieval System (IRS) effectiveness and efficiency in the context of growing corpus size.
   We propose a general experimental methodology for studying the influence of scalability on IR models. The methodology is based on the construction of a collection in which a given characteristic C is the same whatever portion of the collection is selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.
   We apply our methodology to WT10g (the TREC9 collection), taking the characteristic C to be the distribution of relevant documents over the collection. We build a uniform WT10g, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of growing corpus size on standard IRS evaluation measures (recall/precision, high precision).
The Role of Multi-word Units in Interactive Information Retrieval BIBAFull-Text 403-420
  Olga Vechtomova
The paper presents several techniques for selecting noun phrases for interactive query expansion following pseudo-relevance feedback, and a new phrase search method. A combined syntactico-statistical method was used for the selection of phrases: first, noun phrases were selected using a part-of-speech tagger and a noun-phrase chunker; then, different statistical measures were applied to select phrases for query expansion. Experiments were also conducted on the effectiveness of noun phrases in document ranking. We analyse the problems of phrase weighting and suggest new ways of addressing them. A new method of phrase matching and weighting was developed, which specifically addresses the problem of weighting overlapping and non-contiguous word sequences in documents.
Dictionary-Based CLIR Loses Highly Relevant Documents BIBAFull-Text 421-432
  Raija Lehtokangas; Heikki Keskustalo; Kalervo Järvelin
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best-match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Second, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using both structured and unstructured queries. The effectiveness of the translated queries was compared to that of the monolingual queries. CLIR performance was evaluated using three relevance criteria: stringent, regular, and liberal. When the regular or liberal criteria were used, reasonable performance was achieved. Adopting the stringent criteria caused a considerable loss of performance compared to monolingual Finnish retrieval.

Multimedia Retrieval

Football Video Segmentation Based on Video Production Strategy BIBAFull-Text 433-446
  Reede Ren; Joemon M. Jose
We present a statistical approach for parsing the structure of football video. Based on video production conventions, a new generic structure called the 'attack' is identified, the equivalent of a scene in other video domains. We define four video segments that construct it, namely play, focus, replay and break. Two mid-level visual features, play-field ratio and zoom size, are also computed. The detection process uses a two-pass classifier, a combination of a Gaussian Mixture Model and Hidden Markov Models. A general suffix tree is introduced to identify and organize 'attacks'. In experiments, a video structure classification accuracy of about 86% is achieved on broadcast World Cup 2002 video data.
Fractional Distance Measures for Content-Based Image Retrieval BIBAFull-Text 447-456
  Peter Howarth; Stefan Rüger
We have applied the concept of fractional distance measures, proposed by Aggarwal et al. [1], to content-based image retrieval. Our experiments show that the retrieval performance of these measures consistently exceeds that of the more usual Manhattan and Euclidean distance metrics when used with a wide range of high-dimensional visual features. We used the parameters learnt from a Corel dataset on a variety of different collections, including the TRECVID 2003 and ImageCLEF 2004 datasets. We found that the specific optimum parameters varied, but the general performance increase was consistent across all three collections. To squeeze the last bit of performance out of a system, it would be necessary to train a distance measure for a specific collection; however, a fractional distance measure with parameter p = 0.5 will consistently outperform both the L1 and L2 norms.
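   The fractional distance in question is the Minkowski form with exponent p < 1; a minimal sketch with the p = 0.5 setting the authors recommend:

     import numpy as np

     def fractional_distance(x, y, p=0.5):
         """Minkowski-style distance with fractional exponent p < 1; not a
         true metric (the triangle inequality fails), but it discriminates
         better between near and far points in high-dimensional spaces."""
         return np.sum(np.abs(x - y) ** p) ** (1.0 / p)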
Combining Visual Semantics and Texture Characterizations for Precision-Oriented Automatic Image Retrieval BIBAFull-Text 457-474
  Mohammed Belkhatir
The growing need for 'intelligent' image retrieval systems leads to new architectures that combine visual semantics and signal features, relying on highly expressive frameworks while providing fully-automated indexing and retrieval. Indeed, integrating the two main approaches in the image indexing and retrieval literature (the signal and semantic approaches) is a viable way to achieve significant retrieval quality. This paper presents a multi-facetted framework featuring visual semantics and signal texture descriptions for automatic image retrieval. It relies on an expressive representation formalism handling high-level image descriptions, and on a full-text query framework, in an attempt to carry image indexing and retrieval beyond trivial low-level processes and loosely-coupled state-of-the-art systems. At the experimental level, we evaluate the retrieval performance of our system through recall and precision indicators on a test collection of 2500 photographs used in several world-class publications.

Web Information Retrieval

Applying Associative Relationship on the Clickthrough Data to Improve Web Search BIBAFull-Text 475-486
  Xue-Mei Jiang; Wen-Guan Song; Hua-Jun Zeng
The performance of web search engines often deteriorates due to the diversity of and noise within web pages. Several methods have proposed using clickthrough data to obtain more accurate information about web pages and thus improve search performance; however, sparseness is the great challenge in exploiting clickthrough data. In this paper, we propose a novel algorithm to exploit user clickthrough data. It first explores the relationship between queries and web pages to mine co-visiting patterns as associative relationships among Web pages, and then uses a Spreading Activation mechanism to re-rank the Web search results. Our approach alleviates the sparseness problem, and experimental results on a large set of MSN clickthrough log data show a significant improvement in search performance over the DirectHit algorithm as well as the baseline search engine.
Factors Affecting Web Page Similarity BIBAFull-Text 487-501
  Anastasios Tombros; Zeeshan Ali
Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content- and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect inter-page similarity in different ways. We identify a number of factors that, when taken into account, can yield effective measures of similarity between web pages. Query information in particular proved to be important for the effective organisation of web pages.
Boosting Web Retrieval Through Query Operations BIBAFull-Text 502-516
  Gilad Mishne; Maarten de Rijke
We explore the use of phrase and proximity terms in the context of web retrieval, which is different from traditional ad-hoc retrieval both in document structure and in query characteristics. We show that for this type of task, the usage of both phrase and proximity terms is highly beneficial for early precision as well as for overall retrieval effectiveness. We also analyze why phrase and proximity terms are far more effective for web retrieval than for ad-hoc retrieval.

Posters

Terrier Information Retrieval Platform BIBAFull-Text 517-519
  Iadh Ounis; Gianni Amati; Vassilis Plachouras; Ben He; Craig Macdonald; Douglas Johnson
Terrier is a modular platform for the rapid development of large-scale Information Retrieval (IR) applications. It can index various document collections, including TREC and Web collections. Terrier also offers a range of document weighting and query expansion models, based on the Divergence From Randomness framework. It has been successfully used for ad-hoc retrieval, cross-language retrieval, Web IR and intranet search, in a centralised or distributed setting.
Físréal: A Low Cost Terabyte Search Engine BIBAFull-Text 520-522
  Paul Ferguson; Cathal Gurrin; Peter Wilkins; Alan F. Smeaton
In this poster we describe the development of a distributed search engine, referred to as Físréal, which utilises inexpensive workstations, yet attains fast retrieval performance for Terabyte-sized collections. We also discuss the process of leveraging additional meaning from the structure of HTML, as well as the use of anchor text documents to increase retrieval performance.
Query Formulation for Answer Projection BIBAFull-Text 523-526
  Gilad Mishne; Maarten de Rijke
We examine the effects of various query modifications on the problem of answer projection -- the task of retrieving documents that support a given answer to a question. We compare different techniques such as phrase searches and term weighting, and show that some models achieve significant improvements over unmodified queries.
Network Analysis for Distributed Information Retrieval Architectures BIBAFull-Text 527-529
  Fidel Cacheda; Victor Carneiro; Vassilis Plachouras; Iadh Ounis
In this study, we present an analysis of the interconnection network of a distributed Information Retrieval (IR) system, simulating a switched network versus a shared-access network. The results show that a switched network improves performance, especially in a replicated system, because it prevents network saturation, particularly when a large number of query servers is used.
SnapToTell: A Singapore Image Test Bed for Ubiquitous Information Access from Camera BIBAFull-Text 530-532
  Jean-Pierre Chevallet; Joo-Hwee Lim; Ramnath Vasudha
With the proliferation of camera phones, many novel applications and services are emerging. In this paper, we present the SnapToTell system, which provides an information directory service to tourists based on pictures taken with their camera phones and on location information. We also present experimental results on scene recognition over a realistic data set of scenes and locations in Singapore, which forms a new, application-oriented image test bed that is freely available.
Acquisition of Translation Knowledge of Syntactically Ambiguous Named Entity BIBAFull-Text 533-535
  Takeshi Kutsumi; Takehiko Yoshimi; Katsunori Kotani; Ichiko Sata; Hitoshi Isahara
Bilingual dictionaries are essential components of cross-lingual information retrieval applications. The automatic acquisition of proper names and their translations from bilingual corpora is especially important, because a significant portion of the entries not listed in the dictionaries would be proper names.
IR and OLAP in XML Document Warehouses BIBAFull-Text 536-539
  Juan M. Pérez; Torben Bach Pedersen; Rafael Berlanga; María J. Aramburu
In this paper we propose to combine IR and OLAP (On-Line Analytical Processing) technologies to exploit a warehouse of text-rich XML documents. In the system we plan to develop, a multidimensional implementation of a relevance-modelling document model will be used to query the warehouse interactively, allowing navigation in the structure of documents and in a concept hierarchy of query terms. The facts described in the relevant documents will be ranked and analyzed in a novel OLAP cube model able to represent and manage facts with relevance indexes.
Manipulating the Relevance Models of Existing Search Engines BIBAFull-Text 540-542
  Oisín Boydell; Cathal Gurrin; Alan F. Smeaton; Barry Smyth
Collaborative search refers to using the search behaviour of communities of users to influence the ranking of search results. In this poster we describe how this technique, as instantiated in the I-SPY meta-search engine, can be used as a general mechanism for implementing a different relevance feedback strategy. We evaluate a relevance feedback strategy based on anchor text and query similarity using the TREC 2004 Terabyte track document collection.
Enhancing Web Search Result Lists Using Interaction Histories BIBAFull-Text 543-545
  Maurice Coyle; Barry Smyth
As a method of information retrieval (IR) on the Web, search engines have become the tool of choice for most online users. However, despite the variety of next-generation approaches to Web search seen recently (e.g. [1,2]), the problems of information overload, vague user queries and spam still mean that many search sessions end in user frustration. Search engines are generally criticised for returning result lists with low precision, where the user's information need is not satisfied by any of the returned result pages.
An Evaluation of Gisting in Mobile Search BIBAFull-Text 546-548
  Karen Church; Mark T. Keane; Barry Smyth
Mobile devices suffer from limited screen real estate and restricted text input capabilities. In the recent past these limitations have greatly affected the usability of many mobile Internet applications [1], largely because little effort has typically been made to account for the special features of the mobile Internet. These limitations are especially problematic for mobile search engines: they restrict the number of results that can be displayed per screen and affect the type of queries that are likely to be provided. Nevertheless, most attempts to provide mobile search engines have involved only simplistic adaptations of standard search interfaces; for example, fewer results are returned per page and the 'snippet' text associated with each result may be truncated [2]. We believe that more fundamental adaptations are necessary if search technology is to succeed in the mobile space. In this paper we focus on the snippet text issue and argue that providing paragraphs of descriptive text alongside each result is a luxury that does not make sense given mobile device limitations. We describe how the I-SPY system [3] can track and record past queries that have resulted in the selection of a given result page, and we argue that these related queries can be used to help users understand the context of a search result in place of more verbose snippet text.
Video Shot Classification Using Lexical Context BIBAFull-Text 549-551
  Stéphane Ayache; Georges Quénot; Mbarek Charhad
Associating concepts with video segments is essential for content-based video retrieval. We present a semantic classifier working from text transcriptions produced by automatic speech recognition (ASR). The system is based on a Bayesian classifier and is fully linked to a knowledge base containing an ontology and named entities from several domains. The system is trained on a set of positive and negative examples for each indexed concept. It has been evaluated using the TREC VIDEO protocol and conditions for the detection of visual concepts. Three versions are compared: a baseline using only words as units, a second additionally using named entities, and a last one enriched with semantic-class information.
Age Dependent Document Priors in Link Structure Analysis BIBAFull-Text 552-554
  Claudia Hauff; Leif Azzopardi
Much research has investigated how links between web pages can be exploited in an Information Retrieval setting [1,4]. In this poster, we investigate the application of the Barabási-Albert model to link structure analysis on a collection of web documents within the language modeling framework. Our model uses the web structure, described as a scale-free network, to derive a document prior based on a web document's age and linkage. Preliminary experiments indicate the utility of our approach over other current link structure algorithms and warrant further research.
Improving Image Representation with Relevance Judgements from the Searchers BIBAKFull-Text 555-557
  Liudmila V. Boldareva
In visual information retrieval, a semantic gap exists due to the poor match between the machine-understood content of an information object and the user-perceived one. This mismatch of perception results in difficulties for the user in formulating the query, and consequently in the retrieval system's inability to produce satisfactory answers. Adding the searcher's relevance judgements on (intermediate) search results is known to improve retrieval. With relevance feedback, the system learns the user's information need through interaction.
Keywords: Image retrieval; relevance feedback; long-term learning
Temporal Shot Clustering Analysis for Video Concept Detection BIBAFull-Text 558-560
  Dayong Ding; Le Chen; Bo Zhang
The phenomenon that conceptually related shots appear together in videos is called temporal shot clustering. This phenomenon is a useful cue for video concept detection, which is one of the basic steps in content-based video indexing and retrieval. We propose a method, called temporal shot clustering analysis, to improve the results of video concept detection by exploiting this phenomenon. Two other methods are compared with temporal shot clustering analysis on the TRECVID 2003 dataset. Experiments showed that temporal shot clustering is of great benefit for video concept detection, and that the temporal shot clustering method outperforms the other methods.
IRMAN: Software Framework for IR in Mobile Social Cyberspaces BIBAFull-Text 561-563
  Zia Syed; Fiona Walsh
With the increasing popularity of blogs (online journals) as a medium for expressing personal thoughts and advice, and with users becoming more mobile, we foresee an opportunity for such opinionated content to be utilised as an information source in the mobile arena. In this short paper, we present IRMAN (Information Retrieval in Mobile Adhoc Networks), a software framework for Peer-to-Peer (P2P) IR over Mobile AdHoc Networks (MANET). A Java-based prototype system has been developed on top of this framework for creating, retrieving, and sharing user blogs on handhelds in mobile social cyberspaces.
Assigning Geographical Scopes To Web Pages BIBAFull-Text 564-567
  Bruno Martins; Marcirio Chaves; Mário J. Silva
Finding automatic ways of attaching geographical scopes to on-line resources, also called "geo-referencing" documents, is a challenging problem that is getting increasing attention [1,5,3]. Here we present a system architecture and a process for identifying the geographical scope of Web pages, defining a scope as the region where more people than average would find that page relevant. We rely on typical Web IR heuristics (e.g. feature weighting, hypertext topic locality, anchor descriptions) and on assumptions about how people use geographical references in documents. The method involves three major steps. First, geographical named entities are identified in the text. Next, the found named entities are propagated through the Web linkage graph. Finally, a geographical ontology is used to disambiguate among the named entities associated with a document, thereby selecting the most likely scope. In the future, we plan to use scopes in new location-aware search tools.
AP-Based Borda Voting Method for Feature Extraction in TRECVID-2004 BIBAFull-Text 568-570
  Le Chen; Dayong Ding; Dong Wang; Fuzong Lin; Bo Zhang
We present a novel fusion method for rankings: the AP-based Borda voting method (APBB). Owing to its adaptive weighting scheme, APBB outperforms many traditional methods. Comparative experiments carried out on TRECVID 2004 data show the robustness and effectiveness of this method.
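   A minimal sketch of AP-weighted Borda fusion; using each system's average precision directly as its vote weight is an assumption, since the abstract does not give the exact APBB weighting:

     from collections import defaultdict

     def ap_weighted_borda(rankings, ap_scores):
         """Borda-count fusion of ranked lists, with each list's votes
         weighted by its system's average precision."""
         votes = defaultdict(float)
         for ranking, ap in zip(rankings, ap_scores):
             n = len(ranking)
             for pos, item in enumerate(ranking):
                 votes[item] += ap * (n - pos)   # AP-weighted Borda points
         return sorted(votes, key=votes.get, reverse=True)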