
CLEF 2010: International Conference of the Cross-Language Evaluation Forum

Fullname: CLEF 2010: Multilingual and Multimodal Information Access Evaluation: International Conference of the Cross-Language Evaluation Forum
Editors: Maristella Agosti; Nicola Ferro; Carol Peters; Maarten de Rijke; Alan Smeaton
Location: Padua, Italy
Dates: 2010-Sep-20 to 2010-Sep-23
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 6360
Standard No: DOI: 10.1007/978-3-642-15998-5; ISBN: 978-3-642-15997-8 (print), 978-3-642-15998-5 (online); hcibib: CLEF10
Papers: 16
Pages: 144
Links: Online Proceedings | DBLP Contents | Online Working Notes
  1. Keynote Addresses
  2. Resources, Tools, and Methods
  3. Experimental Collections and Datasets (1)
  4. Experimental Collections and Datasets (2)
  5. Evaluation Methodologies and Metrics (1)
  6. Evaluation Methodologies and Metrics (2)
  7. Panels

Keynote Addresses

IR between Science and Engineering, and the Role of Experimentation, p. 1
  Norbert Fuhr
Evaluation has always played a major role in IR research, as a means of judging the quality of competing models. Lately, however, we have seen an over-emphasis on experimental results, favoring engineering approaches aimed at tuning performance while neglecting other scientific criteria. A recent study investigated the validity of experimental results published at major conferences, showing that for 95% of the papers using standard test collections, the claimed improvements were only relative, and the resulting quality was inferior to that of the top performing systems [AMWZ09].
   In this talk, it is claimed that IR is still in its scientific infancy. Despite the extensive efforts in evaluation initiatives, the scientific insights gained are still very limited -- partly due to shortcomings in the design of the testbeds. From a general scientific standpoint, using test collections solely for evaluation is a waste of resources. Instead, experimentation should be used for hypothesis generation and testing in general, in order to accumulate a better understanding of the retrieval process and to develop a broader theoretical foundation for the field.
Retrieval Evaluation in Practice, p. 2
  Ricardo A. Baeza-Yates
Nowadays, most research on retrieval evaluation is about comparing different systems to determine which is the best one, using a standard document collection and a set of queries with relevance judgements, such as TREC. Retrieval quality baselines are usually also standard, such as BM25. However, in an industrial setting, reality is much harder. First, real Web collections are much larger -- billions of documents -- and the number of relevant answers for most queries can run into the millions. Second, the baseline is the competition, so you cannot use a weak baseline. Third, good average quality is not enough if, for example, a significant fraction of the answers have quality well below average. On the other hand, search engines have hundreds of millions of users and hence click-through data can and should be used for evaluation.
   In this invited talk we explore important problems that arise in practice. Some of them are: Which queries are already well answered and which are the difficult ones? Which queries, and how many answers per query, should be judged by editors? How can we use clicks for retrieval evaluation? Which retrieval measures should we use? What is the impact of culture, geography or language on these questions?
   None of these questions is trivial, and they depend on each other, so we only give partial solutions. Hence, the main message to take away is that more research in retrieval evaluation is certainly needed.

Resources, Tools, and Methods

A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages, pp. 3-14
  Aki Loponen; Kalervo Järvelin
We present StaLe, a dictionary- and corpus-independent statistical lemmatizer that addresses the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word form. StaLe can be applied with little effort to languages lacking linguistic resources. We show the performance of StaLe both in lemmatization tasks alone and as a component in an IR system, using several datasets and query types in four high-resource languages. StaLe is competitive, reaching 88-108% of the gold-standard performance of a commercial lemmatizer in IR experiments. Despite this competitive performance, it is compact, efficient and fast to apply to new languages.
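   A minimal sketch of the candidate-lemma idea, assuming suffix-substitution rules with confidence values; the rule table and function names below are hypothetical illustrations, not the authors' implementation:

      # Toy suffix rules; StaLe learns its rules from training data rather than hard-coding them.
      RULES = [
          # (inflected suffix, lemma suffix, confidence)
          ("ies", "y", 0.9),   # e.g. "studies" -> "study"
          ("s",   "",  0.8),   # e.g. "cats"    -> "cat"
          ("es",  "",  0.6),   # e.g. "boxes"   -> "box"
          ("",    "",  0.3),   # fallback: keep the surface form as its own lemma
      ]

      def candidate_lemmas(word, rules=RULES, min_stem=3):
          """Generate (lemma, confidence) candidates for a possibly unseen word form."""
          candidates = {}
          for suffix, replacement, conf in rules:
              if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                  lemma = word[:len(word) - len(suffix)] + replacement
                  candidates[lemma] = max(candidates.get(lemma, 0.0), conf)
          return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

      print(candidate_lemmas("studies"))
      # [('study', 0.9), ('studie', 0.8), ('studi', 0.6), ('studies', 0.3)]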
A New Approach for Cross-Language Plagiarism Analysis, pp. 15-26
  Rafael Corezola Pereira; Viviane Pereira Moreira; Renata Galante
This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate the method, we created a corpus containing artificial plagiarism offenses. Two experiments were conducted: the first considers only monolingual plagiarism cases, while the second considers only cross-language cases. The results show that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the length of the plagiarized text affects the overall performance of the method; this analysis showed that the method achieves better results with medium and large plagiarized passages.
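   The five phases are only named in the abstract; the toy sketch below mirrors that control flow with trivial stand-ins (lower-casing as the language normalization, n-gram Jaccard overlap instead of a trained classifier), so it illustrates the pipeline shape rather than the authors' system:

      def to_pivot(text):
          return text.lower()                  # stand-in for phase 1: language normalization

      def shingles(text, n=5):
          words = text.split()
          return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

      def jaccard(a, b):
          return len(a & b) / len(a | b) if a | b else 0.0

      def analyse(suspicious, sources, top_k=3, threshold=0.15):
          susp = shingles(to_pivot(suspicious))
          # phase 2: retrieval of candidate source documents
          ranked = sorted(sources, key=lambda s: jaccard(susp, shingles(to_pivot(s))), reverse=True)
          # phases 3-4: passage-level analysis (a real system trains a classifier here)
          detections = [(suspicious, s) for s in ranked[:top_k]
                        if jaccard(susp, shingles(to_pivot(s))) >= threshold]
          # phase 5: post-processing would merge overlapping detections; omitted in this toy
          return detections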
Creating a Persian-English Comparable Corpus, pp. 27-39
  Homa Baradaran Hashemi; Azadeh Shakery; Heshaam Feili
Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, Persian does not have rich multilingual resources, owing to some of its special features and to the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of the document topics and their publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpus and assessed its quality using different criteria. Evaluation results show the high quality of the aligned documents, and using the Persian-English comparable corpus for extracting translation knowledge seems promising.
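   A rough sketch of the alignment idea, pairing documents whose publication dates are close and whose (dictionary-translated) term vectors are similar; the weighting and thresholds are assumptions, since the abstract does not give them:

      import math
      from collections import Counter

      def cosine(a, b):
          num = sum(a[t] * b[t] for t in a.keys() & b.keys())
          den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
          return num / den if den else 0.0

      def align(english_docs, persian_docs, translate, max_days=2, min_sim=0.3):
          """Each doc is (doc_id, pub_date, Counter of terms); translate maps a Persian
          term to an English term, or None if it cannot be translated."""
          pairs = []
          for e_id, e_date, e_terms in english_docs:
              best = None
              for p_id, p_date, p_terms in persian_docs:
                  if abs((e_date - p_date).days) > max_days:
                      continue                         # publication dates too far apart
                  translated = Counter()
                  for term, count in p_terms.items():
                      t = translate(term)
                      if t:
                          translated[t] += count
                  sim = cosine(e_terms, translated)
                  if sim >= min_sim and (best is None or sim > best[1]):
                      best = (p_id, sim)
              if best:
                  pairs.append((e_id,) + best)         # (english_id, persian_id, similarity)
          return pairs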

Experimental Collections and Datasets (1)

Validating Query Simulators: An Experiment Using Commercial Searches and Purchases, pp. 40-51
  Bouke Huurnink; Katja Hofmann; Maarten de Rijke; Marc Bron
We design and validate simulators for generating queries and relevance judgments for retrieval system evaluation. We develop a simulation framework that incorporates existing and new simulation strategies. To validate a simulator, we assess whether evaluation using its output data ranks retrieval systems in the same way as evaluation using real-world data. The real-world data is obtained using logged commercial searches and associated purchase decisions. While no simulator reproduces an ideal ranking, there is a large variation in simulator performance that allows us to distinguish those that are better suited to creating artificial testbeds for retrieval experiments. Incorporating knowledge about document structure in the query generation process helps create more realistic simulators.
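   The core of the validation is an agreement check between the system ranking produced with simulated data and the one produced with real purchase data; a small sketch using Kendall's tau as the agreement statistic (the exact statistic is not named in the abstract):

      from itertools import combinations

      def kendall_tau(scores_a, scores_b):
          """Rank agreement between two {system: score} dicts over the same systems."""
          systems = sorted(scores_a)
          concordant = discordant = 0
          for s1, s2 in combinations(systems, 2):
              d = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
              concordant += d > 0
              discordant += d < 0
          return (concordant - discordant) / (len(systems) * (len(systems) - 1) / 2)

      real = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22}       # e.g. MAP from logged purchases
      simulated = {"sysA": 0.40, "sysB": 0.34, "sysC": 0.36}  # MAP from simulator output
      print(kendall_tau(real, simulated))                     # 0.33: partial agreement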
Using Parallel Corpora for Multilingual (Multi-document) Summarisation Evaluation, pp. 52-63
  Marco Turchi; Josef Steinberger; Mijail Alexandrov Kabadjov; Ralf Steinberger
We present a method for the evaluation of multilingual multi-document summarisation that saves precious annotation time and makes the evaluation results directly comparable across languages. The approach is based on manually selecting the most important sentences in a cluster of documents from a sentence-aligned parallel corpus, and on projecting the sentence selection to various target languages. We also present two ways of exploiting inter-annotator agreement levels, apply them both to a baseline sentence-extraction summariser in seven languages, and discuss the differences in results between the two evaluation versions, as well as a preliminary cross-language analysis. The same method can in principle be used to evaluate single-document summarisers or information extraction tools.
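   A minimal sketch of the projection step, assuming 1-to-1 sentence alignments (real parallel corpora also contain 1-to-n links, which need extra handling):

      def project_selection(selected_ids, alignments):
          """selected_ids: sentence ids chosen by the annotator in the source language.
          alignments: {target_language: {source_sentence_id: target_sentence_id}}."""
          return {
              lang: sorted(mapping[sid] for sid in selected_ids if sid in mapping)
              for lang, mapping in alignments.items()
          }

      selected = {2, 5, 9}                                    # annotator's picks in the source language
      alignments = {"de": {2: 3, 5: 6, 9: 10}, "fr": {2: 2, 5: 5, 9: 8}}
      print(project_selection(selected, alignments))          # {'de': [3, 6, 10], 'fr': [2, 5, 8]}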

Experimental Collections and Datasets (2)

MapReduce for Information Retrieval Evaluation: "Let's Quickly Test This on 12 TB of Data", pp. 64-69
  Djoerd Hiemstra; Claudia Hauff
We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
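   MIREX itself runs on Hadoop; the plain-Python sketch below only mirrors the shape of the job, with a map step that scores every document against every query and a reduce step that keeps the best documents per query:

      from collections import defaultdict

      def map_step(doc_id, text, queries):
          terms = text.lower().split()
          for query_id, query in queries.items():
              score = sum(terms.count(t) for t in query.lower().split())   # simple term-count score
              if score > 0:
                  yield query_id, (score, doc_id)

      def reduce_step(query_id, scored_docs, top_k=10):
          return query_id, sorted(scored_docs, reverse=True)[:top_k]

      def run(collection, queries):
          grouped = defaultdict(list)
          for doc_id, text in collection:            # the "mappers" sequentially scan every document
              for query_id, value in map_step(doc_id, text, queries):
                  grouped[query_id].append(value)
          return dict(reduce_step(q, docs) for q, docs in grouped.items())

      collection = [("d1", "web search evaluation"), ("d2", "large scale web crawl")]
      queries = {"q1": "web evaluation"}
      print(run(collection, queries))                # {'q1': [(2, 'd1'), (1, 'd2')]}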
Which Log for Which Information? Gathering Multilingual Data from Different Log File Types, pp. 70-81
  Maria Gäde; Vivien Petras; Juliane Stiller
In this paper, we present a comparative analysis of different log file types and their potential for gathering information about user behavior in a multilingual information system. We start with a discussion of the questions that need to be answered in order to form an appropriate view of user needs and requirements in a multilingual information environment, and of the possibilities of gaining this information from log files. Based on actual examples from the Europeana portal, we compare and contrast different types of log files and the information gleaned from them. We then present the Europeana Clickstream Logger, which logs and gathers extended information on user behavior, and show first examples of the data collection possibilities.

Evaluation Methodologies and Metrics (1)

Examining the Robustness of Evaluation Metrics for Patent Retrieval with Incomplete Relevance Judgements, pp. 82-93
  Walid Magdy; Gareth J. F. Jones
Recent years have seen a growing interest in research into patent retrieval. One of the key issues in conducting information retrieval (IR) research is meaningful evaluation of the effectiveness of the retrieval techniques applied to the task under investigation. Unlike many well-explored IR tasks where the focus is on achieving high retrieval precision, patent retrieval is to a significant degree a recall-focused task. The standard evaluation metric used for patent retrieval evaluation tasks is currently mean average precision (MAP). However, MAP does not reflect system recall well. Meanwhile, the alternative of using the standard recall measure does not reflect user search effort, which is a significant factor in practical patent search environments. In recent work we introduced a novel evaluation metric for patent retrieval evaluation, PRES [13], designed to reflect both system recall and user effort. Analysis of PRES demonstrated its greater effectiveness in evaluating recall-oriented applications than standard MAP and Recall. One dimension of the evaluation of patent retrieval which has not previously been studied is the effect of incomplete relevance judgements on the reliability of the evaluation metrics. We provide a study comparing the behaviour of PRES against the standard MAP and Recall metrics for varying degrees of incompleteness of the judgements in patent retrieval. Experiments carried out using runs from the CLEF-IP 2009 datasets show that PRES and Recall are more robust than MAP to incomplete relevance sets for this task, with a slight preference for PRES as the most robust evaluation metric for patent retrieval with respect to the completeness of the relevance set.
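   For concreteness, a sketch of PRES as we read its definition in [13]: relevant documents not retrieved within the first N_max results are assumed to appear immediately after the cut-off (the worst case for the searcher), so the score falls with both missed documents and late ranks:

      def pres(found_ranks, total_relevant, n_max):
          """found_ranks: 1-based ranks of the relevant documents retrieved within n_max."""
          missing = total_relevant - len(found_ranks)
          worst_case = [n_max + i for i in range(1, missing + 1)]   # unfound docs placed after the cut-off
          mean_rank = sum(list(found_ranks) + worst_case) / total_relevant
          return 1 - (mean_rank - (total_relevant + 1) / 2) / n_max

      print(pres([1, 2, 3], total_relevant=3, n_max=100))   # 1.0: all relevant docs at the top
      print(pres([], total_relevant=3, n_max=100))          # 0.0: nothing found within N_max
      print(pres([10, 50], total_relevant=3, n_max=100))    # ~0.48: partial recall, some effort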
On the Evaluation of Entity Profiles, pp. 94-99
  Maarten de Rijke; Krisztian Balog; Toine Bogers; Antal van den Bosch
Entity profiling is the task of identifying and ranking descriptions of a given entity. The task may be viewed as one where the descriptions being sought are terms that need to be selected from a knowledge source (such as an ontology or thesaurus). In this case, entity profiling systems can be assessed by means of precision and recall values of the descriptive terms produced. However, recent evidence suggests that more sophisticated metrics are needed that go beyond mere lexical matching of system-produced descriptors against a ground truth, allowing for graded relevance and rewarding diversity in the list of descriptors returned. In this note, we motivate and propose such a metric.

Evaluation Methodologies and Metrics (2)

Evaluating Information Extraction, pp. 100-111
  Andrea Esuli; Fabrizio Sebastiani
The issue of how to experimentally evaluate information extraction (IE) systems has hardly received a satisfactory solution in the literature. In this paper we propose a novel evaluation model for IE and argue that, among other things, it allows (i) a correct appreciation of the degree of overlap between predicted and true segments, and (ii) a fair evaluation of the ability of a system to correctly identify segment boundaries. We describe the properties of this model, and present the results of a re-evaluation of the CoNLL'03 and CoNLL'02 Shared Tasks on Named Entity Extraction.
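   The abstract does not spell the model out; the snippet below only illustrates the kind of overlap-aware credit the paper argues for, using plain token-level F1 over predicted vs. true segments as a stand-in (it is not the proposed model):

      def token_set(segments):
          """segments: list of (start, end) token offsets, end exclusive."""
          return {i for start, end in segments for i in range(start, end)}

      def overlap_f1(predicted, gold):
          p_tokens, g_tokens = token_set(predicted), token_set(gold)
          if not p_tokens or not g_tokens:
              return 0.0
          precision = len(p_tokens & g_tokens) / len(p_tokens)
          recall = len(p_tokens & g_tokens) / len(g_tokens)
          return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

      # Predicted entity spans [3,6) and [10,12) vs. gold spans [4,6) and [10,13):
      print(overlap_f1([(3, 6), (10, 12)], [(4, 6), (10, 13)]))   # 0.8, instead of 0.0 under exact match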
Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation, pp. 112-123
  Guillaume Cabanac; Gilles Hubert; Mohand Boughanem; Claude Chrisment
We consider Information Retrieval evaluation, especially at TREC with the trec_eval program. It appears that systems are scored not only according to the relevance of the retrieved documents, but also according to document names in case of ties (i.e., when documents are retrieved with the same score). We consider this tie-breaking strategy an uncontrolled parameter influencing measure scores, and argue the case for fairer tie-breaking strategies. A study of 22 TREC editions reveals significant differences between TREC's conventional, unfair strategy and the fairer strategies we propose. This experimental result advocates using these fairer strategies when conducting evaluations.
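   A toy run makes the bias concrete: two orderings of the same tied scores, differing only in how document names break the tie, yield different average precision (the fairer strategies the paper proposes remove this dependence; they are not reproduced here):

      def average_precision(ranking, relevant):
          hits, total = 0, 0.0
          for rank, doc in enumerate(ranking, start=1):
              if doc in relevant:
                  hits += 1
                  total += hits / rank
          return total / len(relevant)

      relevant = {"doc2"}
      scores = {"doc1": 0.5, "doc2": 0.5, "doc3": 0.4}            # doc1 and doc2 are tied

      names_ascending  = sorted(scores, key=lambda d: (-scores[d], d))               # doc1 before doc2
      names_descending = sorted(scores, key=lambda d: (scores[d], d), reverse=True)  # doc2 before doc1

      print(average_precision(names_ascending, relevant))    # 0.5: the relevant doc loses the tie
      print(average_precision(names_descending, relevant))   # 1.0: the relevant doc wins the tie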
Automated Component-Level Evaluation: Present and Future, pp. 124-135
  Allan Hanbury; Henning Müller
Automated component-level evaluation of information retrieval (IR) is the main focus of this paper. We present a review of the current state of web-based and component-level evaluation. Based on these systems, we make propositions for a comprehensive framework for web service-based component-level IR system evaluation. The advantages of such an approach are considered, as well as the requirements for implementing it. Acceptance of such systems by researchers who develop components and systems is crucial for having an impact, and requires that a clear benefit be demonstrated.

Panels

The Four Ladies of Experimental Evaluation, pp. 136-139
  Donna Harman; Noriko Kando; Mounia Lalmas; Carol Peters
The goal of the panel is to present some of the main lessons that we have learned in well over a decade of experimental evaluation and to promote discussion with respect to what the future objectives in this field should be.
A PROMISE for Experimental Evaluation, pp. 140-144
  Martin Braschler; Khalid Choukri; Nicola Ferro; Allan Hanbury; Jussi Karlgren; Henning Müller; Vivien Petras; Emanuele Pianta; Maarten de Rijke; Giuseppe Santucci
The Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation (PROMISE) is a Network of Excellence, starting in conjunction with this first independent CLEF 2010 conference, designed to support and develop the evaluation of multilingual and multimedia information access systems, largely by building on the activities taking place in the Cross-Language Evaluation Forum (CLEF) today and taking them forward in important new ways.
   PROMISE is coordinated by the University of Padua, and comprises 10 partners: the Swedish Institute for Computer Science, the University of Amsterdam, Sapienza University of Rome, University of Applied Sciences of Western Switzerland, the Information Retrieval Facility, the Zurich University of Applied Sciences, the Humboldt University of Berlin, the Evaluation and Language Resources Distribution Agency, and the Centre for the Evaluation of Language Communication Technologies.
   The single most important step forward for multilingual and multimedia information access that PROMISE will work towards is an open evaluation infrastructure that supports automation and collaboration in the evaluation process.