
CLEF 2012: International Conference of the Cross-Language Evaluation Forum

Fullname: CLEF 2012: Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics: Third International Conference of the CLEF Initiative
Editors: Tiziana Catarci; Pamela Forner; Djoerd Hiemstra; Anselmo Peñas; Giuseppe Santucci
Location: Rome, Italy
Dates: 2012-Sep-17 to 2012-Sep-20
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 7488
Standard No: DOI: 10.1007/978-3-642-33247-0; ISBN: 978-3-642-33246-3 (print), 978-3-642-33247-0 (online); hcibib: CLEF12
Papers: 17
Pages: 143
Links: Online Proceedings | DBLP Contents | Online Working Notes
  1. Benchmarking and Evaluation Initiatives
  2. Information Access
  3. Evaluation Methodologies and Infrastructure
  4. Posters

Benchmarking and Evaluation Initiatives

Analysis and Refinement of Cross-Lingual Entity Linking BIBAFull-Text 1-12
  Taylor Cassidy; Heng Ji; Hongbo Deng; Jing Zheng; Jiawei Han
In this paper we propose two novel approaches to enhance cross-lingual entity linking (CLEL). One is based on cross-lingual information networks aligned by means of monolingual information extraction, and the other uses topic modeling to ensure global consistency. We enhance a strong baseline system, derived from a combination of state-of-the-art machine translation and monolingual entity linking, to achieve an 11.2% improvement in B-Cubed+ F-measure. Our system achieved highly competitive results in the NIST Text Analysis Conference (TAC) Knowledge Base Population (KBP2011) evaluation. We also provide detailed qualitative and quantitative analysis of the contributions of each approach and the remaining challenges.
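   As a point of reference for the B-Cubed+ F-measure reported above, the following Python sketch computes plain B-Cubed precision, recall, and F1 from mention-to-cluster assignments; the B-Cubed+ variant used at TAC-KBP additionally checks the linked knowledge-base identifier, which is omitted here, and all identifiers are illustrative.
      # Sketch: plain B-Cubed precision/recall/F1 over cluster assignments.
      # B-Cubed+ would additionally require the linked KB identifier to match.
      from collections import defaultdict

      def b_cubed(system, gold):
          """system, gold: dicts mapping mention id -> cluster label."""
          sys_clusters, gold_clusters = defaultdict(set), defaultdict(set)
          for m, c in system.items():
              sys_clusters[c].add(m)
          for m, c in gold.items():
              gold_clusters[c].add(m)
          precision = recall = 0.0
          for m in system:
              sys_c = sys_clusters[system[m]]
              gold_c = gold_clusters[gold[m]]
              overlap = len(sys_c & gold_c)
              precision += overlap / len(sys_c)   # per-mention precision
              recall += overlap / len(gold_c)     # per-mention recall
          n = len(system)
          p, r = precision / n, recall / n
          return p, r, 2 * p * r / (p + r)

      # Two mentions clustered together by the system, apart in the gold standard.
      print(b_cubed({"m1": "A", "m2": "A"}, {"m1": "X", "m2": "Y"}))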
Seven Years of INEX Interactive Retrieval Experiments -- Lessons and Challenges BIBAKFull-Text 13-23
  Ragnar Nordlie; Nils Pharo
This paper summarizes a major effort in interactive search investigation, the INEX i-track, run collectively over a seven-year period. We present the experimental conditions, report some of the findings of the participating groups, and examine the challenges posed by this kind of collective experimental effort.
Keywords: User studies; interactive information retrieval; information search behavior
Bringing the Algorithms to the Data: Cloud-Based Benchmarking for Medical Image Analysis BIBAKFull-Text 24-29
  Allan Hanbury; Henning Müller; Georg Langs; Marc-André Weber; Bjoern H. Menze; Tomas Salas Fernandez
Benchmarks have proven to be an important tool for advancing science in the fields of information analysis and retrieval. Problems of running benchmarks include obtaining large amounts of data, annotating them and then distributing them to the participants. Distribution of the data to participants is currently mostly done via download, which can take hours for large data sets and, in countries with slow Internet connections, even days. Sending physical hard disks has also been used to distribute very large data sets (for example by TRECVid), but this too becomes infeasible once data sets reach sizes of 5-10 TB. With cloud computing it is possible to make very large data sets available in a central place at limited cost. Instead of the data being distributed to the participants, the participants run their algorithms on virtual machines of the cloud provider. This text presents reflections and ideas from a concrete project on using cloud-based benchmarking paradigms for medical image analysis and retrieval. Two evaluation campaigns are planned for 2013 and 2014 using the proposed technology.
Keywords: benchmark; medical image analysis; anatomy detection; case-based medical information retrieval; cloud computing
Going beyond CLEF-IP: The 'Reality' for Patent Searchers? BIBAFull-Text 30-35
  Julia Jürgens; Preben Hansen; Christa Womser-Hacker
This paper gives an overview of several approaches that participants have applied in the CLEF-IP evaluation initiative. On this basis, it suggests that other techniques and experimental paradigms could help to further improve the results and make the experiments more realistic. The field of information seeking is therefore brought in and its potential benefit for patent retrieval explained. Furthermore, the different search tasks undertaken by patent searchers are introduced as possible use cases. They can serve as a basis for development in patent retrieval research, since they present the diverse scenarios with their special characteristics and thereby give the research community a realistic picture of the patent searcher's work.
MusiClef: Multimodal Music Tagging Task BIBAFull-Text 36-41
  Nicola Orio; Cynthia C. S. Liem; Geoffroy Peeters; Markus Schedl
MusiClef is a multimodal music benchmarking initiative that will be running a MediaEval 2012 Brave New Task on Multimodal Music Tagging. This paper describes the setup of this task, showing how it complements existing benchmarking initiatives and fosters less explored methodological directions in Music Information Retrieval. MusiClef deals with a concrete use case, encourages multimodal approaches based on it, and strives for as much transparency of results as possible. Transparency is encouraged at several levels and stages, from the feature extraction procedure up to the evaluation phase, in which a dedicated categorization of ground truth tags will be used to deepen the understanding of the relation between the proposed approaches and experimental results.

Information Access

Generating Pseudo Test Collections for Learning to Rank Scientific Articles BIBAFull-Text 42-53
  Richard Berendsen; Manos Tsagkias; Maarten de Rijke; Edgar Meij
Pseudo test collections are automatically generated to provide training material for learning to rank methods. We propose a method for generating pseudo test collections in the domain of digital libraries, where data is relatively sparse, but comes with rich annotations. Our intuition is that documents are annotated to make them better findable for certain information needs. We use these annotations and the associated documents as a source for pairs of queries and relevant documents. We investigate how learning to rank performance varies when we use different methods for sampling annotations, and show how our pseudo test collection ranks systems compared to editorial topics with editorial judgements. Our results demonstrate that it is possible to train a learning to rank algorithm on generated pseudo judgments. In some cases, performance is on par with learning on manually obtained ground truth.
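   The general idea of mining annotations for training pairs can be illustrated with the hypothetical sketch below: each sampled annotation is treated as a query for which the annotated document is assumed relevant. The record schema and sampling strategy are assumptions for illustration, not the authors' exact procedure.
      # Hypothetical sketch: derive pseudo (query, relevant document) pairs from
      # document annotations, e.g. author-assigned keywords in a digital library.
      import random

      def pseudo_pairs(docs, max_annotations_per_doc=2, seed=42):
          """docs: list of {'id': ..., 'annotations': [...]} records (illustrative schema)."""
          rng = random.Random(seed)
          pairs = []
          for doc in docs:
              annotations = doc.get("annotations", [])
              # Sample a few annotations per document and treat each as a query
              # for which this document is assumed relevant.
              k = min(max_annotations_per_doc, len(annotations))
              for ann in rng.sample(annotations, k):
                  pairs.append((ann.lower(), doc["id"]))
          return pairs

      docs = [{"id": "d1", "annotations": ["information retrieval", "evaluation"]},
              {"id": "d2", "annotations": ["machine translation"]}]
      print(pseudo_pairs(docs))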
Effects of Language and Topic Size in Patent IR: An Empirical Study BIBAFull-Text 54-66
  Florina Piroi; Mihai Lupu; Allan Hanbury
We revisit the effects that various characteristics of the topic documents have on system effectiveness for the task of finding prior art in the patent domain. In doing so, we provide readers interested in approaching the domain with a guide to the issues that need to be addressed in this context.
   For the current study, we select two patent-based test collections with a common document representation schema and look at topic characteristics specific to the objectives of the collections. We look at the effect of language on retrieval and at the length of the topic documents. We present the correlations between these topic facets and the retrieval results, as well as the relevant documents.
Cross-Language High Similarity Search Using a Conceptual Thesaurus BIBAFull-Text 67-75
  Parth Gupta; Alberto Barrón-Cedeño; Paolo Rosso
This work addresses the issue of cross-language high similarity and near-duplicate search, where, for a given document, a highly similar one is to be identified in a large cross-language document collection. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models, and we find that, although the proposed model is very generic, it produces competitive results and is notably stable and consistent across the corpora.
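   The flavour of a concept-based cross-language similarity model can be illustrated as follows: documents in different languages are mapped onto a shared concept space (here a toy stand-in for a multilingual thesaurus such as Eurovoc) and compared by cosine similarity over concept vectors. The lexicon and weighting below are illustrative assumptions, not the authors' model.
      # Hypothetical sketch: map documents onto shared thesaurus concepts and
      # compare them with cosine similarity over concept vectors.
      from collections import Counter
      from math import sqrt

      # Toy term -> concept lexicon standing in for a multilingual thesaurus.
      LEXICON = {
          "tax": "C_TAXATION", "steuer": "C_TAXATION",
          "energy": "C_ENERGY", "energie": "C_ENERGY",
      }

      def concept_vector(text):
          tokens = text.lower().split()
          return Counter(LEXICON[t] for t in tokens if t in LEXICON)

      def cosine(u, v):
          dot = sum(u[c] * v[c] for c in u if c in v)
          norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
          return dot / norm if norm else 0.0

      en = concept_vector("energy tax reform")
      de = concept_vector("reform der energie steuer")
      print(cosine(en, de))  # high similarity despite different languages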
The Appearance of the Giant Component in Descriptor Graphs and Its Application for Descriptor Selection BIBAKFull-Text 76-81
  Anita Keszler; Levente Kovács; Tamás Szirányi
The paper presents a random-graph-based analysis approach for evaluating descriptors based on pairwise distance distributions on real data. Starting from the Erdős-Rényi model, the paper presents results of investigating random geometric graph behaviour in relation to the appearance of the giant component, as a basis for choosing descriptors based on their clustering properties. Experimental results prove the existence of the giant component in such graphs and, based on the evaluation of the graphs' behaviour, the corresponding descriptors are compared and validated in proof-of-concept retrieval tests.
Keywords: feature selection; graph analysis; giant components
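   A minimal sketch of the underlying construction: descriptor vectors (here random 2-D points) become graph nodes, edges connect pairs whose pairwise distance falls below a threshold, and the size of the largest connected component is tracked as the threshold grows, showing where a giant component appears. This illustrates the graph-building step only, not the paper's exact selection criterion.
      # Sketch: threshold graph over descriptors and size of its largest component.
      import random
      from math import dist

      def largest_component(points, threshold):
          # Union-find over the points.
          parent = list(range(len(points)))
          def find(x):
              while parent[x] != x:
                  parent[x] = parent[parent[x]]
                  x = parent[x]
              return x
          for i in range(len(points)):
              for j in range(i + 1, len(points)):
                  if dist(points[i], points[j]) < threshold:
                      parent[find(i)] = find(j)
          sizes = {}
          for i in range(len(points)):
              r = find(i)
              sizes[r] = sizes.get(r, 0) + 1
          return max(sizes.values())

      random.seed(0)
      descriptors = [(random.random(), random.random()) for _ in range(200)]
      for t in (0.02, 0.05, 0.1, 0.2):
          print(t, largest_component(descriptors, t))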
Hidden Markov Model for Term Weighting in Verbose Queries BIBAKFull-Text 82-87
  Xueliang Yan; Guanglai Gao; Xiangdong Su; Hongxi Wei; Xueliang Zhang; Qianqian Lu
It has been observed that short queries generally perform better than their corresponding long versions when retrieved with the same IR model. This is mainly because most current models do not distinguish the importance of different terms in the query. Observing that sentence-like queries encode information related to term importance in their grammatical structure, we propose a Hidden Markov Model (HMM) based method to extract such information for term weighting. The choice of an HMM is motivated by its successful application to capturing the relationship between adjacent terms in the NLP field. Since we are dealing with queries in natural language form, we believe that an HMM can also capture the dependence between the weights and the grammatical structure. Our experiments show that this assumption is quite reasonable and that such information, when utilized properly, can greatly improve retrieval performance.
Keywords: Hidden Markov Model; Verbose Query; Term Weighting
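   A hypothetical sketch of HMM-based term weighting: two hidden states (key vs. auxiliary term) emit coarse part-of-speech observations, Viterbi decoding labels each query term, and the labels are mapped to weights. The states, observations, and probabilities are invented for illustration and are not the authors' model.
      # Hypothetical two-state HMM decoded with Viterbi to weight query terms.
      STATES = ("KEY", "AUX")
      START = {"KEY": 0.5, "AUX": 0.5}
      TRANS = {"KEY": {"KEY": 0.6, "AUX": 0.4}, "AUX": {"KEY": 0.5, "AUX": 0.5}}
      EMIT = {  # P(observed POS tag | state), illustrative values
          "KEY": {"NOUN": 0.6, "VERB": 0.2, "OTHER": 0.2},
          "AUX": {"NOUN": 0.2, "VERB": 0.3, "OTHER": 0.5},
      }

      def viterbi(pos_tags):
          prob = [{s: START[s] * EMIT[s][pos_tags[0]] for s in STATES}]
          back = [{}]
          for t in range(1, len(pos_tags)):
              prob.append({})
              back.append({})
              for s in STATES:
                  best_prev = max(STATES, key=lambda p: prob[t - 1][p] * TRANS[p][s])
                  prob[t][s] = prob[t - 1][best_prev] * TRANS[best_prev][s] * EMIT[s][pos_tags[t]]
                  back[t][s] = best_prev
          state = max(STATES, key=lambda s: prob[-1][s])
          path = [state]
          for t in range(len(pos_tags) - 1, 0, -1):
              state = back[t][state]
              path.insert(0, state)
          return path

      query = ["find", "documents", "about", "solar", "energy"]
      tags = ["VERB", "NOUN", "OTHER", "NOUN", "NOUN"]
      weights = [1.0 if s == "KEY" else 0.3 for s in viterbi(tags)]
      print(list(zip(query, weights)))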

Evaluation Methodologies and Infrastructure

DIRECTions: Design and Specification of an IR Evaluation Infrastructure BIBAFull-Text 88-99
  Maristella Agosti; Emanuele Di Buccio; Nicola Ferro; Ivano Masiero; Simone Peruzzo; Gianmaria Silvello
Information Retrieval (IR) experimental evaluation is an essential part of the research on and development of information access methods and tools. Shared data sets and evaluation scenarios allow for comparing methods and systems, understanding their behaviour, and tracking performances and progress over time. On the other hand, experimental evaluation is an expensive activity in terms of the human effort, time, and costs required to carry it out.
   Software and hardware infrastructures that support experimental evaluation, as well as the management, enrichment, and exploitation of the produced scientific data, provide a key contribution to reducing such effort and costs and to carrying out systematic and thorough analysis and comparison of systems and methods, overall acting as enablers of scientific and technical advancement in the field. This paper describes the specification of an IR evaluation infrastructure by conceptually modeling the entities involved in IR experimental evaluation and their relationships, and by defining the architecture of the proposed evaluation infrastructure and the APIs for accessing it.
Penalty Functions for Evaluation Measures of Unsegmented Speech Retrieval BIBAFull-Text 100-111
  Petra Galuščáková; Pavel Pecina; Jan Hajič
This paper deals with the evaluation of information retrieval from unsegmented speech. We focus on Mean Generalized Average Precision, an evaluation measure widely used for unsegmented speech retrieval. This measure is designed to allow a certain tolerance in matching retrieval results (starting points of relevant segments) against a gold-standard relevance assessment. It employs a Penalty Function, which evaluates non-exact matches in the retrieval results based on their distance from the beginnings of their nearest true relevant segments. However, the choice of the Penalty Function is usually ad hoc and does not necessarily reflect users' perception of speech retrieval quality. We perform a lab test studying the satisfaction of users of a speech retrieval system in order to empirically estimate the optimal shape of the Penalty Function.
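   The role of the Penalty Function can be illustrated with a simple linearly decaying example: a retrieved jump-in point earns a score that decreases with its distance from the start of the nearest relevant segment and drops to zero outside a tolerance window. The linear shape and the 30-second window are assumptions for illustration, not the measure's prescribed function.
      # Hypothetical linearly decaying penalty for unsegmented speech retrieval.
      def penalty_score(retrieved_start, relevant_starts, window=30.0):
          """Return a score in [0, 1]; window is the tolerance in seconds (assumed)."""
          distance = min(abs(retrieved_start - r) for r in relevant_starts)
          return max(0.0, 1.0 - distance / window)

      relevant_starts = [120.0, 300.0]          # gold starting points (seconds)
      for start in (118.0, 135.0, 200.0):
          print(start, penalty_score(start, relevant_starts))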
Cumulated Relative Position: A Metric for Ranking Evaluation BIBAFull-Text 112-123
  Marco Angelini; Nicola Ferro; Kalervo Järvelin; Heikki Keskustalo; Ari Pirkola; Giuseppe Santucci; Gianmaria Silvello
The development of multilingual and multimedia information access systems calls for proper evaluation methodologies to ensure that they meet the expected user requirements and provide the desired effectiveness. IR research offers a strong evaluation methodology and a range of evaluation metrics, such as MAP and (n)DCG. In this paper, we propose a new metric for ranking evaluation, the Cumulated Relative Position (CRP). We start with the observation that a document of a given degree of relevance may be ranked too early or too late with regard to the ideal ranking of documents for a query. Its relative position may be negative, indicating too early a ranking; zero, indicating correct ranking; or positive, indicating too late a ranking. By cumulating these relative positions we indicate, at each ranked position, the net effect of document displacements: the CRP. We first define the metric formally and then discuss its properties, its relationship to prior metrics, and its visualization. Finally, we propose different visualizations of CRP, exploiting a test collection to demonstrate its behavior.
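   Following the description above, a rough sketch of CRP: each retrieved document's rank is compared with the interval of ranks that documents of its relevance grade occupy in the ideal ordering, yielding a negative relative position when ranked too early, zero when correctly placed, and a positive one when too late; these values are then cumulated over the ranking. The code follows this description and may differ in detail from the formal definition in the paper.
      # Sketch of Cumulated Relative Position over a graded ranking.
      def crp(ranked_grades):
          """ranked_grades: relevance grades of the retrieved documents, in rank order."""
          ideal = sorted(ranked_grades, reverse=True)
          # Ideal interval of ranks (1-based, inclusive) for each relevance grade.
          intervals = {}
          for rank, grade in enumerate(ideal, start=1):
              lo, hi = intervals.get(grade, (rank, rank))
              intervals[grade] = (min(lo, rank), max(hi, rank))
          cumulated, out = 0, []
          for rank, grade in enumerate(ranked_grades, start=1):
              lo, hi = intervals[grade]
              # Negative: ranked too early; positive: ranked too late; zero: in place.
              rp = (rank - lo) if rank < lo else (rank - hi) if rank > hi else 0
              cumulated += rp
              out.append(cumulated)
          return out

      # A highly relevant document (grade 2) ranked too late at position 4:
      print(crp([1, 0, 1, 2, 0]))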
Better than Their Reputation? On the Reliability of Relevance Assessments with Students BIBAKFull-Text 124-135
  Philipp Schaer
During the last three years we conducted several information retrieval evaluation series with more than 180 LIS students who made relevance assessments on the outcomes of three specific retrieval services. In this study we do not focus on the retrieval performance of our system but on the relevance assessments and inter-assessor reliability. To quantify the agreement we apply Fleiss' Kappa and Krippendorff's Alpha. Comparing these two statistical measures, Kappa values averaged 0.37 and Alpha values 0.15. We use the two agreement measures to drop overly unreliable assessments from our data set. When computing the differences between the unfiltered and the filtered data sets we see a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest either not working with unfiltered results or clearly documenting the disagreement rates.
Keywords: Evaluation; Students; Relevance Assessment; Information Retrieval; Inter-assessor Agreement; Inter-rater Agreement; Fleiss' Kappa; Krippendorff's Alpha
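   Fleiss' Kappa, one of the two agreement measures used in the paper, can be computed as in the sketch below from a matrix counting how many assessors placed each item into each category; Krippendorff's Alpha, which also handles missing assessments and other scales, is omitted. The example data are invented.
      # Fleiss' Kappa for agreement among a fixed number of assessors.
      # counts[i][j] = number of assessors who put item i into category j.
      def fleiss_kappa(counts):
          n_items = len(counts)
          n_raters = sum(counts[0])              # assumed constant per item
          n_cats = len(counts[0])
          # Observed agreement per item, averaged over items.
          p_bar = sum((sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                      for row in counts) / n_items
          # Chance agreement from the marginal category proportions.
          p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
          p_e = sum(p * p for p in p_j)
          return (p_bar - p_e) / (1 - p_e)

      # 4 items, 3 assessors, categories: relevant / not relevant.
      print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))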

Posters

Comparing IR System Components Using Beanplots BIBAKFull-Text 136-137
  Jens Kürsten; Maximilian Eibl
In this poster we demonstrate an approach to gain a better understanding of the interactions between search tasks, test collections and components and configurations of retrieval systems by testing a large set of experiment configurations against standard ad-hoc test collections.
Keywords: Ad-hoc Retrieval; Component-based Evaluation
Language Independent Query Focused Snippet Generation BIBAFull-Text 138-140
  Pinaki Bhaskar; Sivaji Bandyopadhyay
The present paper describes the development of a language-independent, query-focused snippet generation module. This module takes the query and the content of each retrieved document and generates a query-dependent snippet for that document. The algorithm of this module is based on sentence extraction, sentence scoring and sentence ranking. A subjective evaluation has been carried out. The English snippets obtained the best evaluation score, i.e. 1, and an overall average evaluation score of 0.83 was achieved on a scale of 0 to 1.
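   A hypothetical sketch of the extract-score-rank pipeline for query-focused snippets: split the document into sentences, score each by query-term overlap, and return the top-ranked sentences. The scoring function and example are illustrative only, not the authors' algorithm.
      # Hypothetical query-focused snippet generation by sentence scoring.
      import re

      def snippet(query, document, max_sentences=2):
          sentences = re.split(r"(?<=[.!?])\s+", document)
          q_terms = set(query.lower().split())
          def score(sentence):
              s_terms = set(re.findall(r"\w+", sentence.lower()))
              return len(q_terms & s_terms) / (len(q_terms) or 1)
          ranked = sorted(sentences, key=score, reverse=True)
          return " ".join(ranked[:max_sentences])

      doc = ("Cross-language retrieval is studied at CLEF. "
             "Snippets summarise a retrieved document for a query. "
             "The weather was fine in Rome.")
      print(snippet("query snippet document", doc))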
A Test Collection to Evaluate Plagiarism by Missing or Incorrect References BIBAFull-Text 141-143
  Solange de L. Pertile; Viviane Pereira Moreira
In recent years, several methods and tools have been developed, together with test collections, to aid plagiarism detection. However, both methods and collections have focused on content analysis, overlooking citation analysis. In this paper, we aim to fill this gap and present a test collection with cases of plagiarism by missing or incorrect references. The collection contains automatically generated academic papers into which passages from other documents have been inserted. Such passages were either adequately referenced (i.e., not plagiarized), not referenced, or incorrectly referenced. Annotation files identifying each passage enable the evaluation of plagiarism detection systems.