
CLEF 2013: International Conference of the Cross-Language Evaluation Forum

Fullname: CLEF 2013: 4th International Conference of the CLEF Initiative. Information Access Evaluation -- Multilinguality, Multimodality, and Visualization
Editors: Pamela Forner; Henning Müller; Roberto Paredes; Paolo Rosso; Benno Stein
Location: Valencia, Spain
Dates: 2013-Sep-23 to 2013-Sep-26
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 8138
Standard No: DOI: 10.1007/978-3-642-40802-1 hcibib: CLEF13; ISBN: 978-3-642-40801-4 (print), 978-3-642-40802-1 (online)
Papers: 32
Pages: 370
Links: Online Proceedings | Conference Website
  1. Evaluation and Visualization
  2. Multilinguality and Less-Resourced Languages
  3. Applications
  4. Lab Overviews

Evaluation and Visualization

The Scholarly Impact of CLEF (2000-2009) BIBAFull-Text 1-12
  Theodora Tsikrika; Birger Larsen; Henning Müller; Stefan Endrullis; Erhard Rahm
This paper assesses the scholarly impact of the CLEF evaluation campaign by performing a bibliometric analysis of the citations of the CLEF 2000-2009 proceedings publications collected through Scopus and Google Scholar. Our analysis indicates a significant impact of CLEF, particularly for its well-established Adhoc, ImageCLEF, and QA labs, and for the lab/task overview publications that attract considerable interest. Moreover, initial analysis indicates that the scholarly impact of ImageCLEF is comparable to that of TRECVid.
A Quantitative Look at the CLEF Working Notes BIBAKFull-Text 13-16
  Thomas Wilhelm-Stein; Maximilian Eibl
After seven years of participation in CLEF, we take a look back at the developments and trends in different domains such as evaluation measures and retrieval models. For that purpose, a new collection containing all CLEF working notes, including their metadata, was created and analysed.
Keywords: data mining; evaluation; retrospection; retrieval models; evaluation measures
Building a Common Framework for IIR Evaluation BIBAKFull-Text 17-28
  Mark Michael Hall; Elaine Toms
Cranfield-style evaluations standardised Information Retrieval (IR) evaluation practices, enabling the creation of programmes such as TREC, CLEF, and INEX, and long-term comparability of IR systems. However, the methodology does not translate well into the Interactive IR (IIR) domain, where the inclusion of the user in the search process and the repeated interaction between user and system create more variability than Cranfield-style evaluations can support. As a result, IIR evaluations of various systems have tended to be non-comparable, not because the systems vary, but because the methodologies used are non-comparable. In this paper we describe a standardised IIR evaluation framework that ensures that IIR evaluations can share a standardised baseline methodology in much the same way that TREC, CLEF, and INEX imposed a process on IR evaluation. The framework provides a common baseline, derived by integrating existing, validated evaluation measures, that enables inter-study comparison, but is also flexible enough to support most kinds of IIR studies. This is achieved through the use of a "pluggable" system, into which any web-based IIR interface can be embedded. The framework has been implemented and the software will be made available to reduce the resource commitment required for IIR studies.
Keywords: evaluation; methodology; interactive information retrieval
Improving Ranking Evaluation Employing Visual Analytics BIBAFull-Text 29-40
  Marco Angelini; Nicola Ferro; Giuseppe Santucci; Gianmaria Silvello
In order to satisfy diverse user needs and support challenging tasks, it is fundamental to provide automated tools to examine system behavior, both visually and analytically. This paper provides an analytical model for examining rankings produced by IR systems, based on the discounted cumulative gain family of metrics, together with visualizations for performing failure and "what-if" analyses.
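Since the model above builds on the discounted cumulative gain (DCG) family of metrics, a minimal sketch of how DCG and its normalised variant are typically computed may help readers unfamiliar with the family; this is illustrative only, not the authors' code, and the example gain values are invented.

```python
import math

def dcg(gains, cutoff=None):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    gains = gains[:cutoff] if cutoff else gains
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, cutoff=None):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), cutoff)
    return dcg(gains, cutoff) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the first five documents in a ranking.
print(round(ndcg([3, 2, 3, 0, 1], cutoff=5), 3))
```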
A Proposal for New Evaluation Metrics and Result Visualization Technique for Sentiment Analysis Tasks BIBAFull-Text 41-52
  Francisco José Valverde-Albacete; Jorge Carrillo-de-Albornoz; Carmen Peláez-Moreno
In this paper we propound the use of a number of entropy-based metrics and a visualization tool for the intrinsic evaluation of Sentiment and Reputation Analysis tasks. We provide a theoretical justification for their use and discuss how they complement other accuracy-based metrics. We apply the proposed techniques to the analysis of TASS-SEPLN and RepLab 2012 results and show how the metrics are effective for system comparison, system development, and post-mortem evaluation.
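As a rough illustration of what "entropy-based" means here, the sketch below computes the mutual information between gold and predicted sentiment labels, one simple entropy-based complement to accuracy; the paper's actual metrics and visualization are defined in the full text, and the labels in the example are made up.

```python
import math
from collections import Counter

def mutual_information(gold, predicted):
    """Mutual information (in bits) between gold and predicted labels,
    computed from their joint distribution; 0 means the predictions carry
    no information about the gold labels, whatever the accuracy."""
    n = len(gold)
    joint = Counter(zip(gold, predicted))
    p_gold, p_pred = Counter(gold), Counter(predicted)
    mi = 0.0
    for (g, p), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((p_gold[g] / n) * (p_pred[p] / n)))
    return mi

print(mutual_information(["pos", "neg", "neu", "pos"], ["pos", "neg", "pos", "pos"]))
```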
A New Corpus for the Evaluation of Arabic Intrinsic Plagiarism Detection BIBAKFull-Text 53-58
  Imene Bensalem; Paolo Rosso; Salim Chikhi
The present paper introduces the first corpus for the evaluation of Arabic intrinsic plagiarism detection. The corpus consists of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
Keywords: Arabic intrinsic plagiarism detection; evaluation corpus; automatic plagiarism generation
Selecting Success Criteria: Experiences with an Academic Library Catalogue BIBAKFull-Text 59-70
  Paul Clough; Paula Goodale
Multiple methods exist for evaluating search systems, ranging from more user-oriented approaches to those more focused on evaluating system performance. When preparing an evaluation, key questions include: (i) why conduct the evaluation, (ii) what should be evaluated, and (iii) how the evaluation should be conducted. Over recent years there has been more focus on the end users of search systems and understanding what they view as 'success'. In this paper we consider what to evaluate; in particular what criteria users of search systems consider most important and whether this varies by user characteristic. Using our experience with evaluating an academic library catalogue, input was gathered from end users relating to the perceived importance of different evaluation criteria prior to conducting an evaluation. We analyse results to show which criteria users most value, together with the inter-relationships between them. Our results highlight the necessity of conducting multiple forms of evaluation to ensure that search systems are deemed successful by their users.
Keywords: Evaluation; success criteria; digital libraries
A Dependency-Inspired Semantic Evaluation of Machine Translation Systems BIBAFull-Text 71-74
  Mohammad Reza Mirsarraf; Nazanin Dehghani
The goal of translation is to preserve the meaning of the original text. However, lexical-based machine translation (MT) evaluation metrics count the terms that the MT output shares with a human-translated reference rather than measuring similarity in meaning. In this paper, we developed an MT evaluation metric that assesses the output of MT systems semantically. Inspired by dependency grammar, we consider to what extent the headword and its dependents contribute to preserving the meaning of the original input text. Our experimental results show that this metric correlates significantly better with human judgments.
A Turing Test to Evaluate a Complex Summarization Task BIBAFull-Text 75-80
  Alejandro Molina; Eric SanJuan; Juan-Manuel Torres-Moreno
This paper deals with a new strategy to evaluate a complex Natural Language Processing (NLP) task using the Turing test. Automatic summarization based on sentence compression requires assessing informativeness and modifying inner sentence structures. This is much more closely related to real rephrasing than the plain sentence extraction and ranking paradigm, so new evaluation methods are needed. We propose a novel imitation game to evaluate Automatic Summarization by Compression (ASC). The rationale of this Turing-like evaluation could be applied to many other complex NLP tasks such as machine translation or text generation. We show that a state-of-the-art ASC system can pass such a test and simulate a human summary in 60% of the cases.
A Formative Evaluation of a Comprehensive Search System for Medical Professionals BIBAFull-Text 81-92
  Veronika Stefanov; Alexander Sachs; Marlene Kritz; Matthias Samwald; Manfred Gschwandtner; Allan Hanbury
Medical doctors need rapid and accurate answers, which they cannot easily find with current search systems. This paper describes a formative evaluation of a comprehensive search system for medical professionals. The study was designed to guide system development. The system features included search in text and 2D images, machine-translated summaries of search results, query disambiguation and suggestion features, and a comprehensive search user interface. The study design emphasizes qualitative user feedback, based on realistic simulated work tasks and data collection with spontaneous and prompted self-report, written and spoken feedback in response to questionnaires, as well as audio and video recordings and log files. Results indicate that this is a fruitful approach to uncovering problems and eliciting requirements that would be harder to find in a component-based evaluation testing each feature separately.

Multilinguality and Less-Resourced Languages

Exploiting Multiple Translation Resources for English-Persian Cross Language Information Retrieval BIBAKFull-Text 93-99
  Hosein Azarbonyad; Azadeh Shakery; Heshaam Faili
One of the most important issues in Cross Language Information Retrieval (CLIR), and one which affects the performance of CLIR systems, is how to exploit available translation resources. This issue can be even more challenging when dealing with a language that lacks appropriate translation resources. Another factor that affects the performance of a CLIR system is the degree of ambiguity of query words. In this paper, we propose to combine different translation resources for CLIR. We also propose two different methods that exploit phrases in the query translation process to address the ambiguity of query words. Our evaluation results on English-Persian CLIR show the superiority of the phrase-based and combined-translation CLIR methods over other CLIR methods.
Keywords: Cross Language Information Retrieval; English-Persian CLIR; Phrase Based Query Translation; Combining Translation Resources for CLIR
ALQASIM: Arabic Language Question Answer Selection in Machines BIBAKFull-Text 100-103
  Ahmed Magdy Ezzeldin; Mohamed Hamed Kholief; Yasser El-Sonbaty
This paper presents "ALQASIM", a question answering system that focuses on answer selection and validation. Our experiments have been conducted in the framework of the main task of QA4MRE @ CLEF 2013. ALQASIM uses a novel technique by analyzing the reading test documents instead of the questions, which leads to a promising performance of 0.31 accuracy and 0.36 C@1, without using the test-set background collections.
Keywords: Question Answering; QA4MRE; Machine Reading Evaluation; Answer Selection; Answer Validation
A Web-Based CLIR System with Cross-Lingual Topical Pseudo Relevance Feedback BIBAKFull-Text 104-107
  Xuwen Wang; Xiaojie Wang; Qiang Zhang
This paper presents the performance of a Chinese-English cross-language information retrieval (CLIR) system, which is equipped with topic-based pseudo relevance feedback. The web-based workflow simulates the real multilingual retrieval environment, and the feedback mechanism improves retrieval results automatically without putting excessive burden on users.
Keywords: Pseudo Relevance Feedback; Latent Dirichlet Allocation; Cross Language Information Retrieval; Query Expansion
A Case Study in Decompounding for Bengali Information Retrieval BIBAFull-Text 108-119
  Debasis Ganguly; Johannes Leveling; Gareth J. F. Jones
Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposition of compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach of decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. Our experiments reported in this paper show that such a standard approach does not work particularly well for Bengali IR. Some unique characteristics of Bengali compounds are: i) only one compound constituent may be a valid word, in contrast to the stricter requirement of both being so; and ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we firstly propose a more relaxed decompounding where a compound word is decomposed into only one constituent if the other constituent is not a valid word, and secondly we perform selective decompounding by ensuring that constituents co-occur often with the compound word, which indicates how related the constituents and the compound are. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8%, compared to not decompounding words.
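To make the two proposed relaxations concrete, here is a minimal sketch of the indexing-time decision described above; the lexicon, the co-occurrence scores, the threshold, and the omission of Sandhi handling are all simplifying assumptions for illustration, not the authors' implementation.

```python
def relaxed_decompound(compound, lexicon, cooccurrence, min_score=0.2):
    """Split a compound at every position, keep splits where at least one
    constituent is a valid word (relaxed decompounding), then index only
    constituents that co-occur often enough with the full compound
    (selective decompounding). Sandhi changes to the right constituent
    are ignored here for simplicity."""
    constituents = set()
    for i in range(1, len(compound)):
        left, right = compound[:i], compound[i:]
        if left in lexicon or right in lexicon:   # relaxed: one valid part suffices
            constituents.update(part for part in (left, right) if part in lexicon)
    # selective: keep constituents judged sufficiently related to the compound
    return [c for c in constituents
            if cooccurrence.get((compound, c), 0.0) >= min_score]
```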
Context-Dependent Semantic Annotation in Cross-Lingual Biomedical Resources BIBAFull-Text 120-123
  Rafael Berlanga; Antonio Jimeno-Yepes; María Pérez-Catalán; Dietrich Rebholz-Schuhmann
This paper presents a study of the impact of contexts on automatic semantic annotation of cross-lingual biomedical resources. Semantic annotation consists of associating parts of document texts with concepts described in some knowledge resource (KR). In this paper, we propose an unsupervised method for semantic annotation that uses contexts to validate the annotations. We test the method with two cross-lingual corpora, which allows us to extract correct annotations in the languages of the aligned corpora. Results show that annotated cross-lingual corpora provide grounds for a qualitative comparison of semantic annotation algorithms.
A Comparative Evaluation of Cross-Lingual Text Annotation Techniques BIBAFull-Text 124-135
  Lei Zhang; Achim Rettinger; Michael Färber; Marko Tadić
In this paper, we study the problem of extracting knowledge from textual documents written in different languages by annotating the text on the basis of a cross-lingual knowledge base, namely Wikipedia. Our contribution is twofold. First, we propose a novel framework for evaluating cross-lingual text annotation techniques, based on annotation of a parallel corpus to a hub-language in a cross-lingual knowledge base. Second, we investigate the performance of different cross-lingual text annotation techniques according to our proposed evaluation framework. We perform experiments for an empirical comparison of three approaches: (i) Cross-lingual Named Entity Annotation (CL-NEA), (ii) Cross-lingual Wikifier Annotation (CL-WIFI), and (iii) Cross-lingual Explicit Semantic Analysis (CL-ESA). Besides establishing an evaluation framework, our results show the differences between the three investigated approaches and demonstrate their advantages and disadvantages.

Applications

Mining Query Logs of USPTO Patent Examiners BIBAKFull-Text 136-142
  Wolfgang Tannebaum; Andreas Rauber
In this paper we analyze a highly professional search setting: that of patent examiners at the United States Patent and Trademark Office (USPTO). We gain insight into the search behavior of USPTO patent examiners to explore ways of enhancing query generation in patent searching. We show that query generation is highly specific to the patent domain and that patent examiners follow a strict scheme for generating text queries. Means to enhance query generation in patent search include suggesting synonyms and equivalents, co-occurring terms, and keyword phrases for the searchable features of the invention. Further, we show that term networks including synonyms and equivalents can be learned from the query logs for automatic query expansion in patent searching.
Keywords: Patent Searching; Query Log Analysis
Relevant Clouds: Leveraging Relevance Feedback to Build Tag Clouds for Image Search BIBAKFull-Text 143-149
  Luis A. Leiva; Mauricio Villegas; Roberto Paredes
Previous work in the literature has been aimed at exploring tag clouds to improve image search and potentially increase retrieval performance. However, to date none has considered the idea of building tag clouds derived from relevance feedback. We propose a simple approach to such an idea, where the tag cloud gives more importance to the words from the relevant images than the non-relevant ones. A preliminary study with 164 queries inspected by 14 participants over a 30M dataset of automatically annotated images showed that 1) tag clouds derived this way are found to be informative: users considered roughly 20% of the presented tags to be relevant for any query at any time; and 2) the importance given to the tags correlates with user judgments: tags ranked in the first positions tended to be perceived more often as relevant to the topic that users had in mind.
Keywords: Image Search and Retrieval; Relevance Feedback; Tag Clouds
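The weighting idea in the abstract above can be pictured with a Rocchio-style sketch: tags from relevant images push a term's weight up, and tags from non-relevant ones push it down. The function below, including its parameter names and the alpha/beta balance, is an illustrative assumption rather than the authors' exact formulation.

```python
from collections import Counter

def tag_cloud_weights(relevant_tag_sets, non_relevant_tag_sets, alpha=1.0, beta=0.5):
    """Weight each tag by how much more often it occurs in the tags of
    relevant images than in the tags of non-relevant ones; only positively
    weighted tags survive into the cloud."""
    rel = Counter(t for tags in relevant_tag_sets for t in tags)
    non = Counter(t for tags in non_relevant_tag_sets for t in tags)
    n_rel = max(len(relevant_tag_sets), 1)
    n_non = max(len(non_relevant_tag_sets), 1)
    weights = {t: alpha * rel[t] / n_rel - beta * non[t] / n_non
               for t in set(rel) | set(non)}
    return {t: w for t, w in weights.items() if w > 0}

# Toy feedback round: two relevant and one non-relevant image.
print(tag_cloud_weights([{"beach", "sunset"}, {"beach", "sea"}], [{"city", "beach"}]))
```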
Counting Co-occurrences in Citations to Identify Plagiarised Text Fragments BIBAFull-Text 150-154
  Solange de L. Pertile; Paolo Rosso; Viviane P. Moreira
Research in external plagiarism detection is mainly concerned with the comparison of the textual contents of a suspicious document against the contents of a collection of original documents. More recently, methods that try to detect plagiarism based on citation patterns have been proposed. These methods are particularly useful for detecting plagiarism in scientific publications. In this work, we assess the value of identifying co-occurrences in citations by checking whether this method can identify cases of plagiarism in a dataset of scientific papers. Our results show that most of the cases in which co-occurrences were found indeed correspond to plagiarised passages.
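One way to picture the citation co-occurrence check is the sketch below: pairs of references cited together in both the suspicious and the source document are treated as a signal worth inspecting. The function, its inputs, and the pair-based formulation are assumptions for illustration, not the authors' method.

```python
from itertools import combinations

def shared_citation_pairs(suspicious_refs, source_refs):
    """Return pairs of references that co-occur (are cited together) in both
    documents; a large overlap suggests passages copied along with their
    citations and flags the documents for closer textual comparison."""
    pairs_suspicious = set(combinations(sorted(set(suspicious_refs)), 2))
    pairs_source = set(combinations(sorted(set(source_refs)), 2))
    return pairs_suspicious & pairs_source

# Toy example with reference identifiers.
print(shared_citation_pairs(["R1", "R2", "R3"], ["R2", "R3", "R5"]))
```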
The Impact of Belief Values on the Identification of Patient Cohorts BIBAFull-Text 155-166
  Travis Goodwin; Sanda M. Harabagiu
Retrieving relevant patient cohorts has the potential to accelerate clinical research. Recent evaluations have shown promising results, but also that relevance measures still need to be improved. To address the challenge of better modelling hospital visit relevance, we considered the impact of two forms of medical knowledge on the quality of patient cohorts. First, we automatically identified three types of medical concepts and, second, we asserted their belief values. This allowed us to perform experiments that capture the impact of incorporating knowledge of belief values within a retrieval system for identifying hospital visits corresponding to patient cohorts. We show that this approach generates a 149% increase in inferred average precision, a 36.5% increase in NDCG, and a 207% increase in the precision of the first ten returned documents.
Semantic Discovery of Resources in Cloud-Based PACS/RIS Systems BIBAFull-Text 167-178
  Rafael Berlanga; María Pérez; Lledó Museros; Rafael Forcada
PACS/RIS systems store a huge volume of clinical data that are mostly accessed by patient identifier. However, clinicians would like to retrieve information about similar clinical cases. In this paper, we claim that semantics-based technology could improve the discovery and integration of information in this type of system. We propose a semantic approach that annotates the clinical information and retrieves the resources relevant to the clinician's query, independently of their language and format. Moreover, cloud-based systems allow integration with external resources. In this paper, we present preliminary results showing that current semantic technologies can produce good enough results to perform classification and retrieval tasks.
Subtopic Mining Based on Head-Modifier Relation and Co-occurrence of Intents Using Web Documents BIBAKFull-Text 179-191
  Se-Jong Kim; Jong-Hyeok Lee
This paper proposes a method that mines subtopics from Japanese web documents using the head-modifier relation and the co-occurrence of users' intents. We extracted subtopics using simple patterns based on the head-modifier relation between the query and its adjacent words, and returned a ranked list of subtopics using the proposed score equation. We then re-ranked the subtopics according to the intent co-occurrence measure. Our method achieved better performance than the baseline methods and the queries suggested by a major web search engine. The results of our method will be useful in various search scenarios, such as query suggestion and result diversification.
Keywords: search intent; subtopic mining; diversity; pattern; head-modifier

Lab Overviews

Cultural Heritage in CLEF (CHiC) 2013 BIBAKFull-Text 192-211
  Vivien Petras; Toine Bogers; Elaine Toms; Mark Hall; Jacques Savoy; Piotr Malak; Adam Pawlowski; Nicola Ferro; Ivano Masiero
The Cultural Heritage in CLEF 2013 lab comprised three tasks: multilingual ad-hoc retrieval and semantic enrichment in 13 languages (Dutch, English, German, Greek, Finnish, French, Hungarian, Italian, Norwegian, Polish, Slovenian, Spanish, and Swedish), Polish ad-hoc retrieval, and the interactive task, which studied user behavior via log analysis and questionnaires. For the multilingual and Polish sub-tasks, more than 170,000 documents were assessed for relevance on a tertiary scale. The multilingual task had 7 participants submitting 30 multilingual and 41 monolingual runs. The Polish task comprised 3 participating groups submitting manual and automatic runs. The interactive task had 4 participating research groups and 208 user participants in the study. For the multilingual task, results show that more participants are necessary in order to provide comparative analyses. The interactive task created a rich data set comprising questionnaire and log data. Further analysis of the data is planned for the future.
Keywords: cultural heritage; Europeana; ad-hoc retrieval; semantic enrichment; multilingual retrieval; Polish; interactive; user behavior
Overview of the ShARe/CLEF eHealth Evaluation Lab 2013 BIBAKFull-Text 212-231
  Hanna Suominen; Sanna Salanterä; Sumithra Velupillai; Wendy W. Chapman; Guergana Savova; Noemie Elhadad; Sameer Pradhan; Brett R. South; Danielle L. Mowery; Gareth J. F. Jones; Johannes Leveling; Liadh Kelly; Lorraine Goeuriot; David Martinez; Guido Zuccon
Discharge summaries and other free-text reports in healthcare transfer information between working shifts and geographic locations. Patients are likely to have difficulties in understanding their content, because of their medical jargon, non-standard abbreviations, and ward-specific idioms. This paper reports on an evaluation lab whose aim is to support the continuum of care by developing methods and resources that make clinical reports in English easier to understand for patients and help them find information related to their condition. This ShARe/CLEFeHealth2013 lab offered student mentoring and shared tasks: identification and normalisation of disorders (1a and 1b) and normalisation of abbreviations and acronyms (2) in clinical reports with respect to terminology standards in healthcare, as well as information retrieval (3) to address questions patients may have when reading clinical reports. The focus on patients' information needs, as opposed to the specialised information needs of physicians and other healthcare workers, was the main feature of the lab distinguishing it from previous shared tasks. De-identified clinical reports for the three tasks were from US intensive care and originated from the MIMIC II database. Other text documents for Task 3 were from the Internet and originated from the Khresmoi project. Task 1 annotations originated from the ShARe annotations. For Tasks 2 and 3, new annotations, queries, and relevance assessments were created. 64, 56, and 55 people registered their interest in Tasks 1, 2, and 3, respectively. 34 unique teams (3 members per team on average) participated, with 22, 17, 5, and 9 teams in Tasks 1a, 1b, 2, and 3, respectively. The teams were from Australia, China, France, India, Ireland, the Republic of Korea, Spain, the UK, and the USA. Some teams developed and used additional annotations, but this strategy contributed to system performance only in Task 2. The best systems achieved an F1 score of 0.75 in Task 1a, accuracies of 0.59 and 0.72 in Tasks 1b and 2, and a precision at 10 of 0.52 in Task 3. The results demonstrate the substantial community interest in, and the capabilities of, these systems in making clinical reports easier to understand for patients. The organisers have made the data and tools available for future research and development.
Keywords: Information Retrieval; Evaluation; Medical Informatics; Test-set Generation; Text Classification; Text Segmentation
Overview of CLEF-IP 2013 Lab BIBAFull-Text 232-249
  Florina Piroi; Mihai Lupu; Allan Hanbury
The first CLEF-IP test collection was made available in 2009 to support research on IR methods in the intellectual property domain; only one type of retrieval task (Prior Art Search) was given to the participants. Since then the test collection has been extended with both more content and varied types of tasks, reflecting various specific parts of patent experts' workflows. In 2013 we organized two tasks -- Passage Retrieval Starting from Claims and Structure Recognition -- on which we report in this work.
ImageCLEF 2013: The Vision, the Data and the Open Challenges BIBAFull-Text 250-268
  Barbara Caputo; Henning Müller; Bart Thomee; Mauricio Villegas; Roberto Paredes; David Zellhöfer; Hervé Goëau; Alexis Joly; Pierre Bonnet; Jesús Martínez Gómez; Ismael García Varea; Miguel Cazorla
This paper presents an overview of the ImageCLEF 2013 lab. Since its first edition in 2003, ImageCLEF has become one of the key initiatives promoting the benchmark evaluation of algorithms for the cross-language annotation and retrieval of images in various domains, ranging from public and personal images to data acquired by mobile robot platforms and botanical collections. Over the years, by providing new data collections and challenging tasks to the community of interest, the ImageCLEF lab has achieved a unique position in the multilingual image annotation and retrieval research landscape. The 2013 edition consisted of three tasks: the photo annotation and retrieval task, the plant identification task, and the robot vision task. Furthermore, the medical annotation task, which has traditionally been under the ImageCLEF umbrella and this year celebrates its tenth anniversary, was organized in conjunction with AMIA for the first time. The paper describes the tasks and the 2013 competition, giving a unifying perspective of the present activities of the lab while discussing the future challenges and opportunities.
Overview of INEX 2013 BIBAFull-Text 269-281
  Patrice Bellot; Antoine Doucet; Shlomo Geva; Sairam Gurajada; Jaap Kamps; Gabriella Kazai; Marijn Koolen; Arunav Mishra; Véronique Moriceau; Josiane Mothe; Michael Preminger; Eric SanJuan; Ralf Schenkel; Xavier Tannier; Martin Theobald; Matthew Trappett; Qiuyue Wang
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2013 evaluation campaign, which consisted of four activities addressing three themes: searching professional and user-generated data (Social Book Search track); searching structured or semantic data (Linked Data track); and focused retrieval (Snippet Retrieval and Tweet Contextualization tracks). INEX 2013 was an exciting year for INEX, in which we consolidated the collaboration with (other activities in) CLEF and for the second time ran our workshop as part of the CLEF labs in order to facilitate knowledge transfer between the evaluation forums. This paper gives an overview of all the INEX 2013 tracks, their aims and tasks, the test collections built, and an initial analysis of the results.
Recent Trends in Digital Text Forensics and Its Evaluation BIBAFull-Text 282-302
  Tim Gollub; Martin Potthast; Anna Beyer; Matthias Busse; Francisco Rangel; Paolo Rosso; Efstathios Stamatatos; Benno Stein
This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN 13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software itself and ran it on a computer cluster at our site. As evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches to the mentioned tasks at our disposal for further analysis.
QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation BIBAFull-Text 303-320
  Anselmo Peñas; Eduard Hovy; Pamela Forner; Álvaro Rodrigo; Richard Sutcliffe; Roser Morante
This paper describes the methodology for testing the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the aim of the QA4MRE challenge, which was run as a lab at CLEF 2011-2013. The traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts and in which systems were required to choose one answer by analysing the corresponding test document in conjunction with background text collections provided by the organization. Four different tasks have been organized during these years: Main Task, Processing Modality and Negation for Machine Reading, Machine Reading of Biomedical Texts about Alzheimer's disease, and Entrance Exams. This paper describes their motivation, their goals, the methodology for preparing the data sets, the background collections, the metrics used for evaluation, and the lessons learned over these three years.
Multilingual Question Answering over Linked Data (QALD-3): Lab Overview BIBAFull-Text 321-332
  Philipp Cimiano; Vanessa Lopez; Christina Unger; Elena Cabrio; Axel-Cyrille Ngonga Ngomo; Sebastian Walter
The third edition of the open challenge on Question Answering over Linked Data (QALD-3) has been conducted as a half-day lab at CLEF 2013. Unlike previous editions, the challenge put a strong emphasis on multilinguality, offering two tasks: one on multilingual question answering and one on ontology lexicalization. While no submissions were received for the latter, the former attracted six teams, who submitted their systems' results on the provided datasets. This paper provides an overview of QALD-3, discussing the approaches proposed by the participating systems as well as the obtained results.
Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems BIBAKFull-Text 333-352
  Enrique Amigó; Jorge Carrillo de Albornoz; Irina Chugur; Adolfo Corujo; Julio Gonzalo; Tamara Martín; Edgar Meij; Maarten de Rijke; Damiano Spina
This paper summarizes the goals, organization, and results of the second RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2013). RepLab focused on the process of monitoring the reputation of companies and individuals, and asked participant systems to annotate different types of information in tweets containing the names of several companies: first, tweets had to be classified as related or unrelated to the entity; relevant tweets then had to be classified according to their polarity for reputation (does the content of the tweet have positive or negative implications for the reputation of the entity?) and clustered into coherent topics, and the clusters had to be ranked according to their priority (potential reputation problems had to come first). The gold standard consists of more than 140,000 tweets annotated by a group of trained annotators, supervised and monitored by reputation experts.
Keywords: RepLab; Reputation Management; Evaluation Methodologies and Metrics; Test Collections; Text Clustering; Sentiment Analysis
Entity Recognition in Parallel Multi-lingual Biomedical Corpora: The CLEF-ER Laboratory Overview BIBAFull-Text 353-367
  Dietrich Rebholz-Schuhmann; Simon Clematide; Fabio Rinaldi; Senay Kafkas; Erik M. van Mulligen; Chinh Bui; Johannes Hellrich; Ian Lewin; David Milward; Michael Poprat; Antonio Jimeno-Yepes; Udo Hahn; Jan A. Kors
The identification and normalisation of biomedical entities from the scientific literature has a long tradition, and a number of challenges have contributed to the development of reliable solutions. Increasingly, patient records are processed to align their content with other biomedical data resources, but this approach requires analysing documents in different languages across Europe [1,2].
   The CLEF-ER challenge has been organized by the Mantra project partners to improve entity recognition (ER) in multilingual documents. Several corpora in different languages, i.e. Medline titles, EMEA documents and patent claims, have been prepared to enable ER in parallel documents. The participants have been asked to annotate entity mentions with concept unique identifiers (CUIs) in the documents of their preferred non-English language.
   The evaluation determines the number of correctly identified entity mentions against a silver standard (Task A) and the performance for the identification of CUIs in the non-English corpora (Task B). The participants could make use of the prepared terminological resources for entity normalisation and of the English silver standard corpora (SSCs) as input for concept candidates in the non-English documents.
   The participants used different approaches, including translation techniques and word or phrase alignments, apart from lexical lookup and other text mining techniques. The performance for Tasks A and B was lower on the patent corpus in comparison to Medline titles and EMEA documents. In the patent documents, chemical entities were identified at higher performance, whereas the other two document types cover a higher portion of medical terms. The number of novel terms provided from all corpora is currently under investigation.
   Altogether, the CLEF-ER challenge demonstrates the performance of annotation solutions in different languages against an SSC.