| From Information Retrieval to Information Interaction | | BIBA | Full-Text | 1-11 | |
| Gary Marchionini | |||
| This paper argues that a new paradigm for information retrieval has evolved that incorporates human attention and mental effort and takes advantage of new types of information objects and relationships that have emerged in the WWW environment. One aspect of this new model is attention to highly interactive user interfaces that engage people directly and actively in information seeking. Two examples of these kinds of interfaces are described. | |||
| IR and AI: Traditions of Representation and Anti-representation in Information Processing | | BIBA | Full-Text | 12-26 | |
| Yorick Wilks | |||
| The paper discusses the traditional, and ongoing, question of whether natural language processing (NLP) techniques, or indeed any representational techniques at all, aid in the retrieval of information, as that task is traditionally understood. The discussion is partly a response to Karen Sparck Jones' (1999) claim that artificial intelligence, and by implication NLP, should learn from the methodology of Information Retrieval (IR), rather than vice versa, as the first sentence above implies. The issue has been made more interesting and complicated by the shift of interest from classic IR experiments with very long queries to Internet search queries, which typically consist of two highly ambiguous terms. This simple fact has changed the assumptions of the debate. Moreover, the return to statistical and empirical methods within NLP has made it less clear what an NLP technique, or even a "representational" method, is. The paper also notes the growth of "language models" within IR and the use of the term "translation" in recent years to describe a range of activities, including IR, which constitutes rather the opposite of what Sparck Jones was calling for. | |||
| A User-Centered Approach to Evaluating Topic Models | | BIBA | Full-Text | 27-41 | |
| Diane Kelly; Fernando Diaz; Nicholas J. Belkin; James Allan | |||
| This paper evaluates the automatic creation of personal topic models using two language model-based clustering techniques. The results of these methods are compared with user-defined topic classes of web pages from personal web browsing histories gathered over a 5-week period. The histories and topics were collected during a naturalistic case study of the online information search and use behavior of two users. The paper further investigates the effectiveness of using display time and retention behaviors as implicit evidence for weighting documents during topic model creation. Results show that agglomerative techniques -- specifically, average-link clustering -- provide the most effective methodology for building topic models while ignoring topic evidence and implicit evidence. | |||
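To make the winning technique concrete, here is a minimal sketch of average-link agglomerative clustering over sparse term-weight vectors. The cosine similarity, the merge threshold, and the dict-based document representation are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between sparse vectors given as dicts (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def average_link_clusters(docs, threshold=0.2):
    """Agglomerative average-link clustering: repeatedly merge the two clusters
    with the highest mean pairwise document similarity, stopping when no pair
    of clusters exceeds the threshold."""
    sim = {frozenset(p): cosine(docs[p[0]], docs[p[1]])
           for p in combinations(range(len(docs)), 2)}
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        def avg(a, b):
            members = [(i, j) for i in clusters[a] for j in clusters[b]]
            return sum(sim[frozenset(m)] for m in members) / len(members)
        a, b = max(combinations(range(len(clusters)), 2), key=lambda p: avg(*p))
        if avg(a, b) < threshold:
            break
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Each surviving cluster then corresponds to one candidate topic model.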
| A Study of User Interaction with a Concept-Based Interactive Query Expansion Support Tool | | BIBA | Full-Text | 42-56 | |
| Hideo Joho; Mark Sanderson; Micheline Beaulieu | |||
| A medium-scale user study was carried out to investigate the usability of a concept-based query expansion support tool. The tool was fully integrated into the interface of an IR system, and designed to support the user by offering automatically generated concept hierarchies. Two types of hierarchies were compared with a baseline. Several observations were made as a result of the study: 1) the hierarchy is often accessed after an examination of the first page of search results; 2) accessing the hierarchies reduces the number of iterations and paging actions; 3) accessing the hierarchies increases the chance of finding relevant items, compared to the baseline; 4) the hierarchical structure helps the users to handle a large number of concepts; and finally, 5) subjects were not aware of the difference between the two types of hierarchies. | |||
| Searcher's Assessments of Task Complexity for Web Searching | | BIBA | Full-Text | 57-71 | |
| David J. Bell; Ian Ruthven | |||
| The complexity of search tasks has been shown to be an important factor in searchers' ability to find relevant information and their satisfaction with the performance of search engines. In user evaluations of search engines, an understanding of how task complexity affects search behaviour is necessary for properly interpreting the results of an evaluation. In this paper we examine the issue of search task complexity for the purposes of evaluation. In particular we concentrate on the searchers' ability to recognise the internal complexity of search tasks, how complexity is affected by task design, and how complexity affects the success of searching. | |||
| Evaluating Passage Retrieval Approaches for Question Answering | | BIBA | Full-Text | 72-84 | |
| Ian Roberts; Robert Gaizauskas | |||
| Automatic open domain question answering (QA) has been the focus of much recent research, stimulated by the introduction of a QA track in TREC in 1999. Many QA systems have been developed and most follow the same broad pattern of operation: first an information retrieval (IR) system, often passage-based, is used to find passages from a large document collection which are likely to contain answers, and then these passages are analysed in detail to extract answers from them. Most research to date has focused on this second stage, with relatively little detailed investigation into aspects of IR component performance which impact on overall QA system performance. In this paper, we (a) introduce two new measures, coverage and answer redundancy, which we believe capture aspects of IR performance specifically relevant to QA more appropriately than do the traditional recall and precision measures, and (b) demonstrate their use in evaluating a variety of passage retrieval approaches using questions from TREC-9 and TREC 2001. | |||
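The two proposed measures have compact definitions; the sketch below assumes a caller-supplied `contains_answer(question_id, passage)` predicate (in TREC evaluations this is typically derived from answer patterns).

```python
def coverage(runs, contains_answer, n):
    """Fraction of questions with at least one answer-bearing passage in the
    top n. `runs` maps a question id to its ranked passage list."""
    hit = sum(1 for q, passages in runs.items()
              if any(contains_answer(q, p) for p in passages[:n]))
    return hit / len(runs)

def answer_redundancy(runs, contains_answer, n):
    """Mean number of answer-bearing passages in the top n per question."""
    total = sum(sum(1 for p in passages[:n] if contains_answer(q, p))
                for q, passages in runs.items())
    return total / len(runs)
```

Coverage bounds the best end-to-end accuracy the QA system can reach, while redundancy indicates how much corroborating evidence the answer extractor has to work with.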
| Identification of Relevant and Novel Sentences Using Reference Corpus | | BIBA | Full-Text | 85-98 | |
| Hsin-Hsi Chen; Ming-Feng Tsai; Ming-Hung Hsu | |||
| A major challenge in determining the relevance and the novelty of sentences is the limited amount of information available for similarity computation among sentences. An information retrieval (IR) approach using a reference corpus is proposed. A sentence is treated as a query to the reference corpus, and similarity is measured in terms of the weighting vectors of the document lists ranked by IR systems. Two sentences are regarded as similar if they are related to similar document lists returned by the IR systems. A dynamic threshold setting method is also presented. Besides IR with a reference corpus, we also use IR systems to retrieve sentences directly from the given sentences. The corpus-based approach with dynamic thresholds outperforms the direct retrieval approach. The average F-measures of relevance and novelty detection using the Okapi system were 0.212 and 0.207, i.e. 57.14% and 58.64% of human performance, respectively. | |||
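A minimal sketch of the reference-corpus idea: each sentence is issued as a query, its ranked document list becomes its representation, and two sentences are compared via those lists. The 1/rank weighting, list depth, and fixed novelty threshold are illustrative choices standing in for the paper's weighting vectors and dynamic thresholds.

```python
def doclist_vector(ranked_doc_ids, depth=100):
    """Represent a sentence by the documents a reference-corpus IR system
    returns for it, weighted by rank (an illustrative 1/rank scheme)."""
    return {doc: 1.0 / (rank + 1)
            for rank, doc in enumerate(ranked_doc_ids[:depth])}

def similar(u, v):
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def novel_sentences(sentences, search, threshold=0.5):
    """Keep a sentence only if it is not too similar to any sentence kept
    before. `search(s)` queries the reference corpus and returns ranked
    document ids."""
    kept, vectors = [], []
    for s in sentences:
        v = doclist_vector(search(s))
        if all(similar(v, w) < threshold for w in vectors):
            kept.append(s)
            vectors.append(v)
    return kept
```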
| Answer Selection in a Multi-stream Open Domain Question Answering System | | BIBA | Full-Text | 99-111 | |
| Valentin Jijkoun; Maarten de Rijke | |||
| Question answering systems aim to meet users' information needs by returning exact answers in response to a question. Traditional open domain question answering systems are built around a single pipeline architecture. In an attempt to exploit multiple resources as well as multiple answering strategies, systems based on a multi-stream architecture have recently been introduced. Such systems face the challenging problem of having to select a single answer from pools of answers obtained using essentially different techniques. We report on experiments aimed at understanding and evaluating the effect of different options for answer selection in a multi-stream question answering system. We examine the impact of local tiling techniques, assignments of weights to streams based on past performance and/or question type, as well as redundancy-based ideas. Our main finding is that redundancy-based ideas in combination with naively learned stream weights conditioned on question type work best, and improve significantly over a number of baselines. | |||
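A minimal sketch of the winning combination, redundancy plus per-question-type stream weights. The lowercase normalisation stands in for the paper's answer tiling, and the weight table is assumed to be learned from past performance; both are simplifications.

```python
from collections import defaultdict

def select_answer(pools, weights, qtype):
    """Choose one answer from several streams' candidate pools.
    `pools` maps stream name -> list of candidate answer strings;
    `weights[qtype][stream]` is an assumed pre-learned weight. Candidates
    that recur across streams reinforce each other (the redundancy idea)."""
    score = defaultdict(float)
    for stream, answers in pools.items():
        w = weights.get(qtype, {}).get(stream, 1.0)
        for a in answers:
            score[a.strip().lower()] += w
    return max(score, key=score.get) if score else None
```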
| A Bidimensional View of Documents for Text Categorisation | | BIBA | Full-Text | 112-126 | |
| Giorgio Maria Di Nunzio | |||
| This paper addresses the problem of finding a bidimensional representation of textual documents for text categorisation. The projection of documents is performed in successive steps. The main idea is to consider two aspects of the importance of a word: its local importance in a category, and its global importance in the rest of the categories. This information is combined and summarized in two coordinates. A machine learning method may then be used in this simple bidimensional space to classify the documents. The results obtained in this space are satisfactory with respect to the best state-of-the-art performances. | |||
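A sketch of the projection idea: one coordinate from in-category term statistics, one from the remaining categories. The log-frequency weighting is an invented stand-in for the paper's exact combination.

```python
import math

def project(doc_terms, cat_freq, other_freq):
    """Map a document to (local, global) coordinates for one category.
    `cat_freq[t]` / `other_freq[t]` are term frequencies inside the category
    and in the remaining categories; the averaged log-frequency weighting is
    an illustrative choice, not the paper's formula."""
    local = sum(math.log(1 + cat_freq.get(t, 0)) for t in doc_terms)
    glob = sum(math.log(1 + other_freq.get(t, 0)) for t in doc_terms)
    n = len(doc_terms) or 1
    return local / n, glob / n
```

Any simple classifier, even a linear separator, can then be trained in this plane.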
| Query Difficulty, Robustness, and Selective Application of Query Expansion | | BIBA | Full-Text | 127-137 | |
| Giambattista Amati; Claudio Carpineto; Giovanni Romano | |||
| There is increasing interest in improving the robustness of IR systems, i.e. their effectiveness on difficult queries. A system is robust when it achieves both a high Mean Average Precision (MAP) value for the entire set of topics and a significant MAP value over its worst X topics (MAP(X)). It is a well-known fact that Query Expansion (QE) increases global MAP but hurts the performance on the worst topics. A selective application of QE would thus be a natural answer for obtaining a more robust retrieval system. We define two information-theoretic functions which are shown to be correlated, respectively, with the average precision and with the increase of average precision under the application of QE. The second measure is used to selectively apply QE. This method achieves performance similar to that of the unexpanded method on the worst topics, and better performance than full QE on the whole set of topics. | |||
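A sketch of selective query expansion. The clarity-style divergence below is only a stand-in for the paper's information-theoretic predictors (the exact functions differ); queries scoring below the threshold run unexpanded.

```python
import math

def qe_benefit_score(top_docs_tf, collection_tf, collection_len):
    """Divergence of the top-ranked documents' language model from the
    collection model, used here as a predictor of whether expansion helps.
    `top_docs_tf` is a list of per-document term-frequency dicts."""
    pooled, total = {}, 0
    for tf in top_docs_tf:
        for t, f in tf.items():
            pooled[t] = pooled.get(t, 0) + f
            total += f
    return sum((f / total) * math.log2((f / total) /
               (collection_tf.get(t, 1) / collection_len))
               for t, f in pooled.items())

def selective_qe(query, run_plain, run_expanded, score, threshold=1.0):
    """Expand only queries predicted to benefit; fragile ones run unexpanded."""
    return run_expanded(query) if score(query) >= threshold else run_plain(query)
```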
| Combining CORI and the Decision-Theoretic Approach for Advanced Resource Selection | | BIBA | Full-Text | 138-153 | |
| Henrik Nottelmann; Norbert Fuhr | |||
| In this paper we combine two existing resource selection approaches, CORI and the decision-theoretic framework (DTF). The state-of-the-art system CORI belongs to the large group of heuristic resource ranking methods which select a fixed number of libraries with respect to their similarity to the query. In contrast, DTF computes an optimum resource selection with respect to overall costs (from different sources, e.g. retrieval quality, time, money). We improve CORI by integrating it with DTF: the number of relevant documents in a library is approximated by applying a linear or a logistic function to the CORI library scores. Based on this value, one of the existing DTF variants (employing a recall-precision function) estimates the number of relevant documents in the result set. Our evaluation shows that precision in the top ranks of this technique is higher than for the existing resource selection methods for long queries and lower for short queries; on average, the combined approach outperforms CORI and the other DTF variants. | |||
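A much-reduced sketch of the combination: a logistic mapping from CORI scores to expected relevant documents, followed by a greedy cost/benefit selection standing in for DTF's optimisation. The coefficients and the budgeted greedy step are invented placeholders, not the paper's fitted model.

```python
import math

def expected_relevant(cori_score, size, a=10.0, b=-5.0):
    """Map a CORI library score onto an estimated number of relevant documents
    with a logistic function scaled by library size (a, b would be fitted on
    training queries; the values here are placeholders)."""
    return size / (1.0 + math.exp(-(a * cori_score + b)))

def select_libraries(scores, sizes, costs, budget):
    """Greedy stand-in for decision-theoretic selection: prefer libraries with
    the highest expected relevant documents per unit cost, within a budget."""
    est = {lib: expected_relevant(s, sizes[lib]) for lib, s in scores.items()}
    chosen, spent = [], 0.0
    for lib in sorted(est, key=lambda l: est[l] / costs[l], reverse=True):
        if spent + costs[lib] <= budget:
            chosen.append(lib)
            spent += costs[lib]
    return chosen
```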
| Predictive Top-Down Knowledge Improves Neural Exploratory Bottom-Up Clustering | | BIBA | Full-Text | 154-166 | |
| Chihli Hung; Stefan Wermter; Peter Smith | |||
| In this paper, we explore the hypothesis that integrating symbolic top-down knowledge into text vector representations can improve neural exploratory bottom-up representations for text clustering. By extracting semantic rules from WordNet, terms with similar concepts are substituted with a more general term, the hypernym. This hypernym semantic relationship supplements the neural model in document clustering. The neural model is based on the extended significance vector representation approach into which predictive top-down knowledge is embedded. When we examine our hypothesis using six competitive neural models, the results are consistent and demonstrate that our robust hybrid neural approach is able to improve classification accuracy and reduce the average quantization error on 100,000 full-text articles. | |||
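The hypernym substitution step is easy to illustrate with NLTK's WordNet interface (requires the WordNet corpus, e.g. via `nltk.download('wordnet')`). Taking the first sense and its first direct hypernym is a simplification of the paper's rule extraction.

```python
from nltk.corpus import wordnet as wn

def generalise(term):
    """Replace a noun by the lemma of its first sense's direct hypernym,
    falling back to the term itself when WordNet has no entry."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if synsets:
        hypernyms = synsets[0].hypernyms()
        if hypernyms:
            return hypernyms[0].lemmas()[0].name()
    return term

# e.g. "salmon" and "trout" may map to a shared, more general term, so
# documents using either word move closer together in vector space.
```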
| Contextual Document Clustering | | BIBA | Full-Text | 167-180 | |
| Vladimir Dobrynin; David Patterson; Niall Rooney | |||
| In this paper we present a novel algorithm for document clustering. The approach is based on distributional clustering, where subject-related words, which have a narrow context, are identified to form meta-tags for that subject. These contextual words form the basis for creating thematic clusters of documents. As in other research on document clustering, we analyze the quality of this approach with respect to document categorization problems and show it to outperform the information-theoretic method of sequential information bottleneck. | |||
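One plausible way to operationalise "narrow context" is low entropy of a word's co-occurrence distribution, sketched below; the window size and entropy cutoff are illustrative parameters, not the paper's settings.

```python
import math
from collections import Counter, defaultdict

def narrow_context_words(docs, window=5, max_entropy=4.0, min_count=10):
    """Find words whose co-occurrence distribution has low entropy, i.e. that
    appear in a narrow, subject-specific context. `docs` is a list of token
    lists."""
    contexts = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            for n in doc[max(0, i - window):i + window + 1]:
                if n != w:
                    contexts[w][n] += 1
    chosen = []
    for w, ctx in contexts.items():
        total = sum(ctx.values())
        if total < min_count:
            continue
        h = -sum((c / total) * math.log2(c / total) for c in ctx.values())
        if h <= max_entropy:
            chosen.append(w)
    return chosen
```

Documents are then grouped around the contexts of these words to form thematic clusters.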
| Complex Linguistic Features for Text Classification: A Comprehensive Study | | BIBA | Full-Text | 181-196 | |
| Alessandro Moschitti; Roberto Basili | |||
| Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were observed to be ineffective at increasing retrieval accuracy. Fewer and less definitive studies are available for Text Categorization (TC), as it is a relatively new research area (compared to document retrieval). In this paper, advanced document representations have been investigated. Extensive experimentation on representative classifiers, Rocchio and SVM, as well as a careful analysis of the literature, has been carried out to study how some NLP techniques used for indexing impact TC. Cross-validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy. | |||
| Eliminating High-Degree Biased Character Bigrams for Dimensionality Reduction in Chinese Text Categorization | | BIBA | Full-Text | 197-208 | |
| Dejun Xue; Maosong Sun | |||
| High dimensionality of the feature space is a main obstacle for Text Categorization (TC). In a candidate feature set consisting of Chinese character bigrams, there exist a number of bigrams which are high-degree biased according to character frequencies. Usually, these bigrams are likely to survive the process of feature selection because of their strength in discriminating documents. However, most of them are useless for document categorization because of their weakness in representing document contents. The paper first defines a criterion to identify high-degree biased Chinese bigrams. Then, two schemes, called σ-BR1 and σ-BR2, are proposed to deal with these bigrams: the former directly eliminates them from the feature set, whereas the latter replaces them with the corresponding significant characters involved. Experimental results show that the high-degree biased bigrams should be eliminated from the feature set, and that the σ-BR1 scheme is quite effective for further dimensionality reduction in Chinese text categorization, after a feature selection process with a Chi-CIG score function. | |||
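A sketch of the two schemes under one plausible bias criterion, a frequency-ratio test between the bigram's two characters; the ratio test, the threshold σ, and the choice of the rarer character as the "significant" one are assumptions, not the paper's exact definitions.

```python
def is_biased(bigram, char_freq, sigma=20.0):
    """A bigram counts as high-degree biased when one of its characters is far
    more frequent than the other (an illustrative ratio criterion)."""
    f1, f2 = char_freq.get(bigram[0], 1), char_freq.get(bigram[1], 1)
    return max(f1, f2) / min(f1, f2) >= sigma

def apply_br1(features, char_freq):
    """σ-BR1: drop biased bigrams from the feature set."""
    return [b for b in features if not is_biased(b, char_freq)]

def apply_br2(features, char_freq):
    """σ-BR2: replace a biased bigram by its significant character (assumed
    here to be the rarer, more content-bearing one)."""
    return [b if not is_biased(b, char_freq)
            else min(b, key=lambda c: char_freq.get(c, 1))
            for b in features]
```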
| Broadcast News Gisting Using Lexical Cohesion Analysis | | BIBA | Full-Text | 209-222 | |
| Nicola Stokes; Eamonn Newman; Joe Carthy; Alan F. Smeaton | |||
| In this paper we describe an extractive method of creating very short summaries or gists that capture the essence of a news story using a linguistic technique called lexical chaining. The recent interest in robust gisting and title generation techniques originates from a need to improve the indexing and browsing capabilities of interactive digital multimedia systems. More specifically these systems deal with streams of continuous data, like a news programme, that require further annotation before they can be presented to the user in a meaningful way. We automatically evaluate the performance of our lexical chaining-based gister with respect to four baseline extractive gisting methods on a collection of closed caption material taken from a series of news broadcasts. We also report results of a human-based evaluation of summary quality. Our results show that our novel lexical chaining approach to this problem outperforms standard extractive gisting methods. | |||
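As an illustration of lexical chaining, the sketch below greedily groups a story's nouns into chains via shared WordNet synsets and direct hypernym links. Real chainers use a richer relation set, sense disambiguation, and chain scoring, so this is a much-reduced sketch (requires NLTK's WordNet data).

```python
from nltk.corpus import wordnet as wn

def chainable(w1, w2):
    """Two nouns can join a chain if they share a synset or one noun's synset
    is a direct hypernym of the other's (a much-reduced relation set)."""
    s1 = set(wn.synsets(w1, wn.NOUN))
    s2 = set(wn.synsets(w2, wn.NOUN))
    if s1 & s2:
        return True
    h1 = {h for s in s1 for h in s.hypernyms()}
    h2 = {h for s in s2 for h in s.hypernyms()}
    return bool(h1 & s2) or bool(h2 & s1)

def build_chains(nouns):
    """Greedy chaining: attach each noun to the first compatible chain."""
    chains = []
    for w in nouns:
        for chain in chains:
            if any(chainable(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains
```

The words of the strongest (e.g. longest) chain then point at the sentence or title words that best capture the story's gist.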
| From Text Summarisation to Style-Specific Summarisation for Broadcast News | | BIBA | Full-Text | 223-237 | |
| Heidi Christensen; BalaKrishna Kolluru; Yoshihiko Gotoh; Steve Renals | |||
| In this paper we report on a series of experiments investigating the path from text-summarisation to style-specific summarisation of spoken news stories. We show that the portability of traditional text summarisation features to broadcast news is dependent on the diffusiveness of the information in the broadcast news story. An analysis of two categories of news stories (containing only read speech or some spontaneous speech) demonstrates the importance of the style and the quality of the transcript, when extracting the summary-worthy information content. Further experiments indicate the advantages of doing style-specific summarisation of broadcast news. | |||
| Relevance Feedback for Cross Language Image Retrieval | | BIBA | Full-Text | 238-252 | |
| Paul Clough; Mark Sanderson | |||
| In this paper we show how relevance feedback can be used to improve retrieval performance for a cross language image retrieval task through query expansion. This area of CLIR is different from existing problems, but has thus far received little attention from CLIR researchers. Using the ImageCLEF test collection, we simulate user interaction with a CL image retrieval system, and in particular the situation in which a user selects one or more relevant images from the top n. Using textual captions associated with the images, relevant images are used to create a feedback model in the Lemur language model for information retrieval, and our results show that feedback is beneficial, even when only one relevant document is selected. This is particularly useful for cross language retrieval, where problems during translation can result in a poor initial ranked list with few relevant documents in the top n. We find that the number of feedback documents and the influence of the initial query on the feedback model most affect retrieval performance. | |||
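A minimal sketch of language-model feedback from selected captions: the original query model is interpolated with a model estimated from the captions of the images the user marked relevant. The mixture weight λ and the term cutoff k are illustrative; the paper uses Lemur's own feedback machinery.

```python
from collections import Counter

def feedback_query(original_terms, selected_captions, lam=0.5, k=10):
    """Interpolate a uniform original-query model with a caption-based
    feedback model. `selected_captions` is a list of token lists from the
    user-selected relevant images."""
    fb = Counter(t for caption in selected_captions for t in caption)
    total = sum(fb.values()) or 1
    q_w = 1.0 / len(original_terms)
    model = {t: (1 - lam) * q_w for t in original_terms}
    for t, f in fb.most_common(k):
        model[t] = model.get(t, 0.0) + lam * f / total
    return model
```

The interpolation parameter λ directly controls "the influence of the initial query on the feedback model" that the paper identifies as a key factor.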
| NNk Networks for Content-Based Image Retrieval | | BIBA | Full-Text | 253-266 | |
| Daniel Heesch; Stefan Rüger | |||
| This paper describes a novel interaction technique to support content-based image search in large image collections. The idea is to represent each image as a vertex in a directed graph. Given a set of image features, an arc is established between two images if there exists at least one combination of features for which one image is retrieved as the nearest neighbour of the other. Each arc is weighted by the proportion of feature combinations for which the nearest neighbour relationship holds. By thus integrating the retrieval results over all possible feature combinations, the resulting network helps expose the semantic richness of images and thus provides an elegant solution to the problem of feature weighting in content-based image retrieval. We give details of the method used for network generation and describe the ways a user can interact with the structure. We also provide an analysis of the network's topology and provide quantitative evidence for the usefulness of the technique. | |||
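A sketch of how the arcs of an NNk network could be computed for one image, assuming per-feature distances to all other images are available and that the weight combinations are enumerated on a coarse grid; both assumptions are illustrative simplifications.

```python
def nnk_arcs(distances, weight_combinations):
    """Build NNk arcs for one image. `distances[f][j]` is the distance to
    image j under feature f (the image itself excluded); each element of
    `weight_combinations` maps feature -> weight, summing to 1. The arc
    weight to j is the fraction of combinations under which j is the
    nearest neighbour."""
    arcs = {}
    for w in weight_combinations:
        combined = {}
        for f, wf in w.items():
            for j, d in distances[f].items():
                combined[j] = combined.get(j, 0.0) + wf * d
        nn = min(combined, key=combined.get)
        arcs[nn] = arcs.get(nn, 0) + 1
    return {j: c / len(weight_combinations) for j, c in arcs.items()}

# For two features, a coarse grid might be:
# [{'colour': i / 10, 'texture': 1 - i / 10} for i in range(11)]
```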
| Integrating Perceptual Signal Features within a Multi-facetted Conceptual Model for Automatic Image Retrieval | | BIBA | Full-Text | 267-282 | |
| Mohammed Belkhatir; Philippe Mulhem; Yves Chiaramella | |||
| The majority of content-based image retrieval (CBIR) systems are restricted to the representation of signal aspects, e.g. color, texture..., without explicitly considering the semantic content of images. According to these approaches a sun, for example, is represented by an orange or yellow circle, but not by the term "sun". The signal-oriented solutions are fully automatic, and thus easily usable on substantial amounts of data, but they do not bridge the existing gap between the extracted low-level features and semantic descriptions. This penalizes qualitative and quantitative performance in terms of recall and precision, and therefore users' satisfaction. Another class of methods, tested within the framework of the Fermi-GC project, models the content of images through a process of human-assisted indexing. This approach, based on an elaborate model of representation (the conceptual graph formalism), provides satisfactory results during the retrieval phase, but is not easily usable on large collections of images because of the human intervention required for indexing. The contribution of this paper is twofold: first, in order to achieve more efficiency as far as user interaction is concerned, we propose to establish a bond between these two classes of image retrieval systems and integrate signal and semantic features within a unified conceptual framework. Then, as opposed to state-of-the-art relevance feedback systems dealing with this integration, we propose a representation formalism supporting this integration, which allows us to specify a rich query language combining both semantic and signal characterizations. We validate our approach through quantitative (recall-precision curve) evaluations. | |||
| Improving Retrieval Effectiveness by Reranking Documents Based on Controlled Vocabulary | | BIBA | Full-Text | 283-295 | |
| Jaap Kamps | |||
| Classification terms are commonly available in online text collections and digital libraries, for example as manually assigned keywords or key-phrases from a controlled vocabulary in scientific collections. Our goal is to explore the use of this additional classification information for improving retrieval effectiveness. Earlier research explored the effect of adding classification terms to user queries, leading to little or no improvement. We explore a new feedback technique that reranks the set of initially retrieved documents based on the controlled vocabulary terms assigned to the documents. Since we do not want to rely on the availability of special dictionaries or thesauri, we compute the meaning of controlled vocabulary terms based on their occurrence in the collection. Our reranking strategy significantly improves retrieval effectiveness in domain-specific collections. Experimental evaluation is done on the German GIRT and French Amaryllis collections, using the test-suite of the Cross-Language Evaluation Forum (CLEF). | |||
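A sketch of one way such a reranking could work: controlled-vocabulary terms frequent among the top of the initial ranking vote for documents carrying them, and the vote score is blended with the original retrieval score. The blending weight α, the top-m cutoff, and the assumption of normalised initial scores are all illustrative.

```python
from collections import Counter

def rerank(initial_run, doc_cv_terms, top_m=10, alpha=0.5):
    """Rerank an initial result list (doc, score pairs, scores assumed
    normalised to [0, 1]) using the controlled-vocabulary terms assigned
    to each document."""
    top = [d for d, _ in initial_run[:top_m]]
    votes = Counter(t for d in top for t in doc_cv_terms.get(d, []))
    max_votes = sum(votes.values()) or 1
    def combined(item):
        d, score = item
        cv = sum(votes[t] for t in doc_cv_terms.get(d, [])) / max_votes
        return (1 - alpha) * score + alpha * cv
    return sorted(initial_run, key=combined, reverse=True)
```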
| A Study of the Assessment of Relevance for the INEX'02 Test Collection | | BIBA | Full-Text | 296-310 | |
| Gabriella Kazai; Sherezad Masood; Mounia Lalmas | |||
| We investigate possible assessment trends and inconsistencies within the collected relevance assessments of the INEX'02 test collection in order to provide a critical analysis of the employed relevance criterion and assessment procedure for the evaluation of content-oriented XML retrieval approaches. | |||
| A Simulated Study of Implicit Feedback Models | | BIBA | Full-Text | 311-326 | |
| Ryen W. White; Joemon M. Jose; C. J. van Rijsbergen; Ian Ruthven | |||
| In this paper we report on a study of implicit feedback models for unobtrusively tracking the information needs of searchers. Such models use relevance information gathered from searcher interaction and can be a potential substitute for explicit relevance feedback. We introduce a variety of implicit feedback models designed to enhance an Information Retrieval (IR) system's representation of searchers' information needs. To benchmark their performance we use a simulation-centric evaluation methodology that measures how well each model learns relevance and improves search effectiveness. The results show that a heuristic-based binary voting model and one based on Jeffrey's rule of conditioning [5] outperform the other models under investigation. | |||
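A minimal sketch of the heuristic binary voting idea: every behavioural signal that crosses a threshold (display time, saving, bookmarking, and so on; the behaviour names here are invented) casts one vote for all terms of the viewed document, and the top-voted terms feed back into the query representation.

```python
def binary_votes(doc_terms, behaviours, thresholds):
    """Score terms by behavioural votes. `behaviours[doc]` maps a behaviour
    name to its observed value for that document; a behaviour votes when it
    meets its threshold. Unlisted behaviours never vote."""
    votes = {}
    for doc, observed in behaviours.items():
        n = sum(1 for b, value in observed.items()
                if value >= thresholds.get(b, float('inf')))
        for t in doc_terms.get(doc, []):
            votes[t] = votes.get(t, 0) + n
    return sorted(votes, key=votes.get, reverse=True)

# e.g. thresholds = {'display_seconds': 20, 'saved': 1, 'bookmarked': 1}
```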
| Cross-Language Information Retrieval Using EuroWordNet and Word Sense Disambiguation | | BIBA | Full-Text | 327-337 | |
| Paul Clough; Mark Stevenson | |||
| One of the aims of EuroWordNet (EWN) was to provide a resource for Cross-Language Information Retrieval (CLIR). In this paper we present experiments which test the usefulness of EWN for this purpose via a formal evaluation using the Spanish queries from the TREC6 CLIR test set. All CLIR systems using bilingual dictionaries must find a way of dealing with multiple translations, and we employ a Word Sense Disambiguation (WSD) algorithm for this purpose. It was found that this algorithm achieved only around 50% correct disambiguation when compared with manual judgement; however, retrieval performance using the senses it returned was 90% of that recorded using manually disambiguated queries. | |||
| Fault-Tolerant Fulltext Information Retrieval in Digital Multilingual Encyclopedias with Weighted Pattern Morphing | | BIBA | Full-Text | 338-352 | |
| Wolfram M. Esser | |||
| This paper introduces a new approach to adding fault-tolerance to a fulltext retrieval system. The weighted pattern morphing technique circumvents some of the disadvantages of the widely used edit distance measure and can serve as a front end to almost any fast non-fault-tolerant search engine. The technique enables approximate searches by carefully generating a set of modified patterns (morphs) from the original user pattern and by searching for promising members of this set with a non-fault-tolerant search backend. Morphing is done by recursively applying so-called submorphs, driven by a penalty weight matrix. The algorithm can handle the phonetic similarities that often occur in multilingual scientific encyclopedias, as well as normal typing errors such as omission or swapping of letters. We demonstrate the process of filtering out less promising morphs. We also show how results from approximate search experiments carried out on a huge encyclopedic text corpus were used to determine reasonable parameter settings. A commercial pharmaceutical CD-ROM encyclopedia, a dermatological online encyclopedia and an online e-Learning system use an implementation of the presented approach, proving its "road capability". | |||
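The core generation step is easy to sketch: substitution rules with penalty weights are applied recursively while the accumulated penalty stays within a budget, and every surviving morph is handed to the exact backend. The submorph table and budget below are invented examples, not the paper's tuned parameters.

```python
# Illustrative submorph table: (substring, replacement) -> penalty weight.
SUBMORPHS = {('f', 'ph'): 2, ('ph', 'f'): 2, ('c', 'k'): 3, ('k', 'c'): 3,
             ('y', 'i'): 4, ('i', 'y'): 4}

def morphs(pattern, budget=6):
    """Generate the set of morphs reachable from `pattern` by recursively
    applying submorphs without exceeding the penalty budget."""
    results = {pattern}
    def expand(p, remaining):
        for (src, dst), penalty in SUBMORPHS.items():
            if penalty > remaining:
                continue
            start = p.find(src)
            while start != -1:
                m = p[:start] + dst + p[start + len(src):]
                if m not in results:
                    results.add(m)
                    expand(m, remaining - penalty)
                start = p.find(src, start + 1)
    expand(pattern, budget)
    return results

# morphs("fysiology") contains "physiology", reachable within the budget.
```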
| Measuring a Cross Language Image Retrieval System | | BIBA | Full-Text | 353-363 | |
| Mark Sanderson; Paul Clough; Catherine Paterson; Wai Tung Lo | |||
| Cross language information retrieval is a field of study that has received significant research attention, resulting in systems that despite the errors of automatic translation (from query to document), on average, produce relatively good retrieval results. Traditionally, most work has focussed on retrieval from sets of newspaper articles; however, other forms of collection are being searched: one example being the cross language retrieval of images by text caption. Limited past work has established, through test collection evaluation, that as with traditional CLIR, image CLIR is effective. This paper presents two studies that start to establish the usability of such a system: first, a test collection-based examination, which avoids traditional measures of effectiveness, is described and results from it are discussed; second, a preliminary usability study of a working cross language image retrieval system is presented. Together the examinations show that, in general, searching for images captioned in a language unknown to a searcher is usable. | |||
| An Optimistic Model for Searching Web Directories | | BIBA | Full-Text | 364-377 | |
| Fidel Cacheda; Ricardo Baeza-Yates | |||
| Web directories are taxonomies for the classification of Web documents using a directed acyclic graph of categories. This paper introduces an optimistic model for Web directories that improves the performance of restricted searches. The model treats the directed acyclic graph of categories as a tree with some "exceptions". The validity of this optimistic model has been analysed by developing it and comparing it with a basic model and a hybrid model with partial information. The proposed model improves the response time of the basic model by 50%; with respect to the hybrid model, both systems provide similar response times, except for large answers, where the optimistic model outperforms the hybrid model by approximately 61%. Moreover, in a saturated workload environment, the optimistic model proved to perform better than the basic and hybrid models for all types of queries. | |||
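One plausible realisation of the tree-with-exceptions idea, not necessarily the paper's implementation: number the spanning tree in preorder so that subtree membership becomes a cheap interval test, and check the few non-tree DAG edges separately.

```python
def build_intervals(tree, root):
    """Preorder-number the spanning tree so a category's subtree occupies a
    contiguous [start, end) interval. `tree` maps a category to its tree
    children; remaining DAG edges go into an exception map."""
    intervals, counter = {}, [0]
    def visit(c):
        start = counter[0]
        counter[0] += 1
        for child in tree.get(c, []):
            visit(child)
        intervals[c] = (start, counter[0])
    visit(root)
    return intervals

def in_category(doc_cat, query_cat, intervals, exceptions):
    """Restricted-search test: interval containment on the tree, plus an
    explicit check of the few 'exception' DAG edges below query_cat."""
    s, e = intervals[query_cat]
    ds, _ = intervals[doc_cat]
    if s <= ds < e:
        return True
    return any(in_category(doc_cat, x, intervals, exceptions)
               for x in exceptions.get(query_cat, []))
```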
| Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data | | BIBA | Full-Text | 378-393 | |
| Felix Weigel; Holger Meuss; François Bry; Klaus U. Schulz | |||
| Not only since the advent of XML have many applications called for efficient structured document retrieval, challenging both Information Retrieval (IR) and database (DB) research. Most approaches combining indexing techniques from both fields still separate path and content matching, merging the hits in an expensive join. This paper shows that retrieval is significantly accelerated by processing text and structure simultaneously. The Content-Aware DataGuide (CADG) interleaves IR and DB indexing techniques to minimize path matching and suppress joins at query time, also saving needless I/O operations during retrieval. Extensive experiments prove the CADG to outperform the DataGuide [11,14] by a factor of 5 to 200 on average. For structurally unselective queries, it is over 400 times faster than the DataGuide. The best results were achieved on large collections of heterogeneously structured textual documents. | |||
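A much-reduced sketch of the underlying idea, annotating each structure-index node (one distinct label path, as in a DataGuide) with its own term-to-documents map, so a path-plus-keyword query is answered in a single index walk with no join. The real CADG is considerably more elaborate.

```python
from collections import defaultdict

class CADGNode:
    """One DataGuide node (a distinct label path) annotated with content:
    an inverted mapping term -> set of document ids."""
    def __init__(self):
        self.children = {}
        self.postings = defaultdict(set)

def index(root, doc_id, path, text_terms):
    """Register a document's text under its label path, e.g.
    index(root, 7, ['article', 'title'], ['xml', 'retrieval'])."""
    node = root
    for label in path:
        node = node.children.setdefault(label, CADGNode())
    for t in text_terms:
        node.postings[t].add(doc_id)

def lookup(root, path, term):
    """Match path and content simultaneously: walk the path, then read the
    node-local postings, with no separate join step."""
    node = root
    for label in path:
        node = node.children.get(label)
        if node is None:
            return set()
    return node.postings.get(term, set())
```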
| Performance Analysis of Distributed Architectures to Index One Terabyte of Text | | BIBA | Full-Text | 394-408 | |
| Fidel Cacheda; Vassilis Plachouras; Iadh Ounis | |||
| We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load. | |||
| Applying the Divergence from Randomness Approach for Content-Only Search in XML Documents | | BIBA | Full-Text | 409-419 | |
| Mohammad Abolhassani; Norbert Fuhr | |||
| Content-only retrieval of XML documents deals with the problem of locating the smallest XML elements that satisfy the query. In this paper, we investigate the application of a specific language model for this task, namely Amati's approach of divergence from randomness. First, we investigate different ways of applying this model without modification, by redefining the concept of an (atomic) document for the XML setting. However, this approach yields a retrieval quality lower than that of the best method known before. We then improved retrieval quality by extending the basic model with an additional factor that refers to the hierarchical structure of XML documents. | |||
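For reference, here is one standard instantiation of the divergence-from-randomness family the paper builds on (a geometric randomness model with Laplace after-effect and H2 length normalisation, often written GL2); the paper's contribution additionally multiplies in a factor reflecting the element's position in the XML hierarchy, which is not shown here.

```python
import math

def dfr_gl2(tf, doc_len, avg_len, term_coll_freq, n_docs, c=1.0):
    """Term weight under the GL2 divergence-from-randomness model.
    `term_coll_freq` is the term's total frequency in the collection and
    `n_docs` the number of (atomic) documents, however those are defined
    for the XML setting."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)   # H2 length normalisation
    lam = term_coll_freq / n_docs                      # mean term frequency
    inf1 = tfn * math.log2((1 + lam) / lam) + math.log2(1 + lam)
    return inf1 / (tfn + 1)                            # Laplace after-effect
```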