| The Next Generation Web Search and the Demise of the Classic IR Model | | BIBA | Full-Text | 1 | |
| Andrei Broder | |||
| The classic IR model assumes a human engaged in an activity that generates an
"information need". This need is verbalized and then expressed as a query to a
search engine over a defined corpus. In the past decade, Web search engines
have evolved from a first generation based on classic IR algorithms scaled to
web size and thus supporting only informational queries, to a second generation
supporting navigational queries using web specific information (primarily link
analysis), to a third generation enabling transactional and other "semantic"
queries based on a variety of technologies aimed to directly satisfy the
unexpressed "user intent", thus moving further and further away from the
classic model.
| What is coming next? In this talk, we identify two trends, both representing "short-circuits" of the model. The first is the trend towards context-driven Information Supply (IS); that is, the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query. The information supply concept greatly precedes information retrieval; what is new in the web framework is the ability to supply relevant information specific to a given activity and a given user, while the activity is being performed. Thus the entire verbalization and query-formation phase is eliminated. The second trend is "social search", driven by the fact that the Web has evolved into being simultaneously a huge repository of knowledge and a vast social environment. As such, it is often more effective to ask the members of a given web milieu than to construct elaborate queries. This short-circuits only the query formulation, but allows information-finding activities, such as opinion elicitation and discovery of social norms, that are not expressible at all as queries against a fixed corpus. | |||
| The Last Half-Century: A Perspective on Experimentation in Information Retrieval | | BIBA | Full-Text | 2 | |
| Stephen Robertson | |||
| The experimental evaluation of information retrieval systems has a venerable history. Long before the current notion of a search engine, in fact before search by computer was even feasible, people in the library and information science community were beginning to tackle the evaluation issue. Sometimes it feels as though evaluation methodology has become fixed (stable or frozen, according to your viewpoint). However, this is far from the case. Interest in methodological questions is as great now as it ever was, and new ideas are continuing to develop. This talk will be a personal take on the field. | |||
| Learning in Hyperlinked Environments | | BIBA | Full-Text | 3 | |
| Marco Gori | |||
| A remarkable number of important problems in different domains (e.g. web mining, pattern recognition, biology ...) are naturally modeled by functions defined on graphical domains, rather than on traditional vector spaces. Following the recent developments in statistical relational learning, in this talk, I introduce Diffusion Learning Machines (DLM), whose computation is very much related to Web ranking schemes based on link analysis. Using arguments from function approximation theory, I argue that, as a matter of fact, DLM can compute any conceivable ranking function on the Web. The learning is based on a human supervision scheme that takes into account both the content and the links of the pages. I give very promising experimental results on artificial tasks and on the learning of functions used in link analysis, like PageRank, HITS, and TrustRank. Interestingly, the proposed learning mechanism is proven to be effective even when the rank depends jointly on the page content and on the links. Finally, I argue that the propagation of the relationships expressed by the links dramatically reduces the sample complexity with respect to traditional learning machines operating on vector spaces, thus making its application to real-world Web problems, like spam detection and page classification, feasible. | |||
| A Parameterised Search System | | BIBA | Full-Text | 4-15 | |
| Roberto Cornacchia; Arjen P. de Vries | |||
| This paper introduces the concept of a Parameterised Search System (PSS), which allows flexibility in user queries, and, more importantly, allows system engineers to easily define customised search strategies. Putting this idea into practice requires a carefully designed system architecture that supports a declarative abstraction language for the specification of search strategies. These specifications should stay as close as possible to the problem definition (i.e., the retrieval model to be used in the search application), abstracting away the details of the physical organisation of data and content. We show how extending an existing XML retrieval system with an abstraction mechanism based on array databases meets this requirement. | |||
| Similarity Measures for Short Segments of Text | | BIBA | Full-Text | 16-27 | |
| Donald Metzler; Susan Dumais; Christopher Meek | |||
| Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency. | |||
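As a small illustration of the data-sparseness problem this paper addresses (these are not the authors' measures; the queries and thresholds below are illustrative), plain word overlap scores related queries near zero, while a character n-gram variant recovers some of the shared surface structure:

```python
# Two simple lexical baselines for short-segment similarity; a sketch only.

def jaccard(a, b):
    """Word-overlap similarity between two short text segments."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def char_ngrams(text, n=3):
    """Character n-grams soften the exact-token-match requirement."""
    s = text.lower().replace(" ", "_")
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def ngram_jaccard(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

q1, q2 = "seattle best hotels", "seattle hotel deals"
print(jaccard(q1, q2))        # low: only one shared token
print(ngram_jaccard(q1, q2))  # higher: shared character structure
```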
| Multinomial Randomness Models for Retrieval with Document Fields | | BIBA | Full-Text | 28-39 | |
| Vassilis Plachouras; Iadh Ounis | |||
| Document fields, such as the title or the headings of a document, offer a way to consider the structure of documents for retrieval. Most of the proposed approaches in the literature employ either a linear combination of scores assigned to different fields, or a linear combination of frequencies in the term frequency normalisation component. In the context of the Divergence From Randomness framework, we have a sound opportunity to integrate document fields in the probabilistic randomness model. This paper introduces novel probabilistic models for incorporating fields in the retrieval process using a multinomial randomness model and its information theoretic approximation. The evaluation results from experiments conducted with a standard TREC Web test collection show that the proposed models perform as well as a state-of-the-art field-based weighting model, while at the same time, they are theoretically founded and more extensible than current field-based models. | |||
| On Score Distributions and Relevance | | BIBA | Full-Text | 40-51 | |
| Stephen Robertson | |||
| We discuss the idea of modelling the statistical distributions of scores of documents, classified as relevant or non-relevant. Various specific combinations of standard statistical distributions have been used for this purpose. Some theoretical considerations indicate problems with some of the choices of pairs of distributions. Specifically, we revisit a generalisation of the well-known inverse relationship between recall and precision: some choices of pairs of distributions violate this generalised relationship. We identify the choices and the violations, and explore some of the consequences of this theoretical view. | |||
| Modeling Term Associations for Ad-Hoc Retrieval Performance Within Language Modeling Framework | | BIBAK | Full-Text | 52-63 | |
| Xing Wei; W. Bruce Croft | |||
| Previous research has shown that using term associations could improve the
effectiveness of information retrieval (IR) systems. However, most of the
existing approaches focus on query reformulation; document reformulation has
only recently begun to be studied. In this paper, we study how to utilize term
association measures to do document modeling, and what types of measures are
effective in document language models. We propose a probabilistic term
association measure, compare it to some traditional methods, such as the
similarity coefficient and window-based methods, in the language modeling (LM)
framework, and show that significant improvements over query likelihood (QL)
retrieval can be obtained. We also compare the method with state-of-the-art
document modeling techniques based on latent mixture models. Keywords: Information Retrieval; Language Model; Term/Word Associations/
Relationships; Term/Word similarity; Document Model; Topic Model | |||
| Static Pruning of Terms in Inverted Files | | BIBA | Full-Text | 64-75 | |
| Roi Blanco; Álvaro Barreiro | |||
| This paper addresses the problem of identifying collection-dependent stop-words in order to reduce the size of inverted files. We present four methods to automatically recognise stop-words, analyse the tradeoff between efficiency and effectiveness, and compare them with a previous pruning approach. The experiments allow us to conclude that in some situations stop-word pruning is competitive with other inverted file reduction techniques. | |||
| Efficient Indexing of Versioned Document Sequences | | BIBA | Full-Text | 76-87 | |
| Michael Herscovici; Ronny Lempel; Sivan Yogev | |||
| Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desired to enable free-text search over such repositories, i.e. to enable submitting queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities. | |||
| Light Syntactically-Based Index Pruning for Information Retrieval | | BIBA | Full-Text | 88-100 | |
| Christina Lioma; Iadh Ounis | |||
| Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks, which also contain 'non content-bearing parts of speech', such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general. | |||
| Sorting Out the Document Identifier Assignment Problem | | BIBA | Full-Text | 101-112 | |
| Fabrizio Silvestri | |||
| The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance, since it allows a better exploitation of the memory hierarchy. In this paper we empirically show that, in the case of collections of Web documents, we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a quite large document collection composed of about six million documents. The results are very encouraging, since we can improve the compression ratio up to 40% using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory. | |||
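A minimal sketch of the general idea on a toy corpus (this is not Silvestri's code): identifiers follow lexicographic URL order, and posting lists are stored as variable-byte encoded d-gaps, which shrink when similar documents receive nearby identifiers:

```python
def assign_ids_by_url(urls):
    """Map each URL to its rank in lexicographic URL order."""
    return {url: i for i, url in enumerate(sorted(urls))}

def vbyte(n):
    """Variable-byte encoding: 7 payload bits per byte, high bit terminates."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def compress_postings(doc_ids):
    """Encode a sorted posting list as v-byte d-gaps."""
    out, prev = bytearray(), 0
    for d in sorted(doc_ids):
        out += vbyte(d - prev)
        prev = d
    return bytes(out)

urls = ["http://a.com/x", "http://a.com/y", "http://b.org/", "http://a.com/z"]
ids = assign_ids_by_url(urls)
postings = [ids["http://a.com/x"], ids["http://a.com/y"], ids["http://a.com/z"]]
print(len(compress_postings(postings)))  # 3 one-byte gaps for clustered docs
```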
| Efficient Construction of FM-index Using Overlapping Block Processing for Large Scale Texts | | BIBAK | Full-Text | 113-123 | |
| Di Zhang; Yunquan Zhang; Jing Chen | |||
| In previous implementations of the FM-index, the construction algorithms
usually need memory several times larger than the text itself. Sometimes this
memory requirement prevents the FM-index from being employed in processing
large scale texts. In this paper, we design an approach to constructing the
FM-index based on
overlapping block processing. It can build the FM-index in linear time and
constant temporary memory space, especially suitable for large scale texts.
Instead of loading and indexing the text as a whole, the new approach splits
the text into blocks of fixed size, and then indexes each of them separately.
To ensure the correctness and effectiveness of query operations, before
indexing we further append a certain number of succeeding characters to the
end of each block. The experimental results show that, with a slight loss in
compression ratio and query performance, our implementation provides a faster
and more flexible solution for the problem of construction efficiency. Keywords: FM-index; Self-index; Block processing | |||
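The sketch below illustrates only the block-splitting idea under assumed parameters; a real implementation would build an FM-index per block, for which plain substring search stands in here:

```python
# Split the text into fixed-size blocks, each extended by the next `overlap`
# characters, so any pattern of length <= overlap + 1 that straddles a block
# boundary is still found. Parameters are illustrative.

def split_with_overlap(text, block_size, overlap):
    return [text[s:s + block_size + overlap]
            for s in range(0, len(text), block_size)]

def search_blocks(blocks, pattern, block_size):
    """Query every block independently; deduplicate boundary matches by
    converting block-local positions to global text offsets."""
    hits = set()
    for i, block in enumerate(blocks):
        pos = block.find(pattern)
        while pos != -1:
            hits.add(i * block_size + pos)
            pos = block.find(pattern, pos + 1)
    return sorted(hits)

text = "abracadabra" * 5
blocks = split_with_overlap(text, block_size=8, overlap=4)
assert search_blocks(blocks, "cadab", 8) == [
    i for i in range(len(text) - 4) if text.startswith("cadab", i)]
```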
| Performance Comparison of Clustered and Replicated Information Retrieval Systems | | BIBAK | Full-Text | 124-135 | |
| Fidel Cacheda; Victor Carneiro; Vassilis Plachouras; Iadh Ounis | |||
| The amount of information available over the Internet is increasing daily,
as are the importance and magnitude of Web search engines. Systems based on a
single centralised index present several problems (such as lack of
scalability), which lead to the use of distributed information retrieval
systems to effectively search for and locate the required information. A
distributed retrieval system can be clustered and/or replicated. In this paper,
using simulations, we present a detailed performance analysis, both in terms of
throughput and response time, of a clustered system compared to a replicated
system. In addition, we consider the effect of changes in the query topics over
time. We show that the performance obtained by a clustered system does not
improve on that of the best replicated system. Indeed, the
main advantage of a clustered system is the reduction of network traffic.
However, the use of a switched network eliminates the bottleneck in the
network, markedly improving the performance of the replicated systems.
Moreover, we illustrate the negative effect that changes in the query topics
over time have on the performance of a distributed clustered system. In
contrast, the performance of a distributed replicated system is query
independent. Keywords: distributed information retrieval; performance; simulation | |||
| A Study of a Weighting Scheme for Information Retrieval in Hierarchical Peer-to-Peer Networks | | BIBA | Full-Text | 136-147 | |
| Massimo Melucci; Alberto Poggiani | |||
| The experimental results show that the proposed simple weighting scheme helps retrieve a significant proportion of relevant data after traversing only a small portion of a hierarchical peer-to-peer network in a depth-first manner. A real, large, highly heterogeneous test collection searched by very short, ambiguous queries was used to support the results. The observed efficiency and effectiveness suggest implementations in, for instance, audio-video information retrieval systems, digital libraries or personal archives. | |||
| A Decision-Theoretic Model for Decentralised Query Routing in Hierarchical Peer-to-Peer Networks | | BIBA | Full-Text | 148-159 | |
| Henrik Nottelmann; Norbert Fuhr | |||
| Efficient and effective routing of content-based queries is an emerging problem in peer-to-peer networks, and can be seen as an extension of the traditional "resource selection" problem. The decision-theoretic framework for resource selection aims, in contrast to other approaches, at minimising overall costs including e.g. monetary costs, time and retrieval quality. A variant of this framework has been successfully applied to hierarchical peer-to-peer networks (where peers are partitioned into DL peers and hubs), but that approach considers retrieval quality only. This paper proposes a new model which is also capable of considering the time costs of hubs (i.e., the number of hops in subsequent steps). The evaluation on a large test-bed shows that this approach dramatically reduces the overall retrieval costs. | |||
| Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval | | BIBA | Full-Text | 160-172 | |
| Milad Shokouhi | |||
| Collection selection is one of the key problems in distributed information retrieval. Due to resource constraints it is not usually feasible to search all collections in response to a query. Therefore, the central component (broker) selects a limited number of collections to be searched for the submitted queries. During the past decade, several collection selection algorithms have been introduced. However, their performance varies on different testbeds. We propose a new collection-selection method based on the ranking of downloaded sample documents. We test our method on six testbeds and show that our technique can significantly outperform other state-of-the-art algorithms in most cases. We also introduce a new testbed based on the TREC GOV2 documents. | |||
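A minimal sketch of the flavour of central-rank-based selection (the reciprocal-rank credit and the toy data are assumptions, not the paper's exact scoring): the broker ranks its index of downloaded sample documents for the query and credits each collection according to the positions its samples occupy:

```python
def select_collections(central_ranking, sample_origin, k=3):
    """central_ranking: sample doc ids, best first.
    sample_origin: {sample doc id: collection id}."""
    scores = {}
    for rank, doc in enumerate(central_ranking, start=1):
        coll = sample_origin[doc]
        scores[coll] = scores.get(coll, 0.0) + 1.0 / rank  # assumed credit
    return sorted(scores, key=scores.get, reverse=True)[:k]

origin = {"s1": "A", "s2": "B", "s3": "A", "s4": "C"}
print(select_collections(["s3", "s1", "s4", "s2"], origin, k=2))  # ['A', 'C']
```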
| Results Merging Algorithm Using Multiple Regression Models | | BIBAK | Full-Text | 173-184 | |
| George Paltoglou; Michail Salampasis; Maria Satratzemi | |||
| This paper describes a new algorithm for merging the results of remote
collections in a distributed information retrieval environment. The algorithm
makes use only of the ranks of the returned documents, thus making it very
efficient in environments where the remote collections provide the minimum of
cooperation. Assuming that the correlation between the ranks and the relevancy
scores can be expressed through a logistic function, and using sampled
documents from the remote collections, the algorithm assigns local scores to
the returned
ranked documents. Subsequently, using a centralized sample collection and
through linear regression, it assigns global scores, thus producing a final
merged document list for the user. The algorithm's effectiveness is measured
against two state-of-the-art results merging algorithms and its performance is
found to be superior to them in environments where the remote collections do
not provide relevancy scores. Keywords: Distributed Information Retrieval; Results Merging; Algorithms | |||
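A minimal sketch of the two-stage idea with synthetic numbers (the estimation details differ from the paper's): a logistic curve maps ranks of sampled documents to local scores, and a linear regression then maps local scores onto the centralised sample collection's scale:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(rank, a, b):
    """Monotone map from rank to an estimated local relevance score."""
    return 1.0 / (1.0 + np.exp(a + b * rank))

# Sampled documents from one remote collection: their local rank, the score
# of the remote engine, and their score in the centralised sample index
# (all values synthetic).
sample_ranks  = np.array([1, 3, 7, 12, 25, 60], dtype=float)
sample_local  = np.array([0.95, 0.84, 0.66, 0.50, 0.27, 0.08])
sample_global = np.array([0.88, 0.74, 0.58, 0.41, 0.20, 0.04])

# Stage 1: rank -> local score (the remote engine returns only ranks).
(a, b), _ = curve_fit(logistic, sample_ranks, sample_local, p0=(-2.0, 0.1))

# Stage 2: local score -> global score via linear regression.
slope, intercept = np.polyfit(sample_local, sample_global, 1)

# Place unsampled returned ranks on the global scale and merge by score.
returned_ranks = np.arange(1, 11, dtype=float)
merged = slope * logistic(returned_ranks, a, b) + intercept
print(merged)  # globally comparable; interleave result lists by these scores
```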
| Segmentation of Search Engine Results for Effective Data-Fusion | | BIBA | Full-Text | 185-197 | |
| Milad Shokouhi | |||
| Metasearch and data-fusion techniques combine the rank lists of multiple
document retrieval systems with the aim of improving search coverage and
precision.
We propose a new fusion method that partitions the rank lists of document retrieval systems into chunks. The chunk size grows exponentially down the rank list. Using a small number of training queries, the probabilities of relevance of documents in different chunks are approximated for each search system. The estimated probabilities and normalized document scores are used to compute the final document ranks in the merged list. We show that our proposed method produces higher average precision values than previous systems across a range of testbeds. | |||
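A minimal sketch under assumed inputs (doubling chunk sizes; per-chunk probabilities supplied as if estimated from training queries); this is not the paper's exact estimation procedure:

```python
def chunk_index(rank):
    """Chunks cover ranks [1], [2-3], [4-7], [8-15], ...: size doubles."""
    return rank.bit_length() - 1

def fuse(run_lists, chunk_prob, scores):
    """run_lists: {system: ranked docs}; chunk_prob: {(system, chunk): p};
    scores: {(system, doc): normalised score}. Returns the merged ranking."""
    fused = {}
    for system, docs in run_lists.items():
        for rank, doc in enumerate(docs, start=1):
            p = chunk_prob.get((system, chunk_index(rank)), 0.0)
            fused[doc] = fused.get(doc, 0.0) + p * scores[(system, doc)]
    return sorted(fused, key=fused.get, reverse=True)

runs = {"sysA": ["d1", "d2", "d3", "d4"], "sysB": ["d2", "d5", "d1"]}
probs = {("sysA", 0): 0.8, ("sysA", 1): 0.5, ("sysA", 2): 0.3,
         ("sysB", 0): 0.6, ("sysB", 1): 0.4}
scores = {(s, d): 1.0 - 0.1 * r
          for s, ds in runs.items() for r, d in enumerate(ds)}
print(fuse(runs, probs, scores))  # d1 and d2, seen by both systems, lead
```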
| Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions | | BIBA | Full-Text | 198-209 | |
| Javed A. Aslam; Virgil Pavlu | |||
| We consider the issue of query performance, and we propose a novel method for automatically predicting the difficulty of a query. Unlike a number of existing techniques which are based on examining the ranked lists returned in response to perturbed versions of the query with respect to the given collection or perturbed versions of the collection with respect to the given query, our technique is based on examining the ranked lists returned by multiple scoring functions (retrieval engines) with respect to the given query and collection. In essence, we propose that the results returned by multiple retrieval engines will be relatively similar for "easy" queries but more diverse for "difficult" queries. By appropriately employing Jensen-Shannon divergence to measure the "diversity" of the returned results, we demonstrate a methodology for predicting query difficulty whose performance exceeds existing state-of-the-art techniques on TREC collections, often remarkably so. | |||
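A minimal sketch of the core measurement (the reciprocal-rank conversion of ranked lists into distributions is an assumption; the paper's construction differs in detail):

```python
import math

def rank_distribution(ranking, universe):
    """Weight documents by reciprocal rank; unretrieved docs get ~0 mass."""
    eps = 1e-9
    w = {d: eps for d in universe}
    for r, d in enumerate(ranking, start=1):
        w[d] += 1.0 / r
    z = sum(w.values())
    return {d: v / z for d, v in w.items()}

def js_divergence(dists):
    """Jensen-Shannon divergence of distributions over a shared universe."""
    universe = dists[0].keys()
    m = {d: sum(p[d] for p in dists) / len(dists) for d in universe}
    def kl(p, q):
        return sum(p[d] * math.log(p[d] / q[d]) for d in universe if p[d] > 0)
    return sum(kl(p, m) for p in dists) / len(dists)

runs = [["d1", "d2", "d3"], ["d1", "d2", "d4"], ["d5", "d1", "d2"]]
universe = {d for run in runs for d in run}
dists = [rank_distribution(r, universe) for r in runs]
print(js_divergence(dists))  # higher divergence => predicted harder query
```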
| Query Reformulation and Refinement Using NLP-Based Sentence Clustering | | BIBA | Full-Text | 210-221 | |
| Frédéric Roulland; Aaron Kaplan; Stefania Castellani; Claude Roux; Antonietta Grasso; Karin Pettersson; Jacki O'Neill | |||
| We have developed an interactive query refinement tool that helps users search a knowledge base for solutions to problems with electronic equipment. The system is targeted towards non-technical users, who are often unable to formulate precise problem descriptions on their own. Two distinct but interrelated functionalities support the refinement of a vague, non-technical initial query into a more precise problem description: a synonymy mechanism that allows the system to match non-technical words in the query with corresponding technical terms in the knowledge base, and a novel refinement mechanism that helps the user build up successively longer and more precise problem descriptions starting from the seed of the initial query. A natural language parser is used both in the application of context-sensitive synonymy rules and the construction of the refinement tree. | |||
| Automatic Morphological Query Expansion Using Analogy-Based Machine Learning | | BIBAK | Full-Text | 222-233 | |
| Fabienne Moreau; Vincent Claveau; Pascale Sébillot | |||
| Information retrieval systems (IRSs) usually suffer from a low ability to
recognize the same idea expressed in different forms. A way of improving
these systems is to take morphological variants into account. We propose here
a simple yet effective method to recognize these variants, which are then
used to enrich queries. In comparison with already published methods, our
system does not need any external resources or a priori knowledge and thus
supports many languages. This new approach is evaluated on several
collections covering six different languages, and is compared to existing
tools such as a stemmer and a lemmatizer. Reported results show a significant
and systematic improvement in overall IRS effectiveness in terms of both
precision and recall
for every language. Keywords: Morphological variation; query expansion; analogy-based machine learning;
unsupervised machine learning | |||
| Advanced Structural Representations for Question Classification and Answer Re-ranking | | BIBA | Full-Text | 234-245 | |
| Silvia Quarteroni; Alessandro Moschitti; Suresh Manandhar; Roberto Basili | |||
| In this paper, we study novel structures to represent information in three vital tasks in question answering: question classification, answer classification and answer reranking. We define a new tree structure called PAS to represent predicate-argument relations, as well as a new kernel function to exploit its representative power. Our experiments with Support Vector Machines and several tree kernel functions suggest that syntactic information helps specific tasks such as question classification, whereas, when data sparseness is higher, as in answer classification, studying coarse semantic information like PAS is a promising research area. | |||
| Incorporating Diversity and Density in Active Learning for Relevance Feedback | | BIBA | Full-Text | 246-257 | |
| Zuobing Xu; Ram Akella; Yi Zhang | |||
| Relevance feedback, which uses the terms in relevant documents to enrich the user's initial query, is an effective method for improving retrieval performance. A key associated research problem is the following: which documents should be presented to the user so that the user's feedback on them can most improve relevance feedback performance? This paper views this as an active learning problem and proposes a new algorithm which can efficiently maximize the learning benefits of relevance feedback. This algorithm chooses a set of feedback documents based on relevancy, document diversity and document density. Experimental results show a statistically significant and appreciable improvement in the performance of our new approach over the existing active feedback methods. | |||
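A minimal greedy sketch of combining the three criteria (the weights, cosine similarity, and greedy selection scheme are assumptions, not the paper's algorithm): each pick must score well, lie in a dense region, and differ from documents already chosen for feedback:

```python
import numpy as np

def select_feedback(docs, rel_scores, k=3, w_rel=0.6, w_den=0.2, w_div=0.2):
    """docs: (n, d) array of document vectors; rel_scores: initial scores."""
    unit = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                  # pairwise cosine similarities
    density = sim.mean(axis=1)           # central docs represent more others
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(docs)):
            if i in chosen:
                continue
            diversity = 1.0 - max((sim[i, j] for j in chosen), default=0.0)
            val = w_rel * rel_scores[i] + w_den * density[i] + w_div * diversity
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
print(select_feedback(rng.random((8, 5)), rng.random(8), k=3))
```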
| Relevance Feedback Using Weight Propagation Compared with Information-Theoretic Query Expansion | | BIBAK | Full-Text | 258-270 | |
| Fadi Yamout; Michael Oakes; John Tait | |||
| A new Relevance Feedback (RF) technique called Weight Propagation has been
developed which provides greater retrieval effectiveness and computational
efficiency than previously described techniques. Documents judged relevant by
the user propagate positive weights to documents close by in vector similarity
space, while documents judged not relevant propagate negative weights to such
neighbouring documents. Retrieval effectiveness is improved since the documents
are treated as independent vectors rather than being merged into a single
vector, as is the case with traditional vector model RF techniques, or by
determining the documents' relevancy based in part on the lengths of all the
documents, as with traditional probabilistic RF techniques. Improving the
computational efficiency of Relevance Feedback by considering only documents in
a given neighbourhood means that the Weight Propagation technique can be used
with large collections. Keywords: Relevance Feedback; Rocchio; Ide; Deviation From Randomness | |||
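A minimal sketch of the propagation step (the cosine kernel and single decay constant are assumptions): each judged document pushes a signed weight to its neighbours in vector space, instead of being folded into one Rocchio-style query vector:

```python
import numpy as np

def propagate(doc_vectors, judged, decay=1.0):
    """judged: {doc index: +1 relevant / -1 non-relevant}.
    Returns document indices in their new, re-ranked order."""
    unit = doc_vectors / (np.linalg.norm(doc_vectors, axis=1,
                                         keepdims=True) + 1e-12)
    weights = np.zeros(len(doc_vectors))
    for j, label in judged.items():
        weights += label * decay * (unit @ unit[j])  # cosine to judged doc
    weights[list(judged)] = -np.inf      # do not re-show judged documents
    return np.argsort(-weights)

rng = np.random.default_rng(1)
print(propagate(rng.random((10, 6)), {0: +1, 7: -1})[:5])
```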
| A Retrieval Evaluation Methodology for Incomplete Relevance Assessments | | BIBA | Full-Text | 271-282 | |
| Mark Baillie; Leif Azzopardi; Ian Ruthven | |||
| In this paper we propose an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. This new protocol aims to identify potential uncertainty during system comparison that may result from incompleteness. We demonstrate how this methodology can lead towards a finer-grained analysis of systems. This is advantageous because the detection of uncertainty during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections. | |||
| Evaluating Query-Independent Object Features for Relevancy Prediction | | BIBA | Full-Text | 283-294 | |
| Andres R. Masegosa; Hideo Joho; Joemon M. Jose | |||
| This paper presents a series of experiments investigating the effectiveness of query-independent features extracted from retrieved objects to predict relevancy. Features were grouped into a set of conceptual categories, and individually evaluated based on click-through data collected in a laboratory-setting user study. The results showed that while textual and visual features were useful for relevancy prediction in a topic-independent condition, a wider range of features was effective when topic knowledge was available. We also revisited the original study from the perspective of the significant features identified by our experiments. | |||
| The Utility of Information Extraction in the Classification of Books | | BIBAK | Full-Text | 295-306 | |
| Tom Betts; Maria Milosavljevic; Jon Oberlander | |||
| We describe work on automatically assigning classification labels to books
using the Library of Congress Classification scheme. This task is non-trivial
due to the volume and variety of books that exist. We explore the utility of
Information Extraction (IE) techniques within this text categorisation (TC)
task, automatically extracting structured information from the full text of
books. Experimental evaluation of performance involves a corpus of books from
Project Gutenberg. Results indicate that a classifier which combines methods
and tools from IE and TC significantly improves over a state-of-the-art text
classifier, achieving a classification performance of Fβ=1 = 0.8099. Keywords: Information Extraction; Named Entity Recognition; Book Categorisation;
Project Gutenberg; Ontologies; Digital Libraries | |||
| Combined Syntactic and Semantic Kernels for Text Classification | | BIBA | Full-Text | 307-318 | |
| Stephan Bloehdorn; Alessandro Moschitti | |||
| The exploitation of syntactic structures and semantic background knowledge has always been an appealing subject in the context of text retrieval and information management. The usefulness of this kind of information has been shown most prominently in highly specialized tasks, such as classification in Question Answering (QA) scenarios. So far, however, additional syntactic or semantic information has been used only individually. In this paper, we propose a principled approach for jointly exploiting both types of information. We propose a new type of kernel, the Semantic Syntactic Tree Kernel (SSTK), which incorporates linguistic structures, e.g. syntactic dependencies, and semantic background knowledge, e.g. term similarity based on WordNet, to automatically learn question categories in QA. We show the power of this approach in a series of experiments with a well known Question Classification dataset. | |||
| Fast Large-Scale Spectral Clustering by Sequential Shrinkage Optimization | | BIBA | Full-Text | 319-330 | |
| Tie-Yan Liu; Huai-Yuan Yang; Xin Zheng; Tao Qin; Wei-Ying Ma | |||
| In many applications, we need to cluster large-scale data objects. However, some recently proposed clustering algorithms such as spectral clustering can hardly handle large-scale applications due to the complexity issue, although their effectiveness has been demonstrated in previous work. In this paper, we propose a fast solver for spectral clustering. In contrast to traditional spectral clustering algorithms that first solve an eigenvalue decomposition problem, and then employ a clustering heuristic to obtain labels for the data points, our new approach sequentially decides the labels of relatively well-separated data points. Because the scale of the problem shrinks quickly during this process, it can be much faster than the traditional methods. Experiments on both synthetic data and a large collection of product records show that our algorithm can achieve significant improvement in speed as compared to traditional spectral clustering algorithms. | |||
| A Probabilistic Model for Clustering Text Documents with Multiple Fields | | BIBA | Full-Text | 331-342 | |
| Shanfeng Zhu; Ichigaku Takigawa; Shuqin Zhang; Hiroshi Mamitsuka | |||
| We address the problem of clustering documents with multiple fields, such as scientific literature with the distinct fields: title, abstract, keywords, main text and references. By taking into consideration the distinct word distributions of each field, we propose a new probabilistic model, the Field Independent Clustering Model (FICM), for clustering documents with multiple fields. The benefits of FICM come not only from integrating the discrimination abilities of each field but also from the power of selecting the most suitable component probabilistic model for each field. We examined the performance of FICM on the problem of clustering biomedical documents with three fields (title, abstract and MeSH). From the genomics track data of TREC 2004 and TREC 2005, we randomly generated 60 datasets where the number of classes in each dataset ranged from 3 to 12. By applying the appropriate configuration of generative models for each field, FICM outperformed a classical multinomial model in 59 out of the total 60 datasets, of which 47 were statistically significant at the 95% level, and FICM also outperformed a multivariate Bernoulli model in 52 out of the total 60 datasets, of which 36 were statistically significant at the 95% level. | |||
| Personalized Communities in a Distributed Recommender System | | BIBA | Full-Text | 343-355 | |
| Sylvain Castagnos; Anne Boyer | |||
| The amount of data in information systems increases exponentially, and it becomes more and more difficult to extract the most relevant information within a very short time. Among others, collaborative filtering processes help users to find interesting items by modeling their preferences and by comparing them with users having the same tastes. Nevertheless, there are many aspects to consider when implementing such a recommender system; we take into account the number of potential users and the confidential nature of some data. This paper introduces a new distributed recommender system based on a user-based filtering algorithm. Our model has been transposed for Peer-to-Peer architectures. It has been especially designed to deal with problems of scalability and privacy. Moreover, it adapts its prediction computations to the density of the user neighborhood. | |||
| Information Recovery and Discovery in Collaborative Web Search | | BIBA | Full-Text | 356-367 | |
| Maurice Coyle; Barry Smyth | |||
| When we search for information we are usually either trying to recover something that we have found in the past or trying to discover some new information. In this paper we will evaluate how the collaborative Web search technique, which personalizes search results for communities of like-minded users, can help in recovery- and discovery-type search tasks in a corporate search scenario. | |||
| Collaborative Filtering Based on Transitive Correlations Between Items | | BIBA | Full-Text | 368-380 | |
| Alexandros Nanopoulos | |||
| With existing collaborative filtering algorithms, a user has to rate a sufficient number of items before receiving reliable recommendations. To overcome this limitation, we provide the insight that correlations between items can form a network, in which we examine transitive correlations between items. The emergence of power laws in such networks signifies the existence of items with substantially more transitive correlations. The proposed algorithm finds highly correlative items and provides effective recommendations by adapting to user preferences. We also develop pruning criteria that reduce computation time. Detailed experimental results illustrate the superiority of the proposed method. | |||
| Entropy-Based Authorship Search in Large Document Collections | | BIBA | Full-Text | 381-392 | |
| Ying Zhao; Justin Zobel | |||
| The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic, and are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory. We propose relative entropy of style markers for ranking, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found from within large collections. Although effectiveness does degrade as collection size is increased, with even 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-the-art approaches in terms of the collection size and the number of candidate authors. | |||
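A minimal sketch using unigram tokens as stand-in style markers (the paper's markers and smoothing differ): candidate authors are ranked by the relative entropy (KL divergence) between the query document's marker distribution and each author's smoothed profile:

```python
import math
from collections import Counter

def distribution(tokens, vocab, mu=0.1):
    """Add-mu smoothed probability of each marker in `vocab`."""
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {t: (counts[t] + mu) / total for t in vocab}

def kl(p, q):
    """Relative entropy D(p || q); smoothing keeps q strictly positive."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def rank_authors(query_tokens, author_profiles):
    """author_profiles: {author: [training tokens]}. Lower KL = better."""
    vocab = set(query_tokens)
    for toks in author_profiles.values():
        vocab |= set(toks)
    q = distribution(query_tokens, vocab)
    scored = {a: kl(q, distribution(t, vocab))
              for a, t in author_profiles.items()}
    return sorted(scored, key=scored.get)

profiles = {"austen": "she was very happy indeed".split(),
            "twain": "he reckoned it was a mighty fine day".split()}
print(rank_authors("it was a fine day indeed".split(), profiles))
```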
| Use of Topicality and Information Measures to Improve Document Representation for Story Link Detection | | BIBA | Full-Text | 393-404 | |
| Chirag Shah; Koji Eguchi | |||
| Several information organization, access, and filtering systems can benefit from different kinds of document representations than those used in traditional Information Retrieval (IR). Topic Detection and Tracking (TDT) is an example of such a domain. In this paper we demonstrate that traditional methods for term weighting do not capture topical information, and that this leads to inadequate representation of documents for TDT applications. We present various hypotheses regarding the factors that can help in improving the document representation for Story Link Detection (SLD) -- a core task of TDT. These hypotheses are tested using various TDT corpora. From our experiments and analysis we found that in order to obtain a faithful representation of documents in the TDT domain, we not only need to capture a term's importance in the traditional IR sense, but also evaluate its topical behavior. Along with defining this behavior, we propose a novel measure that captures a term's importance at the corpus level as well as its discriminating power for topics. This new measure leads to a much better document representation, as reflected by the significant improvements in the results. | |||
| Ad Hoc Retrieval of Documents with Topical Opinion | | BIBA | Full-Text | 405-417 | |
| Jason Skomorowski; Olga Vechtomova | |||
| With a growing amount of subjective content distributed across the Web, there is a need for a domain-independent information retrieval system that would support ad hoc retrieval of documents expressing opinions on a specific topic of the user's query. In this paper we present a lightweight method for ad hoc retrieval of documents which contain subjective content on the topic of the query. Documents are ranked by the likelihood each document expresses an opinion on a query term, approximated as the likelihood any occurrence of the query term is modified by a subjective adjective. Domain-independent user-based evaluation of the proposed method was conducted, and shows statistically significant gains over the baseline system. | |||
| Probabilistic Models for Expert Finding | | BIBA | Full-Text | 418-430 | |
| Hui Fang; ChengXiang Zhai | |||
| A common task in many applications is to find persons who are knowledgeable about a given topic (i.e., expert finding). In this paper, we propose and develop a general probabilistic framework for studying the expert finding problem and derive two families of generative models (candidate generation models and topic generation models) from the framework. These models subsume most existing language models proposed for expert finding. We further propose several techniques to improve the estimation of the proposed models, including incorporating topic expansion, using a mixture model to model candidate mentions in the supporting documents, and defining an email count-based prior in the topic generation model. Our experiments show that the proposed estimation strategies are all effective in improving retrieval accuracy. | |||
| Using Relevance Feedback in Expert Search | | BIBA | Full-Text | 431-443 | |
| Craig Macdonald; Iadh Ounis | |||
| In Enterprise settings, expert search is considered an important task. In this search task, the user has a need for expertise -- for instance, they require assistance from someone about a topic of interest. An expert search system assists users with their "expertise need" by suggesting people with expertise relevant to the topic of interest. In this work, we apply an expert search approach that does not explicitly rank candidates in response to a query, but instead implicitly ranks candidates by taking into account a ranking of documents with respect to the query topic. Pseudo-relevance feedback, also known as query expansion, has been shown to improve retrieval performance in ad hoc search tasks. In this work, we investigate to what extent query expansion can be applied in an expert search task to improve the accuracy of the generated ranking of candidates. We define two approaches for query expansion: the first is based on the initial ranking of documents for the query topic; the second is based on the final ranking of candidates. The aims of this paper are two-fold: firstly, to determine if query expansion can be successfully applied in the expert search task, and secondly, to ascertain if either of the two forms of query expansion can provide robust, improved retrieval performance. We perform a thorough evaluation contrasting the two query expansion approaches in the context of the TREC 2005 and 2006 Enterprise tracks. | |||
| Using Topic Shifts for Focussed Access to XML Repositories | | BIBA | Full-Text | 444-455 | |
| Elham Ashoori; Mounia Lalmas | |||
| In focussed XML retrieval, a retrieval unit is an XML element that not only contains information relevant to a user query, but also is specific to the query. INEX defines a relevant element to be at the right level of granularity if it is exhaustive and specific to the user's request -- i.e., it discusses fully the topic requested in the user's query and no other topics. The exhaustivity and specificity dimensions are both expressed in terms of the "quantity" of topics discussed within each element. We therefore propose to use the number of topic shifts in an XML element to express the "quantity" of topics discussed in an element, as a means to capture specificity. We experimented with a number of element-specific smoothing methods within the language modelling framework. These methods enable us to adjust the amount of smoothing required for each XML element depending on its number of topic shifts, to capture specificity. Using the number of topic shifts combined with element length improves retrieval effectiveness, thus indicating that the number of topic shifts is useful evidence in focussed XML retrieval. | |||
| Feature- and Query-Based Table of Contents Generation for XML Documents | | BIBA | Full-Text | 456-467 | |
| Zoltán Szlávik; Anastasios Tombros; Mounia Lalmas | |||
| The availability of a document's logical structure in XML retrieval allows retrieval systems to return document portions (elements) instead of whole documents. This helps searchers focus their attention on the relevant content within a document. However, other elements, e.g. siblings or parents of retrieved elements, may also be important as they provide context to the retrieved elements. A table of contents (TOC) offers an overview of a document and shows the most important elements and their relations to each other. In this paper, we investigate what searchers think is important in automatic TOC generation. We ask searchers to indicate their preferences for element features (depth, length, relevance) in order to generate TOCs that help them complete information seeking tasks. We investigate what these preferences are, and what the characteristics of the TOCs generated by searchers' settings are. The results have implications for the design of intelligent TOC generation approaches for XML retrieval. | |||
| Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task | | BIBA | Full-Text | 468-480 | |
| Ben He; Iadh Ounis | |||
| Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained. | |||
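A minimal sketch in the style of DFR per-field term frequency normalisation (the "Normalisation 2" formula is standard in the DFR literature; the per-field hyper-parameters and combination weights below are illustrative, not the values derived in the paper):

```python
import math

def normalised_tf(tf, field_len, avg_field_len, c):
    """Normalisation 2: tfn = tf * log2(1 + c * avg_len / len)."""
    return tf * math.log2(1.0 + c * avg_field_len / field_len)

def combined_tfn(field_stats, c_params, weights):
    """field_stats: {field: (tf, field_len, avg_field_len)}; the per-field
    hyper-parameter c_f adjusts how term frequency relates to field length."""
    return sum(weights[f] * normalised_tf(tf, fl, avg, c_params[f])
               for f, (tf, fl, avg) in field_stats.items())

stats = {"body":   (4, 900.0, 750.0),
         "title":  (1, 6.0, 8.0),
         "anchor": (2, 30.0, 25.0)}
c_params = {"body": 1.0, "title": 10.0, "anchor": 5.0}   # assumed c values
weights  = {"body": 1.0, "title": 4.0, "anchor": 2.0}    # assumed weights
print(combined_tfn(stats, c_params, weights))
```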
| Combining Evidence for Relevance Criteria: A Framework and Experiments in Web Retrieval | | BIBAK | Full-Text | 481-493 | |
| Theodora Tsikrika; Mounia Lalmas | |||
| We present a framework that assesses relevance with respect to several
relevance criteria, by combining the query-dependent and query-independent
evidence indicating these criteria. This combination of evidence is modelled in
a uniform way, irrespective of whether the evidence is associated with a single
document or related documents. The framework is formally expressed within
Dempster-Shafer theory. It is evaluated for web retrieval in the context of
TREC's Topic Distillation task. Our results indicate that aggregating
content-based evidence from the linked pages of a page is beneficial, and that
the additional incorporation of their homepage evidence further improves the
effectiveness. Keywords: Dempster-Shafer theory; topic distillation; best entry point | |||
| Classifier Fusion for SVM-Based Multimedia Semantic Indexing | | BIBA | Full-Text | 494-504 | |
| Stéphane Ayache; Georges Quénot; Jérôme Gensel | |||
| Concept indexing in multimedia libraries is very useful for user searching and browsing, but it is also a very challenging research problem. Combining several modalities, features or concepts is one of the key issues for bridging the gap between signal and semantics. In this paper, we present three fusion schemes inspired by the classical early and late fusion schemes. First, we present a kernel-based fusion scheme which takes advantage of the kernel basis of classifiers such as SVMs. Second, we integrate a new normalization process into the early fusion scheme. Third, we present a contextual late fusion scheme to merge the classification scores of several concepts. We conducted experiments in the framework of the official TRECVID'06 evaluation campaign and we obtained significant improvements with the proposed fusion schemes relative to the usual fusion schemes. | |||
| Search of Spoken Documents Retrieves Well Recognized Transcripts | | BIBA | Full-Text | 505-516 | |
| Mark Sanderson; Xiao Mang Shou | |||
| This paper presents a series of analyses and experiments on spoken document retrieval systems: search engines that retrieve transcripts produced by speech recognizers. Results show that transcripts that match queries well tend to be recognized more accurately than transcripts that match a query less well. This result was described in past literature; however, no study or explanation of the effect has been provided until now. This paper provides such an analysis, showing a relationship between word error rate and query length. The paper expands on past research by increasing the number of recognition systems that are tested, as well as showing the effect in an operational speech retrieval system. Potential future lines of enquiry are also described. | |||
| Natural Language Processing for Usage Based Indexing of Web Resources | | BIBA | Full-Text | 517-524 | |
| Anne Boyer; Armelle Brun | |||
| The identification of reliable and interesting items on the Internet becomes more and more difficult and time-consuming. This position paper describes our intended work on multimedia information retrieval through browsing techniques within web navigation. It relies on a usage-based indexing of resources: we ignore the nature, the content and the structure of resources. We describe a new approach taking advantage of the similarity between statistical modeling of language and document retrieval systems. A syntax of usage is computed that defines a Statistical Grammar of Usage (SGU). An SGU enables resource classification for a personalized navigation assistant tool. It relies both on collaborative filtering, to compute virtual communities of users, and on classical statistical language models. The resulting SGU is a community-dependent SGU. | |||
| Harnessing Trust in Social Search | | BIBA | Full-Text | 525-532 | |
| Peter Briggs; Barry Smyth | |||
| The social Web emphasises the increased role of millions of users in the creation of a new type of online content, often expressed in the form of opinions or judgements. This has led to some novel approaches to information access that take advantage of user opinions and activities as a way to guide users as they browse or search for information. We describe a social search technique that harnesses the experiences of a network of searchers to generate result recommendations that can complement the search results that are returned by some standard Web search engine. | |||
| How to Compare Bilingual to Monolingual Cross-Language Information Retrieval | | BIBA | Full-Text | 533-540 | |
| Franco Crivellari; Giorgio Maria Di Nunzio; Nicola Ferro | |||
| The study of cross-lingual Information Retrieval Systems (IRSs) and a deep analysis of system performances should provide guidelines, hints, and directions to drive the design and development of the next generation MultiLingual Information Access (MLIA) systems. In addition, effective tools for interpreting and comparing the experimental results should be made easily available to the research community. To this end, we propose a twofold methodology for the evaluation of Cross Language Information Retrieval (CLIR) systems: statistical analyses to provide MLIA researchers with quantitative and more sophisticated analysis techniques; and graphical tools to allow for a more qualitative comparison and an easier presentation of the results. We provide concrete examples about how the proposed methodology can be applied by studying the monolingual and bilingual tasks of the Cross-Language Evaluation Forum (CLEF) 2005 and 2006 campaigns. | |||
| Multilingual Text Classification Using Ontologies | | BIBA | Full-Text | 541-548 | |
| Gerard de Melo; Stefan Siersdorfer | |||
| In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches. | |||
| Using Visual-Textual Mutual Information and Entropy for Inter-modal Document Indexing | | BIBAK | Full-Text | 549-556 | |
| Jean Martinet; Shin'ichi Satoh | |||
| This paper presents a contribution in the domain of automatic visual
document indexing based on inter-modal analysis, in the form of a statistical
indexing model. The approach is based on inter-modal document analysis, which
consists in modeling and learning some relationships between several modalities
from a data set of annotated documents in order to extract semantics. When one
of the modalities is textual, the learned associations can be used to predict a
textual index for visual data from a new document (image or video). More
specifically, the presented approach relies on a learning process in which
associations between visual and textual information are characterized by the
mutual information of the modalities. In addition, the model uses the information
entropy of the distribution of the visual modality against the textual modality
as a second source to select relevant indexing terms. We have implemented the
proposed information theoretic model, and the results of experiments assessing
its performance on two collections (image and video) show that information
theory is an interesting framework to automatically annotate documents. Keywords: Indexing model; mutual information; entropy; inter-modal analysis | |||
| A Study of Global Inference Algorithms in Multi-document Summarization | | BIBA | Full-Text | 557-564 | |
| Ryan McDonald | |||
| In this work we study the theoretical and empirical properties of various global inference algorithms for multi-document summarization. We start by defining a general framework for inference in summarization. We then present three algorithms: The first is a greedy approximate method, the second a dynamic programming approach based on solutions to the knapsack problem, and the third is an exact algorithm that uses an Integer Linear Programming formulation of the problem. We empirically evaluate all three algorithms and show that, relative to the exact solution, the dynamic programming algorithm provides near optimal results with preferable scaling properties. | |||
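A minimal sketch of the knapsack-style dynamic program only (the paper's objective also handles redundancy between selected sentences, which is omitted here; scores and lengths are made up):

```python
def knapsack_summary(sentences, scores, lengths, budget):
    """Classic 0/1 knapsack over sentences: maximise total score subject to
    a summary length budget. Returns the chosen sentence indices."""
    n = len(sentences)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                cand = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # Trace back which sentences were selected.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

sents = ["s0", "s1", "s2", "s3"]
print(knapsack_summary(sents, [3.0, 4.5, 2.0, 3.5], [40, 60, 30, 50], 100))
```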
| Document Representation Using Global Association Distance Model | | BIBA | Full-Text | 565-572 | |
| José E. Medina-Pagola; Ansel Y. Rodríguez; Abdel Hechavarría; José Hernández Palancar | |||
| Text information processing depends critically on the proper representation of documents. Traditional models, like the vector space model, have significant limitations because they do not consider semantic relations amongst terms. In this paper we analyze a document representation using the association graph scheme and present a new approach called the Global Association Distance Model (GADM). Finally, we compare GADM, using a k-NN classifier, with the classical vector space model and the association graph model. | |||
| Sentence Level Sentiment Analysis in the Presence of Conjuncts Using Linguistic Analysis | | BIBAK | Full-Text | 573-580 | |
| Arun Meena; T. V. Prabhakar | |||
| In this paper we present an approach to extract sentiments associated with a
phrase or sentence. Sentiment analysis has mostly been attempted for whole
documents, typically a review or a news item. Conjunctions have a substantial
impact on the overall sentiment of a sentence, so here we present how the
atomic sentiments of individual phrases combine in the presence of conjuncts
to decide
the overall sentiment of a sentence. We used word dependencies and dependency
trees to analyze the sentence constructs and were able to get results close to
80%. We have also analyzed the effect of WordNet, as compared with the General
Inquirer, on the accuracy of the results. Keywords: Sentiment analysis; favorability analysis; text mining; information
extraction; semantic orientation; text classification | |||
| PageRank: When Order Changes | | BIBA | Full-Text | 581-588 | |
| Massimo Melucci; Luca Pretto | |||
| As PageRank is a ranking algorithm, it is of prime interest to study the order induced by its values on webpages. In this paper a thorough mathematical analysis of PageRank-induced order changes when the damping factor varies is provided. Conditions that do not allow variations in the order are studied, and the mechanisms that make the order change are mathematically investigated. Moreover, the influence on the order of truncating the power-series computation of PageRank is analysed. Experiments carried out on a large Web digraph to complement the mathematical analysis show that PageRank, while working on a real digraph, tends to resist variations in the order of large rankings, presenting high stability in its induced order both in the face of large variations of the damping factor value and in the face of truncations in its computation. | |||
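A minimal sketch on a four-node toy digraph (not the paper's experimental setup): power-iteration PageRank at two damping factors, comparing the induced orders:

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """adj[i][j] = 1 if i links to j; dangling nodes jump uniformly."""
    n = len(adj)
    out = adj.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        spread = np.zeros(n)
        for i in range(n):
            if out[i] > 0:
                spread += p[i] * adj[i] / out[i]   # distribute along links
            else:
                spread += p[i] / n                 # dangling node
        p = (1 - d) / n + d * spread
    return p

adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
for d in (0.50, 0.85):
    print(d, np.argsort(-pagerank(adj, d)))  # compare the induced orders
```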
| Model Tree Learning for Query Term Weighting in Question Answering | | BIBA | Full-Text | 589-596 | |
| Christof Monz | |||
| Question answering systems rely on retrieval components to identify documents that contain an answer to a user's question. The formulation of queries that are used for retrieving those documents has a strong impact on the effectiveness of the retrieval component. Here, we focus on predicting the importance of terms from the original question. We use model tree machine learning techniques in order to assign weights to query terms according to their usefulness for identifying documents that contain an answer. Incorporating the learned weights into a state-of-the-art retrieval system results in statistically significant improvements. | |||
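Model trees as such are not available in common Python libraries; as a hedged stand-in, a plain regression tree can illustrate the idea of learning per-term weights from features (the feature set and target weights below are invented for illustration):

```python
# Sketch: learn a weight for each query term from term-level features.
# A regression tree stands in for the model tree used in the paper.
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features per term: [idf, is_noun, is_quoted, tf_in_question]
X = [[3.2, 1, 0, 1], [0.4, 0, 0, 2], [5.1, 1, 1, 1]]
y = [0.9, 0.1, 1.0]  # target usefulness weights (assumed given)

model = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(model.predict([[4.0, 1, 0, 1]]))  # weight for a new query term
```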
| Examining Repetition in User Search Behavior | | BIBA | Full-Text | 597-604 | |
| Mark Sanderson; Susan Dumais | |||
| This paper describes analyses of the repeated use of search engines. It is shown that users commonly re-issue queries, either to examine search results more deeply or simply to search again, often days or weeks later. Hourly and weekly periodicities in behavior are observed for both queries and clicks. Navigational queries were found to be repeated differently from others. | |||
| Popularity Weighted Ranking for Academic Digital Libraries | | BIBAK | Full-Text | 605-612 | |
| Yang Sun; C. Lee Giles | |||
| We propose a popularity weighted ranking algorithm for academic digital libraries that uses the popularity factor of a publication venue, overcoming the limitations of impact factors. We compare our method with naive PageRank, citation counts, and the HITS algorithm, three popular measures currently used to rank papers beyond lexical similarity. The ranking results are evaluated by the discounted cumulative gain (DCG) method using four human evaluators. We show that our proposed ranking algorithm improves the DCG performance by 8.5% on average compared to naive PageRank, 16.3% compared to citation counts, and 23.2% compared to HITS. The algorithm is also evaluated using click-through data from the CiteSeer usage log. Keywords: weighted ranking; citation analysis; digital library | |||
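For reference, one common formulation of DCG (the exact variant used in the paper is not specified here) can be computed as:

```python
import math

def dcg(relevances, k=None):
    # Discounted cumulative gain over graded judgments, log2 discount
    rels = relevances[:k] if k else relevances
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

# Hypothetical judgments for two rankings of the same result set
print(dcg([3, 2, 3, 0, 1]), dcg([2, 3, 0, 3, 1]))
```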
| Naming Functions for the Vector Space Model | | BIBA | Full-Text | 613-620 | |
| Yannis Tzitzikas; Yannis Theoharis | |||
| The Vector Space Model (VSM) is probably the most widely used model for retrieving information from text collections (and, recently, from other kinds of corpora). Assuming this model, we study the problem of finding the best query that "names" (or describes) a given (unordered or ordered) set of objects. We formulate several variations of this problem and provide methods and algorithms for solving them. | |||
| Effective Use of Semantic Structure in XML Retrieval | | BIBA | Full-Text | 621-628 | |
| Roelof van Zwol; Tim van Loosbroek | |||
| The objective of XML retrieval is to return relevant XML document fragments that answer a given user information need, by exploiting the document structure. The focus in this article is on automatically deriving and using semantic XML structure to enhance the retrieval performance of XML retrieval systems. Based on a naive approach to named entity detection, we discuss how the structure of an XML document can be enriched, using the Reuters-21578 news collection. Based on a retrieval performance experiment, we study the effect of the additional semantic structure on the retrieval performance of our XSee search engine for XML documents. The experiment provides some initial evidence that an XML retrieval system significantly benefits from having meaningful XML structure. | |||
| Searching Documents Based on Relevance and Type | | BIBA | Full-Text | 629-636 | |
| Jun Xu; Yunbo Cao; Hang Li; Nick Craswell; Yalou Huang | |||
| This paper extends previous work on document retrieval and document type classification, addressing the problem of 'typed search'. Specifically, given a query and a designated document type, the search system retrieves and ranks documents not only based on the relevance to the query, but also based on the likelihood of being the designated document type. The paper formalizes the problem in a general framework consisting of a 'relevance model' and a 'type model'. The relevance model indicates whether or not a document is relevant to a query. The type model indicates whether or not a document belongs to the designated document type. We consider three methods for combining the models: linear combination of scores, thresholding on the type score, and a hybrid of the previous two methods. We take course page search and instruction document search as examples and have conducted a series of experiments. Experimental results show that our proposed approaches can significantly outperform the baseline methods. | |||
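A minimal sketch of the three combination methods named in the abstract, assuming both model scores are normalised to [0, 1] (parameter names and values are illustrative):

```python
def combined_score(rel, typ, alpha=0.7, tau=0.5, mode="linear"):
    # rel: relevance-model score; typ: type-model score, both in [0, 1]
    if mode == "linear":        # linear combination of the two scores
        return alpha * rel + (1 - alpha) * typ
    if mode == "threshold":     # thresholding on the type score
        return rel if typ >= tau else 0.0
    # hybrid: threshold on the type score first, then combine linearly
    return alpha * rel + (1 - alpha) * typ if typ >= tau else 0.0
```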
| Investigation of the Effectiveness of Cross-Media Indexing | | BIBA | Full-Text | 637-644 | |
| Murat Yakici; Fabio Crestani | |||
| Cross-media analysis and indexing leverage the individual potential of the indexing information provided by different modalities, such as speech, text and image, to improve the effectiveness of information retrieval and filtering in later stages. The process not only generates a merged representation of the digital content, such as MPEG-7, but also enriches it in order to help remedy the imprecision and noise introduced during the low-level analysis phases. It has been hypothesized that a system that combines different media descriptions of the same multi-modal audio-visual segment in a semantic space will perform better at retrieval and filtering time. In order to validate this hypothesis, we have developed a cross-media indexing system which utilises the Multiple Evidence approach by establishing links among the modality-specific textual descriptions in order to capture topical similarity. | |||
| Improve Ranking by Using Image Information | | BIBAK | Full-Text | 645-652 | |
| Qing Yu; Shuming Shi; Zhiwei Li; Ji-Rong Wen; Wei-Ying Ma | |||
| This paper explores the feasibility of including image information embedded in Web pages in relevance computation to improve search performance. In determining the ranking of Web pages against a given query, most (if not all) modern Web search engines consider two kinds of factors: text information (including title, URL, body text, anchor text, etc.) and static ranking (e.g. PageRank [1]). Although images have been widely used to help represent Web pages and carry valuable information, little work has been done to take advantage of them in computing the relevance score of a Web page given a query. We propose, in this paper, a framework to incorporate image information in ranking functions. Preliminary experimental results show that, when image information is used properly, ranking results can be improved. Keywords: Web search; image information; image importance; relevance | |||
| N-Step PageRank for Web Search | | BIBA | Full-Text | 653-660 | |
| Li Zhang; Tao Qin; Tie-Yan Liu; Ying Bao; Hang Li | |||
| PageRank has been widely used to measure the importance of web pages based on their interconnections in the web graph. Mathematically speaking, PageRank can be explained using a Markov random walk model, in which only the direct outlinks of a page contribute to its transition probability. In this paper, we propose improving the PageRank algorithm by looking N steps ahead when constructing the transition probability matrix. The motivation comes from the similar "looking N steps ahead" strategy that is successfully used in computer chess. Specifically, we assume that if the random surfer knows the N-step outlinks of each web page, he/she can make a better decision on choosing which page to navigate to next. It is clear that the classical PageRank algorithm is a special case of our proposed N-step PageRank method. Experimental results on the dataset of the TREC Web track show that our proposed algorithm can boost the search accuracy of classical PageRank by more than 15% in terms of mean average precision. | |||
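One plausible reading of the N-step idea (the paper's exact construction may differ) is to blend the 1- to N-step transition matrices before running the usual damped iteration:

```python
import numpy as np

def n_step_pagerank(P, N=2, d=0.85, iters=100):
    # P: row-stochastic transition matrix. Blend 1..N-step transitions so
    # the surfer effectively "sees" N steps ahead (a hedged interpretation).
    n = P.shape[0]
    M = sum(np.linalg.matrix_power(P, k) for k in range(1, N + 1)) / N
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (r @ M)
    return r  # N = 1 recovers classical PageRank
```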
| Authorship Attribution Via Combination of Evidence | | BIBA | Full-Text | 661-669 | |
| Ying Zhao; Phil Vines | |||
| Authorship attribution is the process of determining who wrote a particular document. We have found that different systems work well for particular sets of authors but not others. In this paper, we propose three authorship attribution systems, based on different ways of combining existing methodologies. All systems show better effectiveness than the state-of-the-art methods. | |||
| Cross-Document Entity Tracking | | BIBA | Full-Text | 670-673 | |
| Roxana Angheluta; Marie-Francine Moens | |||
| The main focus of the current work is to analyze useful features for linking and disambiguating person entities across documents. The more general problem of linking and disambiguating any kind of entity is known as entity detection and tracking (EDT) or noun phrase coreference resolution. EDT has applications in many important areas of information retrieval: clustering results in search engines when looking for a particular person; answering questions such as "Who was Woodward's source in the Plame scandal?" with "senior administration official" or "Richard Armitage"; and fusing information from multiple documents. In the current work, person entities are limited to names and nominal entities. We emphasize the linguistic aspect of cross-document EDT: testing novel features useful in EDT across documents, such as the syntactic and semantic characteristics of the entities. The most important class of new features are contextual features, at varying levels of detail: events, related named entities, and local context. The validity of the features is evaluated on a corpus annotated for cross-document coreference resolution of person names and nominals, and also on a corpus annotated only for names. | |||
| Enterprise People and Skill Discovery Using Tolerant Retrieval and Visualization | | BIBA | Full-Text | 674-677 | |
| Jan Brunnert; Omar Alonso; Dirk Riehle | |||
| Understanding an enterprise's workforce and skill-set can be seen as the key to understanding an organization's capabilities. In today's large organizations it has become increasingly difficult to find people that have specific skills or expertise or to explore and understand the overall picture of an organization's portfolio of topic expertise. This article presents a case study of analyzing and visualizing such expertise with the goal of enabling human users to assess and quickly find people with a desired skill set. Our approach is based on techniques like n-grams, clustering, and visualization for improving the user search experience for people and skills. | |||
| Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection | | BIBAK | Full-Text | 678-681 | |
| Marta Capdevila Dalmau; Oscar W. Márquez Flórez | |||
| Distributional Clustering has been shown to be an effective and powerful approach to supervised term extraction aimed at reducing the original indexing space dimensionality for Automatic Text Categorization [2]. In a recent paper [1] we introduced a new Signal Processing approach to Distributional Clustering which reached categorization results on the 20 Newsgroups dataset similar to those obtained by other information-theoretic approaches [3][4][5]. Here we re-validate our method by showing that the 90-category Reuters-21578 benchmark collection can be indexed with a minimal loss of categorization accuracy (around 2% with a Naïve Bayes categorizer) using only 50 clusters. Keywords: Automatic text categorization; Distributional clustering; Signal processing; Variance; Correlation coefficient | |||
| Overall Comparison at the Standard Levels of Recall of Multiple Retrieval Methods with the Friedman Test | | BIBA | Full-Text | 682-685 | |
| José M. Casanova; Manuel A. Presedo Quindimil; Álvaro Barreiro | |||
| We propose a new application of the Friedman statistical test of significance to compare multiple retrieval methods. After measuring the average precision at the eleven standard levels of recall, our application of the Friedman test provides a global comparison of the methods. In some experiments this test provides additional and useful information to decide if methods are different. | |||
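A sketch of such a comparison with SciPy, treating the eleven standard recall levels as the repeated measurements (the precision values below are invented for illustration):

```python
from scipy.stats import friedmanchisquare

# Average precision at the 11 standard recall levels, one list per method
method_a = [0.81, 0.74, 0.69, 0.62, 0.55, 0.49, 0.41, 0.33, 0.25, 0.17, 0.09]
method_b = [0.78, 0.71, 0.64, 0.58, 0.52, 0.44, 0.38, 0.30, 0.22, 0.14, 0.07]
method_c = [0.84, 0.77, 0.72, 0.66, 0.58, 0.51, 0.44, 0.35, 0.27, 0.19, 0.11]

stat, p = friedmanchisquare(method_a, method_b, method_c)
print(stat, p)  # a small p-value suggests the methods differ globally
```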
| Building a Desktop Search Test-Bed | | BIBA | Full-Text | 686-690 | |
| Sergey Chernov; Pavel Serdyukov; Paul-Alexandru Chirita; Gianluca Demartini; Wolfgang Nejdl | |||
| In recent years, several top-quality papers have utilized temporary Desktop data and/or browsing activity logs for experimental evaluation. Building a common testbed for the Personal Information Management community is thus becoming an indispensable task. In this paper we present a possible dataset design and discuss the means to create it. | |||
| Hierarchical Browsing of Video Key Frames | | BIBA | Full-Text | 691-694 | |
| Gianluigi Ciocca; Raimondo Schettini | |||
| We propose an innovative, general-purpose method for the selection and hierarchical representation of key frames of a video sequence for video summarization. It is able to create a hierarchical storyboard that the user may easily browse. The method is composed of three steps. The first removes meaningless key frames, using supervised classification performed by a neural network on the basis of pictorial features and a visual attention model algorithm. The second step groups the key frames into clusters to allow a multilevel summary using both low-level and high-level features. The third step identifies the default summary level that is shown to the users: starting from this set of key frames, the users can then browse the video content at different levels of detail. | |||
| Active Learning with History-Based Query Selection for Text Categorisation | | BIBA | Full-Text | 695-698 | |
| Michael Davy; Saturnino Luz | |||
| Automated text categorisation systems learn a generalised hypothesis from large numbers of labelled examples. However, in many domains labelled data is scarce and expensive to obtain. Active learning is a technique that has been shown to reduce the amount of training data required to produce an accurate hypothesis. This paper proposes a novel method of incorporating predictions made in previous iterations of active learning into the selection of informative unlabelled examples. We show empirically how this method can lead to increased classification accuracy compared to alternative techniques. | |||
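One plausible reading of history-based selection (the paper's criterion may differ) is to boost the uncertainty-sampling score of examples whose predicted labels have flipped across iterations:

```python
# Hypothetical sketch: uncertainty sampling plus a label-instability bonus.
def select_query(pool_probs, history, k=1):
    # pool_probs: {example_id: P(positive)} from the current classifier
    # history: {example_id: predicted labels from past iterations}
    def score(eid):
        uncertainty = 1.0 - 2.0 * abs(pool_probs[eid] - 0.5)
        flips = sum(a != b for a, b in zip(history[eid], history[eid][1:]))
        return uncertainty + 0.1 * flips
    return sorted(pool_probs, key=score, reverse=True)[:k]
```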
| Fighting Link Spam with a Two-Stage Ranking Strategy | | BIBA | Full-Text | 699-702 | |
| Guang-Gang Geng; Chun-Heng Wang; Qiu-Dan Li; Yuan-Ping Zhu | |||
| Most existing techniques for combating web spam focus on spam detection itself, separated from the ranking process. In this paper, we propose a two-stage ranking strategy, which makes good use of hyperlink information among Websites and a Website's internal structure information. The proposed method incorporates web spam detection into the ranking process and penalizes the ranking score of potential spam pages, instead of removing them arbitrarily. Preliminary experimental results show that our method is feasible and effective. | |||
| Improving Naive Bayes Text Classifier Using Smoothing Methods | | BIBA | Full-Text | 703-707 | |
| Feng He; Xiaoqing Ding | |||
| The performance of the naive Bayes text classifier is greatly influenced by parameter estimation, while a large vocabulary and a scarce labeled training set make parameter estimation difficult. In this paper, several smoothing methods are introduced to estimate parameters in the naive Bayes text classifier. The proposed approaches achieve better and more stable performance than Laplace smoothing. | |||
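For orientation, here are two standard smoothing estimates for P(word | class) in a multinomial naive Bayes classifier (the paper's specific methods may differ):

```python
# Laplace (add-alpha) smoothing for multinomial naive Bayes
def laplace(count_wc, count_c, vocab_size, alpha=1.0):
    return (count_wc + alpha) / (count_c + alpha * vocab_size)

# Jelinek-Mercer: interpolate the class estimate with a background model
def jelinek_mercer(count_wc, count_c, p_w_background, lam=0.7):
    mle = count_wc / count_c if count_c else 0.0
    return lam * mle + (1 - lam) * p_w_background
```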
| Term Selection and Query Operations for Video Retrieval | | BIBA | Full-Text | 708-711 | |
| Bouke Huurnink; Maarten de Rijke | |||
| We investigate the influence of term selection and query operations on the text retrieval component of video search. Our main finding is that the greatest gain is to be found in the combination of character n-grams, stemmed text, and proximity terms. | |||
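For illustration, character n-grams of the kind referred to above can be produced as follows (a minimal sketch, not the authors' tokenisation):

```python
def char_ngrams(text, n=4):
    # Overlapping character n-grams; spaces kept as explicit markers
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("news anchor"))  # ['news', 'ews_', 'ws_a', ...]
```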
| An Effective Threshold-Based Neighbor Selection in Collaborative Filtering | | BIBA | Full-Text | 712-715 | |
| Taek-Hun Kim; Sung-Bong Yang | |||
| In this paper we present a recommender system using an effective threshold-based neighbor selection in collaborative filtering. The proposed method uses substitute neighbors for test customers who may have unusual preferences or who are first raters. The experimental results show that recommender systems using the proposed method find the proper neighbors and yield good prediction quality. | |||
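The basic thresholding step (without the paper's substitute-neighbor refinement) might look like this, using Pearson correlation as the similarity; the ratings data is invented:

```python
import numpy as np

def select_neighbors(target, others, threshold=0.5):
    # Keep users whose Pearson correlation with the target exceeds a threshold
    neighbors = []
    for uid, ratings in others.items():
        r = np.corrcoef(target, ratings)[0, 1]
        if r > threshold:
            neighbors.append((uid, r))
    return neighbors

target = [5, 3, 4, 4]
others = {"u1": [3, 1, 2, 3], "u2": [4, 3, 4, 3], "u3": [1, 5, 5, 2]}
print(select_neighbors(target, others))
```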
| Combining Multiple Sources of Evidence in XML Multimedia Documents: An Inference Network Incorporating Element Language Models | | BIBA | Full-Text | 716-719 | |
| Zhigang Kong; Mounia Lalmas | |||
| This work makes use of the semantic structure and logical structure in XML documents, and their combination to represent and retrieve XML multimedia content. We develop a Bayesian network incorporating element language models for the retrieval of a mixture of text and image. In addition, an element-based collection language model is used in the element language model smoothing. The proposed approach was successfully evaluated on the INEX 2005 multimedia data set. | |||
| Language Model Based Query Classification | | BIBA | Full-Text | 720-723 | |
| Andreas Merkel; Dietrich Klakow | |||
| In this paper we propose a new way of using language models in query classification for question answering systems. We used a Bayes classifier as classification paradigm. Experimental results show that our approach outperforms current classification methods like Naive Bayes and SVM. | |||
| Integration of Text and Audio Features for Genre Classification in Music Information Retrieval | | BIBA | Full-Text | 724-727 | |
| Robert Neumayer; Andreas Rauber | |||
| Multimedia content can be described in versatile ways, as its essence is not limited to one view. For music data these multiple views could be a song's audio features as well as its lyrics. Both of these modalities have their advantages: text may be easier to search in and could cover more of the 'content semantics' of a song, while omitting other types of semantic categorisation. (Psycho)acoustic feature sets, on the other hand, provide the means to identify tracks that 'sound similar', while offering less support for other kinds of semantic categorisation. These discerning characteristics of different feature sets meet users' differing information needs. We will explain the nature of text and audio feature sets which describe the same audio tracks. Moreover, we will propose the use of textual data on top of low-level audio features for music genre classification. Further, we will show the impact of different combinations of audio features and textual features based on content words. | |||
| Retrieval Method for Video Content in Different Format Based on Spatiotemporal Features | | BIBA | Full-Text | 728-731 | |
| Xuefeng Pan; Jintao Li; Yongdong Zhang; Sheng Tang; Juan Cao | |||
| In this paper a robust video content retrieval method based on spatiotemporal features is proposed. To date, most video retrieval methods use characteristics of video key frames. Such frame-based methods are not robust enough across different video formats. With our method, the temporal variation of visual information is represented using spatiotemporal slices. The DCT is then used to extract features from each slice. With this kind of feature, a robust video content retrieval algorithm is developed. The experimental results show that the proposed feature is robust across video formats. | |||
| Combination of Document Priors in Web Information Retrieval | | BIBA | Full-Text | 732-736 | |
| Jie Peng; Iadh Ounis | |||
| Query independent features (also called document priors), such as the number of incoming links to a document, its PageRank, or the length of its associated URL, have been explored to boost the retrieval effectiveness of Web Information Retrieval (IR) systems. The combination of such query independent features could further enhance the retrieval performance. However, most current combination approaches are based on heuristics, which ignore the possible dependence between the document priors. In this paper, we present a novel and robust method for combining document priors in a principled way. We use a conditional probability rule, which is derived from Kolmogorov's axioms. In particular, we investigate the retrieval performance attainable by our combination of priors method, in comparison to the use of single priors and a heuristic prior combination method. Furthermore, we examine when and how document priors should be combined. | |||
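The contrast with the heuristic approach can be sketched in a few lines; the prior names are placeholders, and the chain-rule decomposition is one plausible reading of the conditional-probability rule mentioned above:

```python
# Principled: P(inlink, url) = P(inlink) * P(url | inlink) (chain rule),
# which accounts for dependence between the two priors
def joint_prior(p_inlink, p_url_given_inlink):
    return p_inlink * p_url_given_inlink

# Heuristic: multiply marginals, implicitly assuming prior independence
def independent_prior(p_inlink, p_url):
    return p_inlink * p_url
```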
| Enhancing Expert Search Through Query Modeling | | BIBA | Full-Text | 737-740 | |
| Pavel Serdyukov; Sergey Chernov; Wolfgang Nejdl | |||
| Expert finding is a very common task among enterprise search activities, yet its typical retrieval performance falls far short of Web search quality. Query modeling helps to improve traditional document retrieval, so we propose to apply it in this new setting. We adopt a general language modeling framework for expert finding and show how expert language models can be used for advanced query modeling. A preliminary experimental evaluation on the TREC Enterprise Track 2006 collection shows that our method improves retrieval precision on the expert finding task. | |||
| A Hierarchical Consensus Architecture for Robust Document Clustering | | BIBA | Full-Text | 741-744 | |
| Xavier Sevillano; Germán Cobo; Francesc Alías; Joan Claudi Socoró | |||
| A major problem encountered by text clustering practitioners is the difficulty of determining a priori which is the optimal text representation and clustering technique for a given clustering problem. As a step towards building robust document partitioning systems, we present a strategy based on a hierarchical consensus clustering architecture that operates on a wide diversity of document representations and partitions. The conducted experiments show that the proposed method is capable of yielding a consensus clustering that is comparable to the best individual clustering available even in the presence of a large number of poor individual labelings, outperforming classic non-hierarchical consensus approaches in terms of performance and computational cost. | |||
| Summarisation and Novelty: An Experimental Investigation | | BIBA | Full-Text | 745-748 | |
| Simon Sweeney; Fabio Crestani; David E. Losada | |||
| The continued development of mobile device technologies, their supporting infrastructures and associated services is important to meet the anytime, anywhere information access demands of today's users. The growing need to deliver information on request, in a form that can be readily and easily digested on the move, continues to be a challenge. | |||
| A Layered Approach to Context-Dependent User Modelling | | BIBAK | Full-Text | 749-752 | |
| Elena Vildjiounaite; Sanna Kallio | |||
| This work presents a method for the explicit acquisition of context-dependent user preferences (preferences which change depending on the user's situation, e.g., higher interest in outdoor activities if it is sunny than if it is raining) for a Smart Home -- an intelligent environment which recognises the contexts of its inhabitants (such as the presence of people, activities, events, weather, etc.) via home and mobile devices and provides personalized proactive support to its users. Since the set of personally important situations which affect user preferences is user-dependent, and since many situations can be described only in fuzzy terms, we provide users with an easy way to develop a personal context ontology and to map it fuzzily into a common ontology via a GUI. Backward mapping, by estimating the probability of occurrence of a user-defined situation, allows retrieval of preferences from all components of the user model. Keywords: User Model; Context Awareness; Smart Home | |||
| A Bayesian Approach for Learning Document Type Relevance | | BIBA | Full-Text | 753-756 | |
| Peter C. K. Yeung; Stefan Büttcher; Charles L. A. Clarke; Maheedhar Kolla | |||
| Retrieval accuracy can be improved by considering which document type should be filtered out and which should be ranked higher in the result list. Hence, document type can be used as a key factor for building a re-ranking retrieval model. We take a simple approach to considering document type in the retrieval process. We adapt the BM25 scoring function to weight term frequency based on the document type and take a Bayesian approach to estimate the appropriate weight for each type. Experimental results show that our approach improves search precision by as much as 19%. | |||
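A hedged sketch of the adaptation: the type weight is assumed to come from the Bayesian estimation step, and the exact placement of the weight inside BM25 is illustrative rather than the paper's formula.

```python
import math

def bm25_term(tf, df, N, doc_len, avg_len, type_weight, k1=1.2, b=0.75):
    # BM25 term score with tf scaled by a learned document-type weight
    wtf = type_weight * tf
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * wtf * (k1 + 1) / (wtf + k1 * (1 - b + b * doc_len / avg_len))
```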