
Proceedings of the 2009 Conference on Theory of Information Retrieval

Fullname: ICTIR 2009: Advances in Information Retrieval Theory: Second International Conference on the Theory of Information Retrieval
Editors: Leif Azzopardi; Gabriella Kazai; Stephen Robertson; Stefan Rüger; Milad Shokouhi; Dawei Song; Emine Yilmaz
Location: Cambridge, United Kingdom
Dates: 2009-Sep-10 to 2009-Sep-12
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 5766
Standard No: DOI: 10.1007/978-3-642-04417-5; hcibib: ICTIR09; ISBN: 978-3-642-04416-8 (print), 978-3-642-04417-5 (online)
Links: Online Proceedings
  1. Invited Talk
  2. Efficiency
  3. Retrieval Models
  4. Query and Term Models
  5. Evaluation
  6. Novelty and Diversity
  7. New Perspectives in IR
  8. Retrieval Models
  9. User Aspects
  10. Web Models
  11. Posters

Invited Talk

Is There Something Quantum-Like about the Human Mental Lexicon? BIBAFull-Text 1
  Peter Bruza
This talk proceeds from the premise that IR should engage in a more substantial dialogue with cognitive science. After all, how users decide relevance, or how they choose terms to modify a query, are processes rooted in human cognition. Recently, there has been a growing literature applying quantum theory (QT) to model cognitive phenomena. This talk will survey recent research, in particular, modelling interference effects in human decision making. One aspect of QT will be illustrated -- how quantum entanglement can be used to model word associations in human memory. The implications of this will be briefly discussed in terms of a new approach for modelling concept combinations. Tentative links to human abductive reasoning will also be drawn. The basic theme behind this talk is that QT can potentially provide a new genre of information processing models (including search) more aligned with human cognition.

Efficiency

Probably Approximately Correct Search BIBAFull-Text 2-16
  Ingemar J. Cox; Ruoxun Fu; Lars Kai Hansen
We consider the problem of searching a document collection using a set of independent computers. That is, the computers do not cooperate with one another either (i) to acquire their local index of documents or (ii) during the retrieval of a document. During the acquisition phase, each computer is assumed to randomly sample a subset of the entire collection. During retrieval, the query is issued to a random subset of computers, each of which returns its results to the query-issuer, who consolidates the results. We examine how the number of computers, and the fraction of the collection that each computer indexes, affect performance in comparison to a traditional deterministic configuration. We provide analytic formulae that, given the number of computers and the fraction of the collection each computer indexes, give the probability of an approximately correct search, where a "correct search" is defined to be the result of a deterministic search on the entire collection. We show that the randomized distributed search algorithm can have acceptable performance under a range of parameter settings. Simulation results confirm our analysis.
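The flavour of the analysis can be conveyed with a toy calculation (a sketch of one special case, not the paper's actual formulae): if each machine independently indexes a random fraction p of the collection and a query is issued to k machines, the chance that a particular document is visible to at least one of them is 1 - (1 - p)^k.

```python
import random

def prob_covered(p, k):
    """Probability that a given document is indexed by at least one
    of the k randomly chosen machines that receive the query, when
    each machine independently indexes a random fraction p of the
    collection. An illustrative special case only."""
    return 1.0 - (1.0 - p) ** k

def simulate(p, k, trials=100_000, seed=0):
    """Monte Carlo check of the closed form."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < p for _ in range(k))
               for _ in range(trials))
    return hits / trials
```

With p = 0.1 and k = 20 machines queried, a document is covered with probability about 0.88, which hints at why a non-cooperative randomized configuration can still approximate a deterministic search.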
PageRank: Splitting Homogeneous Singular Linear Systems of Index One BIBAFull-Text 17-28
  Douglas V. de Jager; Jeremy T. Bradley
The PageRank algorithm is used today within web information retrieval to provide a content-neutral ranking metric over web pages. It employs power method iterations to solve for the steady-state vector of a discrete-time Markov chain (DTMC). The defining one-step probability transition matrix of this DTMC is derived from the hyperlink structure of the web and a model of web surfing behaviour which accounts for user bookmarks and memorised URLs.
   In this paper we look to provide a more accessible, more broadly applicable explanation than has been given in the literature of how to make PageRank calculation more tractable through removal of the dangling-page matrix. This allows web pages without outgoing links to be removed before we employ power method iterations. It also allows decomposition of the problem according to irreducible subcomponents of the original transition matrix. Our explanation also covers a PageRank extension to accommodate TrustRank. In setting out our alternative explanation, we introduce and apply a general linear algebraic theorem which allows us to map homogeneous singular linear systems of index one to inhomogeneous non-singular linear systems with a shared solution vector. As an aside, we show in this paper that irreducibility is not required for PageRank to be well-defined.
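For orientation, a minimal power-iteration sketch with the usual uniform treatment of dangling pages (illustrative only; the paper's contribution is precisely how the dangling-page matrix can be removed before iterating):

```python
def pagerank(links, n, d=0.85, tol=1e-10, max_iter=1000):
    """Basic PageRank via power method iterations.

    links : dict mapping page -> list of pages it links to (0..n-1)
    d     : damping factor, the probability of following a link
            rather than jumping to a bookmark or memorised URL

    Dangling pages (no outgoing links) have their mass spread
    uniformly over all pages, which is the role of the dangling-page
    matrix discussed in the paper.
    """
    x = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - d) / n] * n
        for page in range(n):
            out = links.get(page, [])
            if out:
                share = d * x[page] / len(out)
                for q in out:
                    new[q] += share
            else:  # dangling page: uniform redistribution
                for q in range(n):
                    new[q] += d * x[page] / n
        if sum(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x
```

On the tiny graph 0 -> 1, 1 -> {0, 2}, with page 2 dangling, the scores sum to one and page 1, which receives links from both others, ranks highest.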
Training Data Cleaning for Text Classification BIBAFull-Text 29-41
  Andrea Esuli; Fabrizio Sebastiani
In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists of devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has misclassified them, thereby providing a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods, we present three different techniques for performing TDC and, on two widely used TC benchmarks, evaluate them by their capability of spotting misclassified texts purposefully inserted in the training set.

Retrieval Models

Semi-parametric and Non-parametric Term Weighting for Information Retrieval BIBAFull-Text 42-53
  Donald Metzler; Hugo Zaragoza
Most of the previous research on term weighting for information retrieval has focused on developing specialized parametric term weighting functions. Examples include TF.IDF vector-space formulations, BM25, and language modeling weighting. Each of these term weighting functions takes on a specific parametric form. While these weighting functions have proven to be highly effective, they impose strict constraints on the functional form of the term weights. Such constraints may possibly degrade retrieval effectiveness. In this paper we propose two new classes of term weighting schemes that we call semi-parametric and non-parametric weighting. These weighting schemes make fewer assumptions about the underlying term weights and allow the data to speak for itself. We argue that these robust weighting schemes have the potential to be significantly more effective compared to existing parametric schemes, especially with the growing amount of training data becoming available.
Bridging Language Modeling and Divergence from Randomness Models: A Log-Logistic Model for IR BIBAFull-Text 54-65
  Stéphane Clinchant; Eric Gaussier
In this paper we revisit the Divergence from Randomness (DFR) approach to Information Retrieval (IR), so as to better understand the different contributions it relies on, and thus be able to simplify it. To do so, we first introduce an analytical characterization of heuristic retrieval constraints and review several DFR models with respect to this characterization. This review shows that the first normalization principle of DFR is necessary to make the model compliant with retrieval constraints. We then show that the log-logistic distribution can be used to derive a simplified DFR model. Interestingly, this simplified model contains Language Models (LM) with Jelinek-Mercer smoothing. The relation we propose here is, to our knowledge, the first connection between the DFR and LM approaches. Lastly, we present experimental results obtained on several standard collections which validate the simplification and the model we propose.
Ordinal Regression Based Model for Personalized Information Retrieval BIBAKFull-Text 66-78
  Mohamed Farah
Retrieving relevant items in response to a user query is the aim of every information retrieval system. But 'without an understanding of what relevance means to users, it is difficult to imagine how a system can retrieve relevant information for users' [1]. In this paper, we try to capture what relevance is for a particular user and model his profile implicitly, considering his undeclared preferences, which are inferred from a ranking he produces over a reduced set of retrieved documents. We propose an ordinal regression based model for interactive ranking which uses both the information given by this subjective ranking and the multicriteria evaluation of the ranked documents to optimally adjust the parameters of a ranking model. This model consists of a set of additive value functions which are built so that they are as compatible as possible with the subjective ranking. The preference information used in our model requires reasonable cognitive effort from the user.
Keywords: Ordinal Regression; UTA Method; Multiple Criteria Analysis; Interactive Information Retrieval Model; Aggregation; Personalization; Implicit User Profile; Relevance Feedback
Navigating in the Dark: Modeling Uncertainty in Ad Hoc Retrieval Using Multiple Relevance Models BIBAKFull-Text 79-91
  Natali Soskin; Oren Kurland; Carmel Domshlak
We develop a novel probabilistic approach to ad hoc retrieval that explicitly addresses the uncertainty about the information need underlying a given query. In doing so, we account for the special role of the corpus in the retrieval process. The derived retrieval method integrates multiple relevance models by using estimates of their faithfulness to the presumed information need. Empirical evaluation demonstrates the performance merits of the proposed approach.
Keywords: relevance models; ad hoc retrieval; faithfulness measures

Query and Term Models

A Belief Model of Query Difficulty That Uses Subjective Logic BIBAFull-Text 92-103
  Christina Lioma; Roi Blanco; Raquel Mochales Palau; Marie-Francine Moens
The difficulty of a user query can affect the performance of Information Retrieval (IR) systems. This work presents a formal model for quantifying and reasoning about query difficulty as follows: Query difficulty is considered to be a subjective belief, which is formulated on the basis of various types of evidence. This allows us to define a belief model and a set of operators for combining evidence of query difficulty. The belief model uses subjective logic, a type of probabilistic logic for modeling uncertainties. An application of this model with semantic and pragmatic evidence about 150 TREC queries illustrates the potential flexibility of this framework in expressing and combining evidence. To our knowledge, this is the first application of subjective logic to IR.
"A term is known by the company it keeps": On Selecting a Good Expansion Set in Pseudo-Relevance Feedback BIBAKFull-Text 104-115
  Raghavendra Udupa; Abhijit Bhole; Pushpak Bhattacharyya
It is well known that pseudo-relevance feedback (PRF) improves the retrieval performance of Information Retrieval (IR) systems in general. However, a recent study by Cao et al. [3] has shown that a non-negligible fraction of expansion terms used by PRF algorithms are harmful to the retrieval. In other words, a PRF algorithm would be better off if it were to use only a subset of the feedback terms. The challenge then is to find a good expansion set from the set of all candidate expansion terms. A natural approach to the problem is to make a term independence assumption and use one or more term selection criteria or a statistical classifier to identify good expansion terms independently of each other. In this work, we challenge this approach and show empirically that a feedback term is, in general, neither good nor bad in itself; the behavior of a term depends very much on the other expansion terms. Our finding implies that, in general, a good expansion set cannot be found under a term independence assumption. As a principled solution to the problem, we propose spectral partitioning of expansion terms using a specific term-term interaction matrix. We demonstrate on several test collections that expansion terms can be partitioned into two sets and that the better of the two sets gives substantial improvements in retrieval performance over model-based feedback.
Keywords: Information Retrieval; Relevance Feedback; Pseudo-relevance Feedback; Expansion Terms; Term-Document Matrix
An Effective Approach to Verbose Queries Using a Limited Dependencies Language Model BIBAFull-Text 116-127
  Eduard Hoenkamp; Peter Bruza; Dawei Song; Qiang Huang
Intuitively, any 'bag of words' approach in IR should benefit from taking term dependencies into account. Unfortunately, for years the results of exploiting such dependencies have been mixed or inconclusive. To improve the situation, this paper shows how the natural language properties of the target documents can be used to transform and enrich the term dependencies into more useful statistics. This is done in three steps. The term co-occurrence statistics of queries and documents are each represented by a Markov chain. The paper proves that such a chain is ergodic, and therefore its asymptotic behavior is unique, stationary, and independent of the initial state. Next, the stationary distribution is taken to model queries and documents, rather than their initial distributions. Finally, ranking is achieved following the customary language modeling paradigm. The main contribution of this paper is to argue why the asymptotic behavior of the document model is a better representation than just the document's initial distribution. A secondary contribution is to investigate the practical application of this representation as queries become increasingly verbose. In the experiments (based on Lemur's search engine substrate) the default query model was replaced by the stationary distribution of the query. Just modeling the query this way already resulted in significant improvements over a standard language model baseline. The results were on a par with or better than those of more sophisticated algorithms that use fine-tuned parameters or extensive training. Moreover, the more verbose the query, the more effective the approach seems to become.
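The central construct, the stationary distribution of an ergodic chain, can be sketched in a few lines (the transition matrix below is a toy, not co-occurrence statistics from any corpus):

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of an ergodic Markov chain given as a
    row-stochastic matrix P (list of rows), computed by iterating
    pi <- pi P from the uniform distribution. Ergodicity guarantees
    convergence to a unique limit independent of the start state."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi
```

For P = [[0.9, 0.1], [0.5, 0.5]] the limit is (5/6, 1/6) regardless of the initial distribution, which is exactly the property the paper exploits when it replaces a query's initial term distribution with its stationary one.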
Time-Sensitive Language Modelling for Online Term Recurrence Prediction BIBAFull-Text 128-138
  Dell Zhang; Jinsong Lu; Robert Mao; Jian-Yun Nie
We address the problem of online term recurrence prediction: for a stream of terms, at each time point predict what term is going to recur next in the stream given the term occurrence history so far. It has many applications, for example, in Web search and social tagging. In this paper, we propose a time-sensitive language modelling approach to this problem that effectively combines term frequency and term recency information, and describe how this approach can be implemented efficiently by an online learning algorithm. Our experiments on a real-world Web query log dataset show significant improvements over standard language modelling.

Evaluation

Score Distributions in Information Retrieval BIBAFull-Text 139-151
  Avi Arampatzis; Stephen Robertson; Jaap Kamps
We review the history of modeling score distributions, focusing on the mixture of normal-exponential by investigating the theoretical as well as the empirical evidence supporting its use. We discuss previously suggested conditions which valid binary mixture models should satisfy, such as the Recall-Fallout Convexity Hypothesis, and formulate two new hypotheses considering the component distributions under some limiting conditions of parameter values. From all the mixtures suggested in the past, the current theoretical argument points to the two-gamma as the most likely universal model, with the normal-exponential being a usable approximation. Beyond the theoretical contribution, we provide new experimental evidence showing that vector space or geometric models, and BM25, are "friendly" to the normal-exponential, and that the non-convexity problem that the mixture possesses is practically not severe.
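As an illustration of how such a mixture is used downstream (parameter values below are invented), a retrieval score can be converted into a posterior probability of relevance under the normal-exponential model: relevant scores follow a normal density, non-relevant scores an exponential one.

```python
import math

def p_relevant(score, lam, rate, mu, sigma):
    """Posterior P(relevant | score) under the normal-exponential
    mixture: lam is the prior weight of the relevant (normal)
    component, rate the parameter of the non-relevant (exponential)
    component."""
    f_rel = (math.exp(-0.5 * ((score - mu) / sigma) ** 2)
             / (sigma * math.sqrt(2 * math.pi)))
    f_non = rate * math.exp(-rate * score) if score >= 0 else 0.0
    num = lam * f_rel
    den = num + (1 - lam) * f_non
    return num / den if den > 0 else 0.0
```

With made-up parameters (lam = 0.1, rate = 2.0, mu = 2.0, sigma = 0.5) the posterior rises with the score, which is the behaviour score-distribution methods rely on for tasks such as thresholding and fusion.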
Modeling the Score Distributions of Relevant and Non-relevant Documents BIBAFull-Text 152-163
  Evangelos Kanoulas; Virgil Pavlu; Keshi Dai; Javed A. Aslam
Empirical modeling of the score distributions associated with retrieved documents is an essential task for many retrieval applications. In this work, we propose modeling the relevant documents' scores by a mixture of Gaussians and modeling the non-relevant scores by a Gamma distribution. Applying variational inference we automatically trade-off the goodness-of-fit with the complexity of the model. We test our model on traditional retrieval functions and actual search engines submitted to TREC. We demonstrate the utility of our model in inferring precision-recall curves. In all experiments our model outperforms the dominant exponential-Gaussian model.
Modeling Expected Utility of Multi-session Information Distillation BIBAKFull-Text 164-175
  Yiming Yang; Abhimanyu Lad
An open challenge in information distillation is the evaluation and optimization of the utility of ranked lists with respect to flexible user interactions over multiple sessions. Utility depends on both the relevance and novelty of documents, and the novelty in turn depends on the user interaction history. However, user behavior is non-deterministic. We propose a new probabilistic framework for stochastic modeling of user behavior when browsing multi-session ranked lists, and a novel approximation method for efficient computation of the expected utility over numerous user-interaction patterns. Using this framework, we present the first utility-based evaluation over multi-session search scenarios defined on the TDT4 corpus of news stories, using a state-of-the-art information distillation system. We demonstrate that the distillation system obtains a 56.6% utility enhancement by combining multi-session adaptive filtering with novelty detection and utility-based optimization of system parameters for optimal ranked list lengths.
Keywords: Multi-session distillation; utility evaluation based both on novelty and relevance; stochastic modeling of user browsing behavior
Specificity Aboutness in XML Retrieval BIBAFull-Text 176-187
  Tobias Blanke; Mounia Lalmas
This paper presents a theoretical methodology to evaluate filters in XML retrieval. Theoretical evaluation is concerned with the formal investigation of qualitative properties of retrieval models. XML retrieval deals with retrieving those document components that specifically answer a query, and filters are a method of delivering the most focused answers. Our theoretical evaluation will critically analyse how filters achieve this.

Novelty and Diversity

An Effectiveness Measure for Ambiguous and Underspecified Queries BIBAFull-Text 188-199
  Charles L. A. Clarke; Maheedhar Kolla; Olga Vechtomova
Building upon simple models of user needs and behavior, we propose a new measure of novelty and diversity for information retrieval evaluation. We combine ideas from three recently proposed effectiveness measures in an attempt to achieve a balance between the complexity of genuine user needs and the simplicity required for feasible evaluation.
An Analysis of NP-Completeness in Novelty and Diversity Ranking BIBAFull-Text 200-211
  Ben Carterette
A useful ability for search engines is to be able to rank objects with novelty and diversity: the top k documents retrieved should cover possible interpretations of a query with some distribution, or should contain a diverse set of subtopics related to the user's information need, or contain nuggets of information with little redundancy. Evaluation measures have been introduced to measure the effectiveness of systems at this task, but these measures have worst-case NP-complete computation time. We use simulation to investigate the implications of this for optimization and evaluation of retrieval systems.
From "Identical" to "Similar": Fusing Retrieved Lists Based on Inter-document Similarities BIBAKFull-Text 212-223
  Anna Khudyak Kozorovitzky; Oren Kurland
We present a novel approach to fusing document lists that are retrieved in response to a query. Our approach is based on utilizing information induced from inter-document similarities. Specifically, the key insight guiding the derivation of our methods is that similar documents from different lists can provide relevance-status support to each other. We use a graph-based method to model relevance-status propagation between documents. The propagation is governed by inter-document-similarities and by retrieval scores of documents in the lists. Empirical evaluation shows the effectiveness of our methods in fusing TREC runs.
Keywords: fusion; inter-document-similarities; similarity-based fusion

New Perspectives in IR

A Quantum-Based Model for Interactive Information Retrieval BIBAFull-Text 224-231
  Benjamin Piwowarski; Mounia Lalmas
Even the best information retrieval model cannot always identify the most useful answers to a user query. This is in particular the case with web search systems, where it is known that users tend to minimise their effort to access relevant information. It is, however, believed that the interaction between users and a retrieval system, such as a web search engine, can be exploited to provide better answers to users. Interactive Information Retrieval (IR) systems, in which users access information through a series of interactions with the search system, are concerned with building models for IR, where interaction plays a central role. In this paper, we propose a general framework for interactive IR that is able to capture the full interaction process in a principled way. Our approach relies upon a generalisation of the probability framework of quantum physics.
The Quantum Probability Ranking Principle for Information Retrieval BIBAFull-Text 232-240
  Guido Zuccon; Leif A. Azzopardi; Keith van Rijsbergen
While the Probability Ranking Principle for Information Retrieval provides the basis for formal models, it makes a very strong assumption regarding the dependence between documents. However, it has been observed that in real situations this assumption does not always hold. In this paper we propose a reformulation of the Probability Ranking Principle based on quantum theory. Quantum probability theory naturally includes interference effects between events. We posit that this interference captures the dependency between judgements of document relevance. The outcome is a more sophisticated principle, the Quantum Probability Ranking Principle, that provides a more sensitive ranking which caters for interference/dependence between documents' relevance.
Written Texts as Statistical Mechanical Problem BIBAFull-Text 241-248
  Kostadin Koroutchev; Elka Korutcheva; Jian Shen
In this article we present a model of human written text based on statistical mechanics considerations. The empirical derivation of the potential energy for the parts of the text and the calculation of the thermodynamic parameters of the system show that the "specific heat" corresponds to the semantic classification of the words in the text, separating keywords, function words and common words. This can give advantages when the model is used in text search mechanisms.
What Happened to Content-Based Information Filtering? BIBAFull-Text 249-256
  Nikolaos Nanas; Anne De Roeck; Manolis Vavalis
Personalisation can have a significant impact on the way information is disseminated on the web today. Information Filtering can be a significant ingredient towards a personalised web. Collaborative Filtering is already being applied successfully for generating personalised recommendations of music tracks, books, movies and more. The same is not true for Content-Based Filtering. In this paper, we identify some possible reasons for the notable absence of a broad range of personalised information delivery and dissemination services on the web today. We advocate that a more holistic approach to user profiling is required and we discuss the series of still open, challenging research issues raised.

Retrieval Models

Prior Information and the Determination of Event Spaces in Probabilistic Information Retrieval Models BIBAFull-Text 257-264
  Corrado Boscarino; Arjen P. de Vries
A mismatch between different event spaces has been used to argue against rank equivalence of classic probabilistic models of information retrieval and language models. We question the effectiveness of this strategy and argue that a convincing solution should be sought in a correct procedure for designing adequate priors for probabilistic reasoning. Acknowledging our solution of the event space issue invites us to rethink the relation between probabilistic models, statistics and logic in the context of IR.
Robust Word Similarity Estimation Using Perturbation Kernels BIBAFull-Text 265-272
  Kevyn Collins-Thompson
We introduce perturbation kernels, a new class of similarity measure for information retrieval that casts word similarity in terms of multi-task learning. Perturbation kernels model uncertainty in the user's query by choosing a small number of variations in the relative weights of the query terms to build a more complete picture of the query context, which is then used to compute a form of expected distance between words. Our approach has a principled mathematical foundation, a simple analytical form, and makes few assumptions about the underlying retrieval model, making it easy to apply in a broad family of existing query expansion and model estimation algorithms.
Possibilistic Similarity Estimation and Visualization BIBAKFull-Text 273-280
  Anas Dahabiah; John Puentes; Basel Solaiman
In this paper, we present a very general approach to representing and visualizing the similarity between objects that contain heterogeneous, imperfect and missing attributes, in order to achieve efficient analysis and retrieval of information by organizing and gathering these objects into meaningful groups. Our method is essentially based on possibility theory to estimate the similarity, and on spatial, graphical, and clustering-based representational models to visualize and represent its structure. Our approach is applied to a real digestive image database (http://i3se009d.univ-brest.fr/ password view2006 [4]). Without any a priori medical knowledge concerning the key attributes of the pathologies, and without any complicated preprocessing of the imperfect data, results show that we are able to visualize and organize the different categories of digestive pathologies. These results were validated by a physician.
Keywords: Similarity; Possibility Theory; Graph Theory; Scaling; Clustering
A New Measure of the Cluster Hypothesis BIBAKFull-Text 281-288
  Mark D. Smucker; James Allan
We have found that the nearest neighbor (NN) test is an insufficient measure of the cluster hypothesis. The NN test is a local measure of the cluster hypothesis. Designers of new document-to-document similarity measures may incorrectly report effective clustering of relevant documents if they use the NN test alone. Utilizing a measure from network analysis, we present a new, global measure of the cluster hypothesis: normalized mean reciprocal distance. When used together with a local measure, such as the NN test, this new global measure allows researchers to better measure the cluster hypothesis.
Keywords: Cluster hypothesis; nearest neighbor test; relevant document networks; normalized mean reciprocal distance
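A global quantity in this spirit, the mean reciprocal shortest-path distance over a relevant-document network, can be sketched as follows (a plausible reading of such a measure, not necessarily the paper's exact normalisation):

```python
from collections import deque

def mean_reciprocal_distance(adj):
    """Mean reciprocal shortest-path distance over ordered pairs of
    nodes in a directed graph (adjacency dict: node -> neighbours).
    Unreachable pairs contribute 0, so the value lies in [0, 1];
    1 means every relevant document links directly to every other."""
    nodes = list(adj)
    total, pairs = 0.0, 0
    for s in nodes:
        dist = {s: 0}           # BFS from s for unweighted distances
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for t in nodes:
            if t != s:
                pairs += 1
                if t in dist:
                    total += 1.0 / dist[t]
    return total / pairs if pairs else 0.0
```

Unlike a nearest-neighbor test, this value degrades as relevant documents drift apart in the similarity network, which is what makes it a global rather than local measure.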

User Aspects

Explaining User Performance in Information Retrieval: Challenges to IR Evaluation BIBAFull-Text 289-296
  Kalervo Järvelin
The paper makes three points of significance for IR research: (1) The Cranfield paradigm of IR evaluation seems to lose power when one looks at human instead of system performance. (2) Searchers using IR systems in real life use rather short queries, which individually often have poor performance. However, when used in sessions, they may be surprisingly effective. The searcher's strategies have not been sufficiently described and cannot therefore be properly understood, supported, or evaluated. (3) Searchers in real life seek to optimize the entire information access process, not just result quality. Evaluation of output alone is insufficient to explain searcher behavior.
A Four-Factor User Interaction Model for Content-Based Image Retrieval BIBAKFull-Text 297-304
  Haiming Liu; Victoria Uren; Dawei Song; Stefan Rüger
In order to bridge the "Semantic gap", a number of relevance feedback (RF) mechanisms have been applied to content-based image retrieval (CBIR). However, current RF techniques in most existing CBIR systems still lack satisfactory user interaction, although some work has been done to improve both the interaction and the search accuracy. In this paper, we propose a four-factor user interaction model and investigate its effects on CBIR through an empirical evaluation. Whilst the model was developed for our research purposes, we believe it could be adapted to any content-based search system.
Keywords: User interaction; Relevance feedback; Content-based image retrieval
Predicting Query Performance by Query-Drift Estimation BIBAKFull-Text 305-312
  Anna Shtok; Oren Kurland; David Carmel
Predicting query performance, that is, the effectiveness of a search performed in response to a query, is a highly important and challenging problem. Our novel approach to addressing this challenge is based on estimating the potential amount of query drift in the result list, i.e., the presence (and dominance) of aspects or topics not related to the query in top-retrieved documents. We argue that query-drift can potentially be estimated by measuring the diversity (e.g., standard deviation) of the retrieval scores of these documents. Empirical evaluation demonstrates the prediction effectiveness of our approach for several retrieval models. Specifically, the prediction success is better, over most tested TREC corpora, than that of state-of-the-art prediction methods.
Keywords: query-performance prediction; query drift; score distribution
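The basic predictor can be sketched in a few lines (a simplified reading: the paper's estimator also involves normalisation details not shown here). The idea is that a large spread among top-ranked retrieval scores signals a focused, low-drift result list.

```python
import math

def drift_predictor(scores, k=100):
    """Query-performance predictor based on the spread of the top-k
    retrieval scores: a higher standard deviation is taken as
    evidence of less query drift, i.e. a more focused result list."""
    top = sorted(scores, reverse=True)[:k]
    mean = sum(top) / len(top)
    var = sum((s - mean) ** 2 for s in top) / len(top)
    return math.sqrt(var)
```

A flat score list (all documents scored alike) yields 0, the weakest possible prediction of retrieval effectiveness under this heuristic.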

Web Models

What's in a Link? From Document Importance to Topical Relevance BIBAFull-Text 313-321
  Marijn Koolen; Jaap Kamps
Web information retrieval is best known for its use of the Web's link structure as a source of evidence. Global link evidence is by nature query-independent, and is therefore no direct indicator of the topical relevance of a document for a given search request. As a result, link information is usually considered to be useful to identify the 'importance' of documents. Local link evidence, in contrast, is query-dependent and could in principle be related to the topical relevance. We analyse the link evidence in Wikipedia using a large set of ad hoc retrieval topics and relevance judgements to investigate the relation between link evidence and topical relevance.
Avoiding Bias in Text Clustering Using Constrained K-means and May-Not-Links BIBAFull-Text 322-329
  M. Eduardo Ares; Javier Parapar; Álvaro Barreiro
In this paper we present a new clustering algorithm which extends traditional batch k-means by enabling the introduction of domain knowledge in the form of Must, Cannot, May and May-Not rules between the data points. In addition, we apply the presented method to the task of avoiding bias in clustering. Evaluation carried out on standard collections showed considerable improvements in effectiveness over previous constrained and non-constrained algorithms for the given task.
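A sketch of the constrained assignment step (only the Cannot rule is shown; the Must, May and May-Not rules and the centroid-update step are omitted):

```python
def constrained_assign(points, centroids, cannot, dist):
    """One assignment pass of a constrained k-means: each point joins
    the nearest centroid whose cluster does not already contain a
    point it has a cannot-link with; if every cluster is blocked it
    falls back to the nearest one, ignoring the constraint."""
    clusters = [[] for _ in centroids]
    for i, p in enumerate(points):
        order = sorted(range(len(centroids)),
                       key=lambda c: dist(p, centroids[c]))
        for c in order:
            if all((i, j) not in cannot and (j, i) not in cannot
                   for j in clusters[c]):
                clusters[c].append(i)
                break
        else:  # all clusters blocked for this point
            clusters[order[0]].append(i)
    return clusters
```

With a cannot-link between two nearby points, the second of them is pushed to the next-nearest cluster, which is how such constraints can steer a clustering away from an undesired (biased) grouping.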
Optimizing WebPage Interest BIBAKFull-Text 330-337
  Willem Elbers; Theo van der Weide
In the rapidly evolving and growing environment of the internet, web site owners aim to maximize interest for their web site. In this article we propose a model, which combines the static structure of the internet with activity based data, to compute an interest based ranking. This ranking can be used to gain more insight into the flow of users over the internet, optimize the position of a web site and improve strategic decisions and investments. The model consists of a static centrality based component and a dynamic activity based component. The components are used to create a Markov Model in order to compute a ranking.
Keywords: web graph; interest; centrality; user flow; Markov Model
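The blend of static structure and activity data can be sketched as a Markov chain whose transition probabilities interpolate uniform link-following with observed click counts; the stationary distribution then serves as the interest ranking. A sketch under assumed names and an assumed blending parameter `alpha` (the paper's actual combination may differ):

```python
def interest_rank(links, activity, alpha=0.5, damping=0.85, iters=50):
    """Rank pages by the stationary distribution of a Markov chain whose
    transitions blend static link structure with observed user activity.

    links: {page: [outlinks]}; activity: {(src, dst): click count}.
    alpha weighs activity evidence against uniform link-following.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - damping) / n for p in pages}
        for p in pages:
            outs = links[p]
            if not outs:                      # dangling page: spread uniformly
                for q in pages:
                    nxt[q] += damping * rank[p] / n
                continue
            total_clicks = sum(activity.get((p, q), 0) for q in outs)
            for q in outs:
                static = 1.0 / len(outs)      # uniform link-following
                dynamic = (activity.get((p, q), 0) / total_clicks
                           if total_clicks else static)
                nxt[q] += damping * rank[p] * ((1 - alpha) * static + alpha * dynamic)
        rank = nxt
    return rank
```

With `alpha=0` this reduces to plain PageRank-style centrality; with `alpha=1` the walk follows observed user flow only.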
The "Beautiful" in Information BIBAFull-Text 338-341
  Gerald Benoit
At the intersection of retrieval and visualization are opportunities to learn more about IR's view of knowledge and evidence by considering Kantian and post-modern philosophies of aesthetics.
IR Evaluation without a Common Set of Topics BIBAKFull-Text 342-345
  Matteo Cattelan; Stefano Mizzaro
Usually, system effectiveness evaluation in a TREC-like environment is performed on a common set of topics. We show that even when using different topics for different systems, a reliable evaluation can be obtained, and that reliability increases by using appropriate topic selection strategies and metric normalizations.
Keywords: IR effectiveness; TREC; topics
An Ad Hoc Information Retrieval Perspective on PLSI through Language Model Identification BIBAFull-Text 346-349
  Jean-Cédric Chappelier; Emmanuel Eckard
This paper proposes a new document-query similarity for PLSI that allows queries to be used in PLSI without folding-in. We compare this similarity to Fisher kernels, the state-of-the-art approach for PLSI, on a corpus of 1M+ word occurrences coming from TREC-AP.
Less Is More: Maximal Marginal Relevance as a Summarisation Feature BIBAFull-Text 350-353
  Jan Frederik Forst; Anastasios Tombros; Thomas Roelleke
Summarisation approaches aim to provide the most salient concepts of a text in a condensed representation. Repetition of extracted material in the generated summary should be avoided. Carbonell and Goldstein proposed Maximal Marginal Relevance as a measure to increase the diversity of documents retrieved by an IR system, and developed a summariser based on MMR. In this paper, we look at the viability of MMR as a feature in the traditional feature-based summarisation approach proposed by Edmundson.
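Carbonell and Goldstein's MMR greedily picks the item maximising a weighted trade-off between query relevance and redundancy with what is already selected. A minimal sketch (the dict-based relevance and similarity signatures are assumptions for illustration):

```python
def mmr_select(candidates, sim_query, sim, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: at each step pick the candidate
    maximising  lam * relevance - (1 - lam) * redundancy,
    where redundancy is the max similarity to anything already selected.

    candidates: item ids; sim_query[c]: relevance of c to the query;
    sim(a, b): similarity between two items; k: number of items to select.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * sim_query[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

For summarisation, the candidates are sentences: a sentence nearly identical to one already extracted scores poorly even if it is highly relevant, which is exactly the anti-repetition behaviour the abstract describes.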
On the Notion of "An Information Need" BIBAFull-Text 354-357
  Eduard Hoenkamp
'Information need' is a notion in IR that is ubiquitous, important, and intuitively clear. So far, surprisingly, the term seems to have defied formal definition. Of course, IR can continue to prosper without a formalization of 'information need'. Yet as a field matures there comes a time when frequently used notions should be formalized to make them amenable to scrutiny. For IR, such a formalization should (1) be independent of a particular query language or document model, (2) allow that users formulate a need for information that may be unavailable or even nonexistent, and (3) allow that users try to circumscribe the very information they do not possess. To this end, the paper uses lattice theory to define a 'formal information need', which, we argue, coincides with the intuitive notion precisely when a user's need for information can actually be filled.
A Logical Inference Approach to Query Expansion with Social Tags BIBAFull-Text 358-361
  Christina Lioma; Roi Blanco; Marie-Francine Moens
Query Expansion (QE) refers to the Information Retrieval (IR) technique of adding assumed relevant terms to a query in order to render it more informative, and hence more likely to retrieve relevant documents. A key problem is how to identify the terms to be added, and how to integrate them into the original query. We address this problem by using as expansion terms social tags that are freely available on the Web. We integrate these tags into the query by treating the QE process as a logical inference (initially proposed in [3]) and by considering the addition of tags as an extra deduction to this process. This work extends Nie's logical inference formalisation of QE to process social tags, and proposes an estimation of tag salience, which is experimentally shown to yield competitive retrieval performance.
Evaluating Mobile Proactive Context-Aware Retrieval: An Incremental Benchmark BIBAKFull-Text 362-365
  Davide Menegon; Stefano Mizzaro; Elena Nazzi; Luca Vassena
We present the evaluation of a novel application for Web content perusal by means of context-aware mobile devices that proactively query an external search engine. To this aim, we develop a TREC-like benchmark and use it to evaluate different strategies for automatic query construction on the basis of the user's current context. We discuss both the methodology and the results.
Keywords: TREC; IR effectiveness; location based systems
Predicting the Usefulness of Collection Enrichment for Enterprise Search BIBAFull-Text 366-370
  Jie Peng; Ben He; Iadh Ounis
Query Expansion (QE) often improves the retrieval performance of an Information Retrieval (IR) system. However, as enterprise intranets are often sparse in nature, with limited use of alternative lexical representations between authors, it can be advantageous to use Collection Enrichment (CE) to gather higher quality pseudo-feedback documents. In this paper, we propose the use of query performance predictors to selectively apply CE on a per-query basis. We thoroughly evaluate our approach on the CERC standard test collection and its corresponding topic sets from the TREC 2007 & 2008 Enterprise track document search tasks. We experiment with 3 different external resources and 3 different query performance predictors. Our experimental results demonstrate that our proposed approach leads to a significant improvement in retrieval performance.
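The per-query decision rule can be sketched as follows; all names, the thresholding direction, and the predictor are illustrative assumptions, not the paper's concrete predictors or resources:

```python
def selective_ce(query, predictor, retrieve_local, retrieve_external, threshold):
    """Selectively apply Collection Enrichment: when a query performance
    predictor suggests the query will perform poorly on the sparse local
    (enterprise) collection, draw pseudo-relevance-feedback documents
    from a larger external corpus instead.
    """
    if predictor(query) < threshold:
        return retrieve_external(query)   # enrich from the external resource
    return retrieve_local(query)          # local collection looks sufficient
```

The expansion terms are then extracted from the returned feedback documents and applied to the original query against the local collection as usual.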
Ranking List Dispersion as a Query Performance Predictor BIBAFull-Text 371-374
  Joaquín Pérez-Iglesias; Lourdes Araujo
In this paper we introduce a novel approach to query performance prediction based on the dispersion of ranking-list scores. Starting from the hypothesis that good and poor performance queries produce different score distributions, we introduce a set of measures that capture the differences between the two types of distribution. Used as prediction values, measures based on the standard deviation of ranking-list scores show a significant degree of correlation with average precision.
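The intuition is that a flat score curve suggests the query failed to separate relevant from non-relevant documents. A minimal sketch of one such dispersion measure (the cutoff `top` and the plain population standard deviation are assumptions; the paper proposes a family of measures):

```python
def score_dispersion(scores, top=100):
    """Standard deviation of the top-ranked retrieval scores, used as a
    post-retrieval query performance predictor: higher dispersion is
    hypothesised to indicate a better-performing query.
    """
    head = sorted(scores, reverse=True)[:top]
    mean = sum(head) / len(head)
    return (sum((s - mean) ** 2 for s in head) / len(head)) ** 0.5
```

A ranking with a few sharply higher scores yields a larger value than one whose scores are nearly uniform.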
Semi-subsumed Events: A Probabilistic Semantics of the BM25 Term Frequency Quantification BIBAFull-Text 375-379
  Hengzhi Wu; Thomas Roelleke
Through BM25, the asymptotic term frequency quantification TF = tf/(tf+K), where tf is the within-document term frequency and K is a normalisation factor, became popular. This paper reports a finding regarding the meaning of this TF quantification: in the triangle of independence and subsumption, the TF quantification forms the altitude, that is, the middle ground between independent and subsumed events. We refer to this new assumption as semi-subsumed. This finding of a well-defined probabilistic assumption solves the probabilistic interpretation of the BM25 TF quantification, and it also has wider implications for probability theory.
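The saturating behaviour of the quantification itself is easy to see numerically; a minimal sketch (the default K = 1.2 is a conventional BM25 setting assumed here, not taken from the paper):

```python
def bm25_tf(tf, k=1.2):
    """BM25's saturating term-frequency component TF = tf / (tf + K):
    each extra occurrence of a term yields a diminishing gain, and the
    value approaches (but never reaches) 1 as tf grows.
    """
    return tf / (tf + k)
```

Because the gains diminish, the difference between tf = 1 and tf = 2 matters far more than the difference between tf = 20 and tf = 21, which is the behaviour the semi-subsumption assumption gives a probabilistic meaning.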
Batch-Mode Computational Advertising Based on Modern Portfolio Theory BIBAFull-Text 380-383
  Dell Zhang; Jinsong Lu
The research on computational advertising has so far focused on finding the single best ad. In many real situations, however, more than one ad can be presented. Although it is possible to address this problem myopically by applying a single-ad optimisation technique in serial mode, i.e., one ad at a time, this approach can be ineffective and inefficient because it ignores the correlation between ads. In this paper, we address the problem of finding the best ads in batch mode, i.e., assembling the optimal set of ads to be presented together. The key idea is to achieve maximum revenue while controlling the level of risk by diversifying the set of ads. We show how Modern Portfolio Theory can be applied to this problem to provide elegant solutions and deep insights.
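One plausible reading of the mean-variance idea, sketched as a greedy selector; the names, the greedy strategy, and the covariance representation are assumptions for illustration, not the paper's formulation:

```python
def select_ads(expected, cov, k, risk_aversion=0.5):
    """Greedy batch ad selection under a Markowitz-style mean-variance
    objective: maximise the sum of expected revenues minus risk_aversion
    times the portfolio variance, so highly correlated ads penalise each
    other and the chosen set is diversified.

    expected: {ad: expected revenue}; cov: {(a, b): revenue covariance}.
    """
    def var(ads):
        return sum(cov.get((a, b), cov.get((b, a), 0.0))
                   for a in ads for b in ads)

    chosen = []
    pool = list(expected)
    while pool and len(chosen) < k:
        def gain(ad):
            before = sum(expected[a] for a in chosen) - risk_aversion * var(chosen)
            after = (sum(expected[a] for a in chosen + [ad])
                     - risk_aversion * var(chosen + [ad]))
            return after - before
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

With risk_aversion = 0 this degenerates to picking the top-k ads by expected revenue (the serial-mode baseline); a positive risk_aversion trades a little expected revenue for a less correlated, lower-risk set.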