HCI Bibliography Home | HCI Conferences | IR Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
IR Tables of Contents: 00010203040506070809101112131415

Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname:Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors:Fabio Crestani; Stéphane Marchand-Maillet; Hsin-Hsi Chen; Efthimis N. Efthimiadis; Jacques Savoy
Location:Geneva, Switzerland
Dates:2010-Jul-19 to 2010-Jul-23
Standard No:ISBN: 1-4503-0153-3, 978-1-4503-0153-4; ACM DL: Table of Contents hcibib: IR10
Links:Conference Home Page
  1. Keynote
  2. Clustering I
  3. User models
  4. Applications I
  5. Search engine architectures and scalability
  6. Link analysis & advertising
  7. Learning to rank
  8. Clustering II
  9. Filtering and recommendation
  10. Information retrieval theory
  11. Keynote
  12. Language models & IR theory
  13. Query representations & reformulations
  14. Automatic classification
  15. Retrieval models and ranking
  16. User feedback & user models
  17. Web IR and social media search
  18. Document structure & adversarial information retrieval
  19. Users and interactive IR
  20. Document representation and content analysis
  21. Summarization & user feedback
  22. Query log analysis
  23. Test-collections
  24. Query analysis
  25. Effectiveness measures
  26. Multimedia information retrieval
  27. Non-English IR & evaluation
  28. Applications II
  29. Demonstrations
  30. Poster presentations
  31. Tutorials
  32. Doctoral consortium


Is the Cranfield paradigm outdated? BIBKFull-Text 1
  Donna Harman
Keywords: information retrieval evaluation

Clustering I

Prototype hierarchy based clustering for the categorization and navigation of web collections BIBAKFull-Text 2-9
  Zhao-Yan Ming; Kai Wang; Tat-Seng Chua
This paper presents a novel prototype hierarchy based clustering (PHC) framework for the organization of web collections. It solves simultaneously the problem of categorizing web collections and interpreting the clustering results for navigation. By utilizing prototype hierarchies and the underlying topic structures of the collections, PHC is modeled as a multi-criterion optimization problem based on minimizing the hierarchy evolution, maximizing category cohesiveness and inter-hierarchy structural and semantic resemblance. The flexible design of metrics enables PHC to be a general framework for applications in various domains. In the experiments on categorizing 4 collections of distinct domains, PHC achieves 30% improvement in üF1 over the state-of-the-art techniques. Further experiments provide insights on performance variations with abstract and concrete domains, completeness of the prototype hierarchy, and effects of different combinations of optimization criteria.
Keywords: criterion function, hierarchical clustering, hierarchy induction, prototype hierarchy
Person name disambiguation by bootstrapping BIBAKFull-Text 10-17
  Minoru Yoshida; Masaki Ikeda; Shingo Ono; Issei Sato; Hiroshi Nakagawa
In this paper, we report our system that disambiguates person names in Web search results. The system uses named entities, compound key words, and URLs as features for document similarity calculation, which typically show high precision but low recall clustering results. We propose to use a two-stage clustering algorithm by bootstrapping to improve the low recall values, in which clustering results of the first stage are used to extract features used in the second stage clustering. Experimental results revealed that our algorithm yields better score than the best systems at the latest WePS workshop.
Keywords: clustering, person name disambiguation, web people search
Self-taught hashing for fast similarity search BIBAKFull-Text 18-25
  Dell Zhang; Jun Wang; Deng Cai; Jinsong Lu
The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.
Keywords: Laplacian Eigenmap, semantic hashing, similarity search, support vector machine

User models

Personalizing information retrieval for multi-session tasks: the roles of task stage and task type BIBAKFull-Text 26-33
  Jingjing Liu; Nicholas J. Belkin
Dwell time as a user behavior has been found in previous studies to be an unreliable predictor of document usefulness, with contextual factors such as the user's task needing to be considered in its interpretation. Task stage has been shown to influence search behaviors including usefulness judgments, as has task type. This paper reports on an investigation of how task stage and task type may help predict usefulness from the time that users spend on retrieved documents, over the course of several information seeking episodes. A 3-stage controlled experiment was conducted with 24 participants, each coming 3 times to work on 3 sub-tasks of a general task, couched either as "parallel" or "dependent" task type. The full task was to write a report on the general topic, with interim documents produced for each sub-task. Results show that task stage can help in inferring document usefulness from decision time, especially in the parallel task. The findings can be used to increase accuracy in predicting document usefulness and accordingly in personalizing search for multi-session tasks.
Keywords: contextual factors in IR, dwell time, personalization, task stage, task type
Predicting searcher frustration BIBAKFull-Text 34-41
  Henry A. Feild; James Allan; Rosie Jones
When search engine users have trouble finding information, they may become frustrated, possibly resulting in a bad experience (even if they are ultimately successful). In a user study in which participants were given difficult information seeking tasks, half of all queries submitted resulted in some degree of self-reported frustration. A third of all successful tasks involved at least one instance of frustration. By modeling searcher frustration, search engines can predict the current state of user frustration and decide when to intervene with alternative search strategies to prevent the user from becoming more frustrated, giving up, or switching to another search engine. We present several models to predict frustration using features extracted from query logs and physical sensors. We are able to predict frustration with a mean average precision of 65% from the physical sensors, and 87% from the query log features.
Keywords: query logs, searcher frustration, user modeling
The good, the bad, and the random: an eye-tracking study of ad quality in web search BIBAKFull-Text 42-49
  Georg Buscher; Susan T. Dumais; Edward Cutrell
We investigate how people interact with Web search engine result pages using eye-tracking. While previous research has focused on the visual attention devoted to the 10 organic search results, this paper examines other components of contemporary search engines, such as ads and related searches. We systematically varied the type of task (informational or navigational), the quality of the ads (relevant or irrelevant to the query), and the sequence in which ads of different quality were presented. We measured the effects of these variables on the distribution of visual attention and on task performance. Our results show significant effects of each variable. The amount of visual attention that people devote to organic results depends on both task type and ad quality. The amount of visual attention that people devote to ads depends on their quality, but not the type of task. Interestingly, the sequence and predictability of ad quality is also an important factor in determining how much people attend to ads. When the quality of ads varied randomly from task to task, people paid little attention to the ads, even when they were good. These results further our understanding of how attention devoted to search results is influenced by other page elements, and how previous search experiences influence how people attend to the current page.
Keywords: gaze tracking, search engine results pages, user study

Applications I

Ranking using multiple document types in desktop search BIBAKFull-Text 50-57
  Jinyoung Kim; W. Bruce Croft
A typical desktop environment contains many document types (email, presentations, web pages, pdfs, etc.) each with different metadata. Predicting which types of documents a user is looking for in the context of a given query is a crucial part of providing effective desktop search. The problem is similar to selecting resources in distributed IR, but there are some important differences.
   In this paper, we quantify the impact of type prediction in producing a merged ranking for desktop search and introduce a new prediction method that exploits type-specific metadata. In addition, we show that type prediction performance and search effectiveness can be further enhanced by combining existing methods of type prediction using discriminative learning models. Our experiments employ pseudo-desktop collections and a human computation game for acquiring realistic and reusable queries.
Keywords: desktop search, human computation game, information retrieval, semi-structured document retrieval, type prediction
Acquisition of instance attributes via labeled and related instances BIBAKFull-Text 58-65
  Enrique Alfonseca; Marius Pasca; Enrique Robledo-Arnuncio
This paper presents a method for increasing the quality of automatically extracted instance attributes by exploiting weakly-supervised and unsupervised instance relatedness data. This data consists of (a) class labels for instances and (b) distributional similarity scores. The method organizes the text-derived data into a graph, and automatically propagates attributes among related instances, through random walks over the graph. Experiments on various graph topologies illustrate the advantage of the method over both the original attribute lists and a per-class attribute extractor, both in terms of the number of attributes extracted per instance and the accuracy of the top-ranked attributes.
Keywords: distributional similarities, information extraction, instance attributes, labeled instances, unstructured text
Relevance and ranking in online dating systems BIBAKFull-Text 66-73
  Fernando Diaz; Donald Metzler; Sihem Amer-Yahia
Match-making systems refer to systems where users want to meet other individuals to satisfy some underlying need. Examples of match-making systems include dating services, resume/job bulletin boards, community based question answering, and consumer-to-consumer marketplaces. One fundamental component of a match-making system is the retrieval and ranking of candidate matches for a given user. We present the first in-depth study of information retrieval approaches applied to match-making systems. Specifically, we focus on retrieval for a dating service. This domain offers several unique problems not found in traditional information retrieval tasks. These include two-sided relevance, very subjective relevance, extremely few relevant matches, and structured queries. We propose a machine learned ranking function that makes use of features extracted from the uniquely rich user profiles that consist of both structured and unstructured attributes. An extensive evaluation carried out using data gathered from a real online dating service shows the benefits of our proposed methodology with respect to traditional match-making baseline systems. Our analysis also provides deep insights into the aspects of match-making that are particularly important for producing highly relevant matches.
Keywords: dating systems, relevance

Search engine architectures and scalability

Scalability of findability: effective and efficient IR operations in large information networks BIBAKFull-Text 74-81
  Weimao Ke; Javed Mostafa
It is crucial to study basic principles that support adaptive and scalable retrieval functions in large networked environments such as the Web, where information is distributed among dynamic systems. We conducted experiments on decentralized IR operations on various scales of information networks and analyzed effectiveness, efficiency, and scalability of various search methods. Results showed network structure, i.e., how distributed systems connect to one another, is crucial for retrieval performance. Relying on partial indexes of distributed systems, some level of network clustering enabled very efficient and effective discovery of relevant information in large scale networks. For a given network clustering level, search time was well explained by a poly-logarithmic relation to network size (i.e., the number of distributed systems), indicating a high scalability potential for searching in a growing information space. In addition, network clustering only involved local self-organization and required no global control -- clustering time remained roughly constant across the various scales of networks.
Keywords: clustering paradox, connectivity, decentralized search, distributed IR, network clustering, scalability, strong tie, weak tie
Caching search engine results over incremental indices BIBAKFull-Text 82-89
  Roi Blanco; Edward Bortnikov; Flavio Junqueira; Ronny Lempel; Luca Telloli; Hugo Zaragoza
A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.
   To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.
Keywords: real-time indexing, search engine caching
Query forwarding in geographically distributed search engines BIBAKFull-Text 90-97
  B. Barla Cambazoglu; Emre Varol; Enver Kayaaslan; Cevdet Aykanat; Ricardo Baeza-Yates
Query forwarding is an important technique for preserving the result quality in distributed search engines where the index is geographically partitioned over multiple search sites. The key component in query forwarding is the thresholding algorithm by which the forwarding decisions are given. In this paper, we propose a linear-programming-based thresholding algorithm that significantly outperforms the current state-of-the-art in terms of achieved search efficiency values. Moreover, we evaluate a greedy heuristic for partial index replication and investigate the impact of result cache freshness on query forwarding performance. Finally, we present some optimizations that improve the performance further, under certain conditions. We evaluate the proposed techniques by simulations over a real-life setting, using a large query log and a document collection obtained from Yahoo!.
Keywords: distributed IR, index replication, linear programming, optimization, query forwarding, result caching, search engines
A joint probabilistic classification model for resource selection BIBAKFull-Text 98-105
  Dzung Hong; Luo Si; Paul Bracke; Michael Witt; Tim Juchcinski
Resource selection is an important task in Federated Search to select a small number of most relevant information sources. Current resource selection algorithms such as GlOSS, CORI, ReDDE, Geometric Average and the recent classification-based method focus on the evidence of individual information sources to determine the relevance of available sources. Current algorithms do not model the important relationship information among individual sources. For example, an information source tends to be relevant to a user query if it is similar to another source with high probability of being relevant. This paper proposes a joint probabilistic classification model for resource selection. The model estimates the probability of relevance of information sources in a joint manner by considering both the evidence of individual sources and their relationship. An extensive set of experiments have been conducted on several datasets to demonstrate the advantage of the proposed model.
Keywords: federated search, joint classification, resource selection

Link analysis & advertising

Temporal click model for sponsored search BIBAKFull-Text 106-113
  Wanhong Xu; Eren Manavoglu; Erick Cantu-Paz
Previous studies on search engine click modeling have identified two presentation factors that affect users' behavior: (1) position bias: the same result will get a different number of clicks when displayed in different positions and (2) externalities: the same result might get more clicks when displayed with results of relatively lower quality than when shown with higher quality results. In this paper we focus on analyzing the sequence of user actions to model users' click behavior on sponsored listings shown on the search results page. We first show that temporal click sequences are good indicators of externalities in the advertising domain. We then describe the positional rationality hypothesis to explain both the position bias and the externalities, and based on this hypothesis we further propose the temporal click model (TCM), a Bayesian framework that is scalable and computationally efficient. To the best of our knowledge, this is the first attempt in the literature to estimate positional bias, externalities and unbiased user-perceived ad quality from user click logs in a combined model. We finally evaluate the proposed model on two real datasets, each containing over 100 million ad impressions obtained from a commercial search engine. The experimental results show that TCM outperforms two other competitive methods at click prediction.
Keywords: advertising, Bayesian model, click log analysis, click-through rate, externalities, sponsored search
Freshness matters: in flowers, food, and web authority BIBAKFull-Text 114-121
  Na Dai; Brian D. Davison
The collective contributions of billions of users across the globe each day result in an ever-changing web. In verticals like news and real-time search, recency is an obvious significant factor for ranking. However, traditional link-based web ranking algorithms typically run on a single web snapshot without concern for user activities associated with the dynamics of web pages and links. Therefore, a stale page popular many years ago may still achieve a high authority score due to its accumulated in-links. To remedy this situation, we propose a temporal web link-based ranking scheme, which incorporates features from historical author activities. We quantify web page freshness over time from page and in-link activity, and design a web surfer model that incorporates web freshness, based on a temporal web graph composed of multiple web snapshots at different time points. It includes authority propagation among snapshots, enabling link structures at distinct time points to influence each other when estimating web page authority. Experiments on a real-world archival web corpus show our approach improves upon PageRank in both relevance and freshness of the search results.
Keywords: pagerank, temporal link analysis, web freshness, web search engine
The importance of anchor text for ad hoc search revisited BIBAKFull-Text 122-129
  Marijn Koolen; Jaap Kamps
It is generally believed that propagated anchor text is very important for effective Web search as offered by the commercial search engines. "Google Bombs" are a notable illustration of this. However, many years of TREC Web retrieval research failed to establish the effectiveness of link evidence for ad hoc retrieval on Web collections. The ultimate resolution to this dilemma was that typical Web search is very different from the traditional ad hoc methodology. So far, however, no one has established why link information, like incoming link degree or anchor text, does not help ad hoc retrieval effectiveness. Several possible explanations were given, including the collections being too small for anchors to be effective, and the density of the link graph being too low.
   The new TREC 2009 Web Track collection is substantially larger than previous collections and has a dense link graph. Our main finding is that propagated anchor text outperforms full-text retrieval in terms of early precision, and in combination with it, gives an improvement in overall precision. We then analyse the impact of link density and collection size by down-sampling the number of links and the number of pages respectively.
   Other findings are that, contrary to expectations, (inter-server) link density has little impact on effectiveness, while the size of the collection has a substantial impact on the quantity, quality and effectiveness of anchor text. We also compare the diversity of the search results of anchor text and full-text approaches, which show that anchor text performs significantly better than full-text search and confirm our findings for the ad hoc search task.
Keywords: ad hoc, anchor text, collection size, link density
Ready to buy or just browsing?: detecting web searcher goals from interaction data BIBAKFull-Text 130-137
  Qi Guo; Eugene Agichtein
An improved understanding of the relationship between search intent, result quality, and searcher behavior is crucial for improving the effectiveness of web search. While recent progress in user behavior mining has been largely focused on aggregate server-side click logs, we present a new class of search behavior models that also exploit fine-grained user interactions with the search results.
   We show that mining these interactions, such as mouse movements and scrolling, can enable more effective detection of the user's search goals. Potential applications include automatic search evaluation, improving search ranking, result presentation, and search advertising. We describe extensive experimental evaluation over both controlled user studies, and logs of interaction data collected from hundreds of real users. The results show that our method is more effective than the current state-of-the-art techniques, both for detection of searcher goals, and for an important practical application of predicting ad clicks for a given search session.
Keywords: search advertising, search behavior modeling, user intent inference

Learning to rank

Learning to efficiently rank BIBAKFull-Text 138-145
  Lidan Wang; Jimmy Lin; Donald Metzler
It has been shown that learning to rank approaches are capable of learning highly effective ranking functions. However, these approaches have mostly ignored the important issue of efficiency. Given that both efficiency and effectiveness are important for real search engines, models that are optimized for effectiveness may not meet the strict efficiency requirements necessary to deploy in a production environment. In this work, we present a unified framework for jointly optimizing effectiveness and efficiency. We propose new metrics that capture the tradeoff between these two competing forces and devise a strategy for automatically learning models that directly optimize the tradeoff metrics. Experiments indicate that models learned in this way provide a good balance between retrieval effectiveness and efficiency. With specific loss functions, learned models converge to familiar existing ones, which demonstrates the generality of our framework. Finally, we show that our approach naturally leads to a reduction in the variance of query execution times, which is important for query load balancing and user satisfaction.
Keywords: effectiveness and efficiency tradeoff, learning to rank, linear models
Ranking for the conversion funnel BIBAKFull-Text 146-153
  Abraham Bagherjeiran; Andrew O. Hatch; Adwait Ratnaparkhi
In contextual advertising advertisers show ads to users so that they will click on them and eventually purchase a product. Optimizing this action sequence, called the conversion funnel, is the ultimate goal of advertising. Advertisers, however, often have very different sub-goals for their ads such as purchase, request for a quote, or simply a site visit. Often an improvement for one advertiser's goal comes at the expense of others. A single ranking function must balance these different goals in order to make an efficient system for all advertisers. We propose a ranking method that globally balances the goals of all advertisers, while simultaneously improving overall performance. Our method has been shown to improve significantly over the baseline in online traffic at a major ad network.
Keywords: online advertising, ranking
How good is a span of terms?: exploiting proximity to improve web retrieval BIBAKFull-Text 154-161
  Krysta M. Svore; Pallika H. Kanani; Nazan Khan
Ranking search results is a fundamental problem in information retrieval. In this paper we explore whether the use of proximity and phrase information can improve web retrieval accuracy. We build on existing research by incorporating novel ranking features based on flexible proximity terms with recent state-of-the-art machine learning ranking models. We introduce a method of determining the goodness of a set of proximity terms that takes advantage of the structured nature of web documents, document metadata, and phrasal information from search engine user query logs. We perform experiments on a large real-world Web data collection and show that using the goodness score of flexible proximity terms can improve ranking accuracy over state-of-the-art ranking methods by as much as 13%. We also show that we can improve accuracy on the hardest queries by as much as 9% relative to state-of-the-art approaches.
Keywords: BM25, learning to rank, proximity, retrieval models, web search
Learning to rank only using training data from related domain BIBAKFull-Text 162-169
  Wei Gao; Peng Cai; Kam-Fai Wong; Aoying Zhou
Like traditional supervised and semi-supervised algorithms, learning to rank for information retrieval requires document annotations provided by domain experts. It is costly to annotate training data for different search domains and tasks. We propose to exploit training data annotated for a related domain to learn to rank retrieved documents in the target domain, in which no labeled data is available. We present a simple yet effective approach based on instance-weighting scheme. Our method first estimates the importance of each related-domain document relative to the target domain. Then heuristics are studied to transform the importance of individual documents to the pairwise weights of document pairs, which can be directly incorporated into the popular ranking algorithms. Due to importance weighting, ranking model trained on related domain is highly adaptable to the data of target domain. Ranking adaptation experiments on LETOR3.0 dataset [27] demonstrate that with a fair amount of related-domain training data, our method significantly outperforms the baseline without weighting, and most of time is not significantly worse than an "ideal" model directly trained on target domain.
Keywords: domain adaptation, instance weighting, learning to rank, ranknet, ranksvm, related domain

Clustering II

Optimal meta search results clustering BIBAKFull-Text 170-177
  Claudio Carpineto; Giovanni Romano
By analogy with merging documents rankings, the outputs from multiple search results clustering algorithms can be combined into a single output. In this paper we study the feasibility of meta search results clustering, which has unique features compared to the general meta clustering problem. After showing that the combination of multiple search results clusterings is empirically justified, we cast meta clustering as an optimization problem of an objective function measuring the probabilistic concordance between the clustering combination and the single clusterings. We then show, using an easily computable upper bound on such a function, that a simple stochastic optimization algorithm delivers reasonable approximations of the optimal value very efficiently, and we also provide a method for labeling the generated clusters with the most agreed upon cluster labels. Optimal meta clustering with meta labeling is applied to three description-centric, state-of-the-art search results clustering algorithms. The performance improvement is demonstrated through a range of evaluation techniques (i.e., internal, classification-oriented, and information retrieval-oriented), using suitable test collections of search results with document-level relevance judgments per subtopic.
Keywords: meta clustering, optimization, search results clustering
Analysis of structural relationships for hierarchical cluster labeling BIBAKFull-Text 178-185
  Markus Muhr; Roman Kern; Michael Granitzer
Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties and therefore, the impact of hierarchical structures on the labeling accuracy is yet unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, Chi Square Test, and Information Gain, to take use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF IP European Patent dataset. We show, that hierarchical relationships can be exploited to increase labeling accuracy especially on high-level nodes.
Keywords: cluster labeling, statistical methods, structural information, topic hierarchies
On the existence of obstinate results in vector space models BIBAKFull-Text 186-193
  Milos Radovanović; Alexandros Nanopoulos; Mirjana Ivanović
The vector space model (VSM) is a popular and widely applied model in information retrieval (IR). VSM creates vector spaces whose dimensionality is usually high (e.g., tens of thousands of terms). This may cause various problems, such as susceptibility to noise and difficulty in capturing the underlying semantic structure, which are commonly recognized as different aspects of the "curse of dimensionality." In this paper, we investigate a novel aspect of the dimensionality curse, which is referred to as hubness and manifested by the tendency of some documents (called hubs) to be included in unexpectedly many search result lists. Hubness may impact VSM considerably since hubs can become obstinate results, irrelevant to a large number of queries, thus harming the performance of an IR system and the experience of its users. We analyze the origins of hubness, showing it is primarily a consequence of high (intrinsic) dimensionality of data, and not a result of other factors such as sparsity and skewness of the distribution of term frequencies. We describe the mechanisms through which hubness emerges by exploring the behavior of similarity measures in high-dimensional vector spaces. Our consideration begins with the classical VSM (tf-idf term weighting and cosine similarity), but the conclusions generalize to more advanced variations, such as Okapi BM25. Moreover, we explain why hubness may not be easily mitigated by dimensionality reduction, and propose a similarity adjustment scheme that takes into account the existence of hubs. Experimental results over real data indicate that significant improvement can be obtained through consideration of hubness.
Keywords: cosine similarity, curse of dimensionality, hubs, nearest neighbors, similarity concentration, text retrieval, vector space model

Filtering and recommendation

Social media recommendation based on people and tags BIBAKFull-Text 194-201
  Ido Guy; Naama Zwerdling; Inbal Ronen; David Carmel; Erel Uziel
We study personalized item recommendation within an enterprise social media application suite that includes blogs, bookmarks, communities, wikis, and shared files. Recommendations are based on two of the core elements of social media -- people and tags. Relationship information among people, tags, and items, is collected and aggregated across different sources within the enterprise. Based on these aggregated relationships, the system recommends items related to people and tags that are related to the user. Each recommended item is accompanied by an explanation that includes the people and tags that led to its recommendation, as well as their relationships with the user and the item. We evaluated our recommender system through an extensive user study. Results show a significantly better interest ratio for the tag-based recommender than for the people-based recommender, and an even better performance for a combined recommender. Tags applied on the user by other people are found to be highly effective in representing that user's topics of interest.
Keywords: collaborative tagging, personalization, recommender systems, social media, social networks, social software
A network-based model for high-dimensional information filtering BIBAKFull-Text 202-209
  Nikolaos Nanas; Manolis Vavalis; Anne De Roeck
The Vector Space Model has been and to a great extent still is the de facto choice for profile representation in content-based Information Filtering. However, user profiles represented as weighted keyword vectors have inherent dimensionality problems. As the number of profile keywords increases, the vector representation becomes ambiguous, due to the exponential increase in the volume of the vector space and in the number of possible keyword combinations. We argue that the complexity and dynamics of Information Filtering require user profile representations which are resilient and resistant to this "curse of dimensionality". A user profile has to be able to incorporate many features and to adapt to a variety of interest changes. We propose an alternative, network-based profile representation that meets these challenging requirements. Experiments show that the network profile representation can more effectively capture additional information about a user's interests and thus achieve significant performance improvements over a vector-based representation comprising the same weighted keywords.
Keywords: content-based information filtering, curse of dimensionality, user profiling
Temporal diversity in recommender systems BIBAKFull-Text 210-217
  Neal Lathia; Stephen Hailes; Licia Capra; Xavier Amatriain
Collaborative Filtering (CF) algorithms, used to build web-based recommender systems, are often evaluated in terms of how accurately they predict user ratings. However, current evaluation techniques disregard the fact that users continue to rate items over time: the temporal characteristics of the system's top-N recommendations are not investigated. In particular, there is no means of measuring the extent that the same items are being recommended to users over and over again. In this work, we show that temporal diversity is an important facet of recommender systems, by showing how CF data changes over time and performing a user survey. We then evaluate three CF algorithms from the point of view of the diversity in the sequence of recommendation lists they produce over time. We examine how a number of characteristics of user rating patterns (including profile size and time between rating) affect diversity. We then propose and evaluate set methods that maximise temporal recommendation diversity without extensively penalising accuracy.
Keywords: evaluation, recommender systems
Serendipitous recommendations via innovators BIBAKFull-Text 218-225
  Noriaki Kawamae
To realize services that provide serendipity, this paper assesses the surprise of each user when presented recommendations. We propose a recommendation algorithm that focuses on the search time that, in the absence of any recommendation, each user would need to find a desirable and novel item by himself. Following the hypothesis that the degree of user's surprise is proportional to the estimated search time, we consider both innovators' preferences and trends for identifying items with long estimated search times. To predict which items the target user is likely to purchase in the near future, the candidate items, this algorithm weights each item that innovators have purchased and that reflect one or more current trends; it then lists them in order of decreasing weight. Experiments demonstrate that this algorithm outputs recommendations that offer high user/item coverage, a low Gini coefficient, and long estimated search times, and so offers a high degree of recommendation serendipitousness.
Keywords: collaborative filtering, innovator, personalization, ranking, serendipitous recommendations, user flow

Information retrieval theory

On statistical analysis and optimization of information retrieval effectiveness metrics BIBAKFull-Text 226-233
  Jun Wang; Jianhan Zhu
This paper presents a new way of thinking for IR metric optimization. It is argued that the optimal ranking problem should be factorized into two distinct yet interrelated stages: the relevance prediction stage and ranking decision stage. During retrieval the relevance of documents is not known a priori, and the joint probability of relevance is used to measure the uncertainty of documents' relevance in the collection as a whole. The resulting optimization objective function in the latter stage is, thus, the expected value of the IR metric with respect to this probability measure of relevance. Through statistically analyzing the expected values of IR metrics under such uncertainty, we discover and explain some interesting properties of IR metrics that have not been known before. Our analysis and optimization framework do not assume a particular (relevance) retrieval model and metric, making it applicable to many existing IR models and metrics. The experiments on one of resulting applications have demonstrated its significance in adapting to various IR metrics.
Keywords: IR metrics, learning to rank, optimal ranking, optimization, ranking under uncertainty, retrieval models
Information-based models for ad hoc IR BIBAKFull-Text 234-241
  Stéphane Clinchant; Eric Gaussier
We introduce in this paper the family of information-based models for ad hoc information retrieval. These models draw their inspiration from a long-standing hypothesis in IR, namely the fact that the difference in the behaviors of a word at the document and collection levels brings information on the significance of the word for the document. This hypothesis has been exploited in the 2-Poisson mixture models, in the notion of eliteness in BM25, and more recently in DFR models. We show here that, combined with notions related to burstiness, it can lead to simpler and better models.
Keywords: burstiness, information-based models, log-logistic, power laws, pseudo-relevance feedback, retrieval constraints
Score distribution models: assumptions, intuition, and robustness to score manipulation BIBAKFull-Text 242-249
  Evangelos Kanoulas; Keshi Dai; Virgil Pavlu; Javed A. Aslam
Inferring the score distribution of relevant and non-relevant documents is an essential task for many IR applications (e.g. information filtering, recall-oriented IR, meta-search, distributed IR). Modeling score distributions in an accurate manner is the basis of any inference. Thus, numerous score distribution models have been proposed in the literature. Most of the models were proposed on the basis of empirical evidence and goodness-of-fit. In this work, we model score distributions in a rather different, systematic manner. We start with a basic assumption on the distribution of terms in a document. Following the transformations applied on term frequencies by two basic ranking functions, BM25 and Language Models, we derive the distribution of the produced scores for all documents. Then we focus on the relevant documents. We detach our analysis from particular ranking functions. Instead, we consider a model for precision-recall curves, and given this model, we present a general mathematical framework which, given any score distribution for all retrieved documents, produces an analytical formula for the score distribution of relevant documents that is consistent with the precision-recall curves that follow the aforementioned model. In particular, assuming a Gamma distribution for all retrieved documents, we show that the derived distribution for the relevant documents resembles a Gaussian distribution with a heavy right tail.
Keywords: density functions, information retrieval, recall-precision curve, score distribution


Refactoring the search problem BIBAKFull-Text 250
  Gary William Flake
The most common way of framing the search problem is as an exchange between a user and a database, where the user issues queries and the database replies with results that satisfy constraints imposed by the query but that also optimize some notion of relevance. There are several variations to this basic model that augment the dialogue between humans and machines through query refinement, relevance feedback, and other mechanism. However, rarely is this problem ever posed in a way in which the properties of the client and server are fundamentally different and in a way in which exploiting the differences can be used to yield substantially different experiences.
   I propose a reframing of the basic search problem which presupposes that servers are scalable on most dimensions but suffer from low communication latencies while clients have lower scalability but support vastly richer user interactions because of lower communication latencies. Framed in this manner, there is clear utility in refactoring the search problem so that user interactions are processed fluidly by a client while the server is relegated to pre-computing the properties of a result set that cannot be efficiently left to the client.
   I will demonstrate Pivot, an experimental client application that allows the user to visually interact with thousands of search results at once, while using facetted-based exploration in a zoomable interface. I will argue that the evolving structure of the Web will tend to push all IR-based applications in a similar direction, which has the algorithmic intelligence increasingly split between clients and servers. Put another way, my claim is that future clients will be neither thin nor dumb.
Keywords: search, visualization, web

Language models & IR theory

Geometric representations for multiple documents BIBAKFull-Text 251-258
  Jangwon Seo; W. Bruce Croft
Combining multiple documents to represent an information object is well-known as an effective approach for many Information Retrieval tasks. For example, passages can be combined to represent a document for retrieval, document clusters are represented using combinations of the documents they contain, and feedback documents can be combined to represent a query model. Various techniques for combination have been introduced, and among them, representation techniques based on concatenation and the arithmetic mean are frequently used. Some recent work has shown the potential of a new representation technique using the geometric mean. However, these studies lack a theoretical foundation explaining why the geometric mean should have advantages for representing multiple documents. In this paper, we show that the arithmetic mean and the geometric mean are approximations to the center of mass in certain geometries, and show empirically that the geometric mean is closer to the center. Through experiments with two IR tasks, we show the potential benefits for geometric representations, including a geometry-based pseudo-relevance feedback method that outperforms state-of-the-art techniques.
Keywords: geometric mean, information geometry, multiple documents
Using statistical decision theory and relevance models for query-performance prediction BIBAKFull-Text 259-266
  Anna Shtok; Oren Kurland; David Carmel
We present a novel framework for the query-performance prediction task. That is, estimating the effectiveness of a search performed in response to a query in lack of relevance judgments. Our approach is based on using statistical decision theory for estimating the utility that a document ranking provides with respect to an information need expressed by the query. To address the uncertainty in inferring the information need, we estimate utility by the expected similarity between the given ranking and those induced by relevance models; the impact of a relevance model is based on its presumed representativeness of the information need. Specific query-performance predictors instantiated from the framework substantially outperform state-of-the-art predictors over five TREC corpora.
Keywords: query-performance prediction, rank correlation, relevance models, statistical decision theory
Active learning for ranking through expected loss optimization BIBAKFull-Text 267-274
  Bo Long; Olivier Chapelle; Ya Zhang; Yi Chang; Zhaohui Zheng; Belle Tseng
Learning to rank arises in many information retrieval applications, ranging from Web search engine, online advertising to recommendation system. In learning to rank, the performance of a ranking model is strongly affected by the number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data is very expensive and time-consuming. This presents a great need for the active learning approaches to select most informative examples for ranking learning; however, in the literature there is still very limited work to address active learning for ranking. In this paper, we propose a general active learning framework, Expected Loss Optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions. Under this framework, we derive a novel algorithm, Expected DCG Loss Optimization (ELO-DCG), to select most informative examples. Furthermore, we investigate both query and document level active learning for raking and propose a two-stage ELO-DCG algorithm which incorporate both query and document selection into active learning. Extensive experiments on real-world Web search data sets have demonstrated great potential and effectiveness of the proposed framework and algorithms.
Keywords: active learning, expected loss optimization, ranking

Query representations & reformulations

Image search by concept map BIBAKFull-Text 275-282
  Hao Xu; Jingdong Wang; Xian-Sheng Hua; Shipeng Li
In this paper, we present a novel image search system, image search by concept map. This system enables users to indicate not only what semantic concepts are expected to appear but also how these concepts are spatially distributed in the desired images. To this end, we propose a new image search interface to enable users to formulate a query, called concept map, by intuitively typing textual queries in a blank canvas to indicate the desired spatial positions of the concepts. In the ranking process, by interpreting each textual concept as a set of representative visual instances, the concept map query is translated into a visual instance map, which is then used to evaluate the relevance of the image in the database. Experimental results demonstrate the effectiveness of the proposed system.
Keywords: concept map, image search
Generalized syntactic and semantic models of query reformulation BIBAKFull-Text 283-290
  Amac Herdagdelen; Massimiliano Ciaramita; Daniel Mahler; Maria Holmqvist; Keith Hall; Stefan Riezler; Enrique Alfonseca
We present a novel approach to query reformulation which combines syntactic and semantic information by means of generalized Levenshtein distance algorithms where the substitution operation costs are based on probabilistic term rewrite functions. We investigate unsupervised, compact and efficient models, and provide empirical evidence of their effectiveness. We further explore a generative model of query reformulation and supervised combination methods providing improved performance at variable computational costs.
   Among other desirable properties, our similarity measures incorporate information-theoretic interpretations of taxonomic relations such as specification and generalization.
Keywords: generalized edit distance, query reformulation, query rewriting, similarity metrics
Evaluating verbose query processing techniques BIBAKFull-Text 291-298
  Samuel Huston; W. Bruce Croft
Verbose or long queries are a small but significant part of the query stream in web search, and are common in other applications such as collaborative question answering (CQA). Current search engines perform well with keyword queries but are not, in general, effective for verbose queries. In this paper, we examine query processing techniques which can be applied to verbose queries prior to submission to a search engine in order to improve the search engine's results. We focus on verbose queries that have sentence-like structure, but are not simple "wh-" questions, and assume the search engine is a "black box." We evaluated the output of two search engines using queries from a CQA service and our results show that, among a broad range of techniques, the most effective approach is to simply reduce the length of the query. This can be achieved effectively by removing "stop structure" instead of only stop words. We show that the process of learning and removing stop structure from a query can be effectively automated.
Keywords: black box, query reformulation, verbose queries

Automatic classification

SED: supervised experimental design and its application to text classification BIBAKFull-Text 299-306
  Yi Zhen; Dit-Yan Yeung
In recent years, active learning methods based on experimental design achieve state-of-the-art performance in text classification applications. Although these methods can exploit the distribution of unlabeled data and support batch selection, they cannot make use of labeled data which often carry useful information for active learning. In this paper, we propose a novel active learning method for text classification, called supervised experimental design (SED), which seamlessly incorporates label information into experimental design. Experimental results show that SED outperforms its counterparts which either discard the label information even when it is available or fail to exploit the distribution of unlabeled data.
Keywords: active learning, convex optimization, supervised experimental design, text classification
Temporally-aware algorithms for document classification BIBAKFull-Text 307-314
  Thiago Salles; Leonardo Rocha; Gisele L. Pappa; Fernando Mourão; Wagner, Jr. Meira; Marcos Gonçalves
Automatic Document Classification (ADC) is still one of the major information retrieval problems. It usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use this model to classify unseen documents. The majority of supervised algorithms consider that all documents provide equally important information. However, in practice, a document may be considered more or less important to build the classification model according to several factors, such as its timeliness, the venue where it was published in, its authors, among others. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and how to minimize such impact. In order to deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both follows a lognormal. We then extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
Keywords: classification and clustering, text mining
Multilabel classification with meta-level features BIBAKFull-Text 315-322
  Siddharth Gopal; Yiming Yang
Effective learning in multi-label classification (MLC) requires an appropriate level of abstraction for representing the relationship between each instance and multiple categories. Current MLC methods have been focused on learning-to-map from instances to ranked lists of categories in a relatively high-dimensional space. The fine-grained features in such a space may not be sufficiently expressive for characterizing discriminative patterns, and worse, make the model complexity unnecessarily high. This paper proposes an alternative approach by transforming conventional representations of instances and categories into a relatively small set of link-based meta-level features, and leveraging successful learning-to-rank retrieval algorithms (e.g., SVM-MAP) over this reduced feature space. Controlled experiments on multiple benchmark datasets show strong empirical evidence for the strength of the proposed approach, as it significantly outperformed several state-of-the-art methods, including Rank-SVM, ML-kNN and IBLR-ML (Instance-based Logistic Regression for Multi-label Classification) in most cases.
Keywords: comparative evaluation, learning to rank, model design, multi-label classification

Retrieval models and ranking

Estimation of statistical translation models based on mutual information for ad hoc information retrieval BIBAKFull-Text 323-330
  Maryam Karimzadehgan; ChengXiang Zhai
As a principled approach to capturing semantic relations of words in information retrieval, statistical translation models have been shown to outperform simple document language models which rely on exact matching of words in the query and documents. A main challenge in applying translation models to ad hoc information retrieval is to estimate a translation model without training data. Existing work has relied on training on synthetic queries generated based on a document collection. However, this method is computationally expensive and does not have a good coverage of query words. In this paper, we propose an alternative way to estimate a translation model based on normalized mutual information between words, which is less computationally expensive and has better coverage of query words than the synthetic query method of estimation. We also propose to regularize estimated translation probabilities to ensure sufficient probability mass for self-translation. Experiment results show that the proposed mutual information-based estimation method is not only more efficient, but also more effective than the synthetic query-based method, and it can be combined with pseudo-relevance feedback to further improve retrieval accuracy. The results also show that the proposed regularization strategy is effective and can improve retrieval accuracy for both synthetic query-based estimation and mutual information-based estimation.
Keywords: estimation, language models, smoothing, statistical machine translation
DivQ: diversification for keyword search over structured databases BIBAKFull-Text 331-338
  Elena Demidova; Peter Fankhauser; Xuan Zhou; Wolfgang Nejdl
Keyword queries over structured databases are notoriously ambiguous. No single interpretation of a keyword query can satisfy all users, and multiple interpretations may yield overlapping results. This paper proposes a scheme to balance the relevance and novelty of keyword search results over structured databases. Firstly, we present a probabilistic model which effectively ranks the possible interpretations of a keyword query over structured data. Then, we introduce a scheme to diversify the search results by re-ranking query interpretations, taking into account redundancy of query results. Finally, we propose α-nDCG-W and WS-recall, an adaptation of α-nDCG and S-recall metrics, taking into account graded relevance of subtopics. Our evaluation on two real-world datasets demonstrates that search results obtained using the proposed diversification algorithms better characterize possible answers available in the database than the results of the initial relevance ranking.
Keywords: diversity, query intent, ranking in databases
Finding support sentences for entities BIBAKFull-Text 339-346
  Roi Blanco; Hugo Zaragoza
We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately.
Keywords: entity ranking, sentence retrieval
Estimating probabilities for effective data fusion BIBAKFull-Text 347-354
  David Lillis; Lusheng Zhang; Fergus Toolan; Rem W. Collier; David Leonard; John Dunnion
Data Fusion is the combination of a number of independent search results, relating to the same document collection, into a single result to be presented to the user. A number of probabilistic data fusion models have been shown to be effective in empirical studies. These typically attempt to estimate the probability that particular documents will be relevant, based on training data. However, little attempt has been made to gauge how the accuracy of these estimations affect fusion performance. The focus of this paper is twofold: firstly, that accurate estimation of the probability of relevance results in effective data fusion; and secondly, that an effective approximation of this probability can be made based on less training data that has previously been employed. This is based on the observation that the distribution of relevant documents follows a similar pattern in most high-quality result sets. Curve fitting suggests that this can be modelled by a simple function that is less complex than other models that have been proposed. The use of existing IR evaluation metrics is proposed as a substitution for probability calculations. Mean Average Precision is used to demonstrate the effectiveness of this approach, with evaluation results demonstrating competitive performance when compared with related algorithms with more onerous requirements for training data.
Keywords: information retrieval, probabilistic data fusion, results merging

User feedback & user models

Incorporating post-click behaviors into a click model BIBAKFull-Text 355-362
  Feimin Zhong; Dong Wang; Gang Wang; Weizhu Chen; Yuchen Zhang; Zheng Chen; Haixun Wang
Much work has attempted to model a user's click-through behavior by mining the click logs. The task is not trivial due to the well-known position bias problem. Some break-throughs have been made: two newly proposed click models, DBN and CCM, addressed this problem and improved document relevance estimation. However, to further improve the estimation, we need a model that can capture more sophisticated user behaviors. In particular, after clicking a search result, a user's behavior (such as the dwell time on the clicked document, and whether there are further clicks on the clicked document) can be highly indicative of the relevance of the document. Unfortunately, such measures have not been incorporated in previous click models. In this paper, we introduce a novel click model, called the post-click click model (PCC), which provides an unbiased estimation of document relevance through leveraging both click behaviors on the search page and post-click behaviors beyond the search page. The PCC model is based on the Bayesian approach, and because of its incremental nature, it is highly scalable to large scale and constantly growing log data. Extensive experimental results illustrate that the proposed method significantly outperforms the state of the art methods merely relying on click logs.
Keywords: Bayesian model, click log analysis, post-click behavior
Interactive retrieval based on faceted feedback BIBAKFull-Text 363-370
  Lanbo Zhang; Yi Zhang
Motivated by the commonly used faceted search interface in e-commerce, this paper investigates interactive relevance feedback mechanism based on faceted document metadata. In this mechanism, the system recommends a group of document facet-value pairs, and lets users select relevant ones to restrict the returned documents. We propose four facet-value pair recommendation approaches and two retrieval models that incorporate user feedback on document facets. Evaluated based on user feedback collected through Amazon Mechanical Turk, our experimental results show that the Boolean filtering approach, which is widely used in faceted search in e-commerce, doesn't work well for text document retrieval, due to the incompleteness (low recall) of metadata assignment in semi-structured text documents. Instead, a soft model performs more effectively. The faceted feedback mechanism can also be combined with document-based relevance feedback and pseudo relevance feedback to further improve the retrieval performance.
Keywords: faceted feedback, interactive retrieval, metadata-based retrieval, relevance feedback
A comparison of general vs personalised affective models for the prediction of topical relevance BIBAKFull-Text 371-378
  Ioannis Arapakis; Konstantinos Athanasakos; Joemon M. Jose
Information retrieval systems face a number of challenges, originating mainly from the semantic gap problem. Implicit feedback techniques have been employed in the past to address many of these issues. Although this was a step towards the right direction, a need to personalise and tailor the search experience to the user-specific needs has become evident. In this study we examine ways of personalising affective models trained on facial expression data. Using personalised data we adapt these models to individual users and compare their performance to a general model. The main goal is to determine whether the behavioural differences of users have an impact on the models' ability to determine topical relevance and if, by personalising them, we can improve their accuracy. For modelling relevance we extract a set of features from the facial expression data and classify them using Support Vector Machines. Our initial evaluation indicates that accounting for individual differences and applying personalisation introduces, in most cases, a noticeable improvement in the models' performance.
Keywords: affective feedback, classification, facial expression analysis, personalisation, support vector machines
Understanding web browsing behaviors through Weibull analysis of dwell time BIBAKFull-Text 379-386
  Chao Liu; Ryen W. White; Susan Dumais
Dwell time on Web pages has been extensively used for various information retrieval tasks. However, some basic yet important questions have not been sufficiently addressed, eg, what distribution is appropriate to model the distribution of dwell times on a Web page, and furthermore, what the distribution tells us about the underlying browsing behaviors. In this paper, we draw an analogy between abandoning a page during Web browsing and a system failure in reliability analysis, and propose to model the dwell time using the Weibull distribution. Using this distribution provides better goodness-of-fit to real world data, and it uncovers some interesting patterns of user browsing behaviors not previously reported. For example, our analysis reveals that Web browsing in general exhibits a significant "negative aging" phenomenon, which means that some initial screening has to be passed before a page is examined in detail, giving rise to the browsing behavior that we call "screen-and-glean." In addition, we demonstrate that dwell time distributions can be reasonably predicted purely based on low-level page features, which broadens the possible applications of this study to situations where log data may be unavailable.
Keywords: Weibull analysis, dwell time, user behaviors, web browsing

Web IR and social media search

Segmentation of multi-sentence questions: towards effective question retrieval in cQA services BIBAKFull-Text 387-394
  Kai Wang; Zhao-Yan Ming; Xia Hu; Tat-Seng Chua
Existing question retrieval models work relatively well in finding similar questions in community-based question answering (cQA) services. However, they are designed for single-sentence queries or bag-of-word representations, and are not sufficient to handle multi-sentence questions complemented with various contexts. Segmenting questions into parts that are topically related could assist the retrieval system to not only better understand the user's different information needs but also fetch the most appropriate fragments of questions and answers in cQA archive that are relevant to user's query. In this paper, we propose a graph based approach to segmenting multi-sentence questions. The results from user studies show that our segmentation model outperforms traditional systems in question segmentation by over 30% in user's satisfaction. We incorporate the segmentation model into existing cQA question retrieval framework for more targeted question matching, and the empirical evaluation results demonstrate that the segmentation boosts the question retrieval performance by up to 12.93% in Mean Average Precision and 11.72% in Top One Precision. Our model comes with a comprehensive question detector equipped with both lexical and syntactic features.
Keywords: Yahoo! answers, question answering, question matching, question segmentation
Mining the blogosphere for top news stories identification BIBAKFull-Text 395-402
  Yeha Lee; Hun-young Jung; Woosang Song; Jong-Hyeok Lee
The analysis of query logs from blog search engines show that news-related queries occupy a significant portion of the logs. This raises a interesting research question on whether the blogosphere can be used to identify important news stories. In this paper, we present novel approaches to identify important news story headlines from the blogosphere for a given day. The proposed system consists of two components based on the language model framework, the query likelihood and the news headline prior. For the query likelihood, we propose several approaches to estimate the query language model and the news headline language model. We also suggest several criteria to evaluate the news headline prior that is the prior belief about the importance or newsworthiness of the news headline for a given day. Experimental results show that our system significantly outperforms a baseline system. Specifically, the proposed approach gives 2.62% and 10.19% further increases in MAP and P@5 over the best performing result of the TREC'09 Top Stories Identification Task.
Keywords: blog retrieval, blogosphere, top news stories identification
Proximity-based opinion retrieval BIBAKFull-Text 403-410
  Shima Gerani; Mark James Carman; Fabio Crestani
Blog post opinion retrieval aims at finding blog posts that are relevant and opinionated about a user's query. In this paper we propose a simple probabilistic model for assigning relevant opinion scores to documents. The key problem is how to capture opinion expressions in the document, that are related to the query topic. Current solutions enrich general opinion lexicons by finding query-specific opinion lexicons using pseudo-relevance feedback on external corpora or the collection itself. In this paper we use a general opinion lexicon and propose using proximity information in order to capture opinion term relatedness to the query. We propose a proximity-based opinion propagation method to calculate the opinion density at each point in a document. The opinion density at the position of a query term in the document can then be considered as the probability of opinion about the query term at that position. The effect of different kernels for capturing the proximity is also discussed. Experimental results on the BLOG06 dataset show that the proposed method provides significant improvement over standard TREC baselines and achieves a 2.5% increase in MAP over the best performing run in the TREC 2008 blog track.
Keywords: blog, opinion, proximity, retrieval, sentiment
Evaluating and predicting answer quality in community QA BIBAKFull-Text 411-418
  Chirag Shah; Jefferey Pomerantz
Question answering (QA) helps one go beyond traditional keywords-based querying and retrieve information in more precise form than given by a document or a list of documents. Several community-based QA (CQA) services have emerged allowing information seekers pose their information need as questions and receive answers from their fellow users. A question may receive multiple answers from multiple users and the asker or the community can choose the best answer. While the asker can thus indicate if he was satisfied with the information he received, there is no clear way of evaluating the quality of that information. We present a study to evaluate and predict the quality of an answer in a CQA setting. We chose Yahoo! Answers as such CQA service and selected a small set of questions, each with at least five answers. We asked Amazon Mechanical Turk workers to rate the quality of each answer for a given question based on 13 different criteria. Each answer was rated by five different workers. We then matched their assessments with the actual asker's rating of a given answer. We show that the quality criteria we used faithfully match with asker's perception of a quality answer. We furthered our investigation by extracting various features from questions, answers, and the users who posted them, and training a number of classifiers to select the best answer using those features. We demonstrate a high predictability of our trained models along with the relative merits of each of the features for such prediction. These models support our argument that in case of CQA, contextual information such as a user's profile, can be critical in evaluating and predicting content quality.
Keywords: answer quality evaluation and prediction, community question answering

Document structure & adversarial information retrieval

Adaptive near-duplicate detection via similarity learning BIBAKFull-Text 419-426
  Hannaneh Hajishirzi; Wen-tau Yih; Aleksander Kolcz
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash-values as document signatures through the locality sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than the commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, which is a desired property lacked by existing methods.
Keywords: near-duplicate detection, similarity learning, spam detection
A content based approach for discovering missing anchor text for web search BIBAKFull-Text 427-434
  Xing Yi; James Allan
Although anchor text provides very useful information for web search, a large portion of web pages have few or no incoming hyperlinks (anchors), which is known as the anchor text sparsity problem. In this paper, we propose a language modeling based technique for overcoming anchor text sparsity by discovering a web page's plausible missing anchor text from its similar web pages' in-link anchor text. We design experiments with two publicly available TREC web corpora (GOV2 and ClueWeb09) to evaluate different approaches for discovering missing anchor text. Experimental results show that our approach can effectively discover plausible missing anchor terms. We then use the web named page finding task in the TREC Terabyte track to explore the utility of missing anchor text information discovered by our approach for helping retrieval. Experimental results show that our approach can statistically significantly improve retrieval performance, compared with several approaches that only use anchor text aggregated over the web graph.
Keywords: anchor text, anchor text sparsity, content similarity, language models, relevance models, web search
Uncovering social spammers: social honeypots + machine learning BIBAKFull-Text 435-442
  Kyumin Lee; James Caverlee; Steve Webb
Web-based social systems enable new community-based opportunities for participants to engage, share, and interact. This community value and related services like search and advertising are threatened by spammers, content polluters, and malware disseminators. In an effort to preserve community value and ensure longterm success, we propose and evaluate a honeypot-based approach for uncovering social spammers in online social systems. Two of the key components of the proposed approach are: (1) The deployment of social honeypots for harvesting deceptive spam profiles from social networking communities; and (2) Statistical analysis of the properties of these spam profiles for creating spam classifiers to actively filter out existing and new spammers. We describe the conceptual framework and design considerations of the proposed approach, and we present concrete observations from the deployment of social honeypots in MySpace and Twitter. We find that the deployed social honeypots identify social spammers with low false positive rates and that the harvested spam data contains signals that are strongly correlated with observable profile features (e.g., content, friend information, posting patterns, etc.). Based on these profile features, we develop machine learning based classifiers for identifying previously unknown spammers with high precision and a low rate of false positives.
Keywords: social honeypots, social media, spam

Users and interactive IR

Studying trailfinding algorithms for enhanced web search BIBAKFull-Text 443-450
  Adish Singla; Ryen White; Jeff Huang
Search engines return ranked lists of Web pages in response to queries. These pages are starting points for post-query navigation, but may be insufficient for search tasks involving multiple steps. Search trails mined from toolbar logs start with a query and contain pages visited by one user during post-query navigation. Implicit endorsements from many trails can enhance result ranking. Rather than using trails solely to improve ranking, it may also be worth providing trail information directly to users. In this paper, we quantify the benefit that users currently obtain from trail-following and compare different methods for finding the best trail for a given query and each top-ranked result. We compare the relevance, topic coverage, topic diversity, and utility of trails selected using different methods, and break out findings by factors such as query type and origin relevance. Our findings demonstrate value in trails, highlight interesting differences in the performance of trailfinding algorithms, and show we can find best-trails for a query that outperform the trails most users follow. Findings have implications for enhancing Web information seeking using trails.
Keywords: best-trail selection, search trails, trailfinding
Context-aware ranking in web search BIBAKFull-Text 451-458
  Biao Xiang; Daxin Jiang; Jian Pei; Xiaohui Sun; Enhong Chen; Hang Li
The context of a search query often provides a search engine meaningful hints for answering the current query better. Previous studies on context-aware search were either focused on the development of context models or limited to a relatively small scale investigation under a controlled laboratory setting. Particularly, about context-aware ranking for Web search, the following two critical problems are largely remained unsolved. First, how can we take advantage of different types of contexts in ranking? Second, how can we integrate context information into a ranking model? In this paper, we tackle the above two essential problems analytically and empirically. We develop different ranking principles for different types of contexts. Moreover, we adopt a learning-to-rank approach and integrate the ranking principles into a state-of-the-art ranking model by encoding the context information as features of the model. We empirically test our approach using a large search log data set obtained from a major commercial search engine. Our evaluation uses both human judgments and implicit user click data. The experimental results clearly show that our context-aware ranking approach improves the ranking of a commercial search engine which ignores context information. Furthermore, our method outperforms a baseline method which considers context information in ranking.
Keywords: context-aware ranking, learning-to-rank application
Collecting high quality overlapping labels at low cost BIBAKFull-Text 459-466
  Hui Yang; Anton Mityagin; Krysta M. Svore; Sergey Markov
This paper studies quality of human labels used to train search engines' rankers. Our specific focus is performance improvements obtained by using overlapping relevance labels, which is by collecting multiple human judgments for each training sample. The paper explores whether, when, and for which samples one should obtain overlapping training labels, as well as how many labels per sample are needed. The proposed selective labeling scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge. Our experiments show that this labeling scheme improves the NDCG of two Web search rankers on several real-world test sets, with a low labeling overhead of around 1.4 labels per sample. This labeling scheme also outperforms several methods of using overlapping labels, such as simple k-overlap, majority vote, the highest labels, etc. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in retrieval accuracy.
Keywords: learning to rank, overlapping labels, relevance labels

Document representation and content analysis

Multi-style language model for web scale information retrieval BIBAKFull-Text 467-474
  Kuansan Wang; Xiaolong Li; Jianfeng Gao
Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise as all the text streams are not always present for the documents, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an 'open-vocabulary' smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tunings. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system.
Keywords: information retrieval, mixture language models, parameter estimation, probabilistic relevance model, smoothing
Combining coregularization and consensus-based self-training for multilingual text categorization BIBAKFull-Text 475-482
  Massih R. Amini; Cyril Goutte; Nicolas Usunier
We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit from unlabeled documents. We rely on two techniques, coregularization and consensus-based self-training, that combine multiview and semi-supervised learning in different ways. Our approach trains different monolingual classifiers on each of the views, such that the classifiers' decisions over a set of unlabeled examples are in agreement as much as possible, and iteratively labels new examples from another unlabeled training set based on a consensus across language-specific classifiers. We derive a boosting-based training algorithm for this task, and analyze the impact of the number of views on the semi-supervised learning results on a multilingual extension of the Reuters RCV1/RCV2 corpus using five different languages. Our experiments show that coregularization and consensus-based self-training are complementary and that their combination is especially effective in the interesting and very common situation where there are few views (languages) and few labeled documents available.
Keywords: learning from multiple views, multilingual document classification, semi-supervised learning
Towards subjectifying text clustering BIBAKFull-Text 483-490
  Sajib Dasgupta; Vincent Ng
Although it is common practice to produce only a single clustering of a dataset, in many cases text documents can be clustered along different dimensions. Unfortunately, not only do traditional text clustering algorithms fail to produce multiple clusterings of a dataset, the only clustering they produce may not be the one that the user desires. In this paper, we propose a simple active clustering algorithm that is capable of producing multiple clusterings of the same data according to user interest. In comparison to previous work on feedback-oriented clustering, the amount of user feedback required by our algorithm is minimal. In fact, the feedback turns out to be as simple as a cursory look at a list of words. Experimental results are very promising: our system is able to generate clusterings along the user-specified dimensions with reasonable accuracies on several challenging text classification tasks, thus providing suggestive evidence that our approach is viable.
Keywords: active clustering, disparate clusterings, interactive clustering, multiple clusterings, spectral clustering

Summarization & user feedback

EUSUM: extracting easy-to-understand English summaries for non-native readers BIBAKFull-Text 491-498
  Xiaojun Wan; Huiying Li; Jianguo Xiao
In this paper we investigate a novel and important problem in multi-document summarization, i.e., how to extract an easy-to-understand English summary for non-native readers. Existing summarization systems extract the same kind of English summaries from English news documents for both native and non-native readers. However, the non-native readers have different English reading skills because they have different English education and learning backgrounds. An English summary which can be easily understood by native readers may be hardly understood by non-native readers. We propose to add the dimension of reading easiness or difficulty to multi-document summarization, and the proposed EUSUM system can produce easy-to-understand summaries according to the English reading skills of the readers. The sentence-level reading easiness (or difficulty) is predicted by using the SVM regression method. And the reading easiness score of each sentence is then incorporated into the summarization process. Empirical evaluation and user study have been performed and the results demonstrate that the EUSUM system can produce more easy-to-understand summaries for non-native readers than existing summarization systems, with very little sacrifice of the summary's informativeness.
Keywords: EUSUM, multi-document summarization, reading easiness
Visual summarization of web pages BIBAKFull-Text 499-506
  Binxing Jiao; Linjun Yang; Jizheng Xu; Feng Wu
Visual summarization is a attractive new scheme to summarize web pages, which can help achieve a more friendly user experience in search and re-finding tasks by allowing users quickly get the idea of what the web page is about and helping users recall the visited web page. In this paper, we perform a careful study on the recently proposed visual summarization approaches, including the thumbnail of the web page snapshot, the internal image in the web page which is representative of the content in the page, and the visual snippet which is a synthesized image based on the internal image, the title, and the logo found in the web page. Moreover, since the internal image based summarization approach hardly works when the representative internal images are unavailable, we propose a new strategy, which retrieves the representative image from the external to summarize the web page. The experimental results suggest that the various summarization approaches have respective advantages on different types of web pages. While internal images and thumbnails can provide a reliable summarization on web pages with dominant images and web pages with simple structure respectively, the external images are regarded as a useful information to complement the internal images and are demonstrated very useful in helping users understanding new web pages. The visual snippet performs well on the re-finding tasks since it incorporates the title and logo which are advantageous on identifying the visited web pages.
Keywords: visual summarization, web page summarization
Learning more powerful test statistics for click-based retrieval evaluation BIBAKFull-Text 507-514
  Yisong Yue; Yue Gao; Oliver Chapelle; Ya Zhang; Thorsten Joachims
Interleaving experiments are an attractive methodology for evaluating retrieval functions through implicit feedback. Designed as a blind and unbiased test for eliciting a preference between two retrieval functions, an interleaved ranking of the results of two retrieval functions is presented to the users. It is then observed whether the users click more on results from one retrieval function or the other. While it was shown that such interleaving experiments reliably identify the better of the two retrieval functions, the naive approach of counting all clicks equally leads to a suboptimal test. We present new methods for learning how to score different types of clicks so that the resulting test statistic optimizes the statistical power of the experiment. This can lead to substantial savings in the amount of data required for reaching a target confidence level. Our methods are evaluated on an operational search engine over a collection of scientific articles.
Keywords: click-through data, implicit feedback, retrieval evaluation

Query log analysis

Query similarity by projecting the query-flow graph BIBAKFull-Text 515-522
  Ilaria Bordino; Carlos Castillo; Debora Donato; Aristides Gionis
Defining a measure of similarity between queries is an interesting and difficult problem. A reliable query-similarity measure can be used in a variety of applications such as query recommendation, query expansion, and advertising.
   In this paper, we exploit the information present in query logs in order to develop a measure of semantic similarity between queries. Our approach relies on the concept of the query-flow graph. The query-flow graph aggregates query reformulations from many users: nodes in the graph represent queries, and two queries are connected if they are likely to appear as part of the same search goal. Our query similarity measure is obtained by projecting the graph (or appropriate subgraphs of it) on a low-dimensional Euclidean space. Our experiments show that the measure we obtain captures a notion of semantic similarity between queries and it is useful for diversifying query recommendations.
Keywords: query reformulations, query similarity, spectral projections
The demographics of web search BIBAKFull-Text 523-530
  Ingmar Weber; Carlos Castillo
How does the web search behavior of "rich" and "poor" people differ? Do men and women tend to click on different results for the same query? What are some queries almost exclusively issued by African Americans? These are some of the questions we address in this study.
   Our research combines three data sources: the query log of a major US-based web search engine, profile information provided by 28 million of its users (birth year, gender and ZIP code), and US-census information including detailed demographic information aggregated at the level of ZIP code. Through this combination we can annotate each query with, e.g. the average per-capita income in the ZIP code it originated from. Though conceptually simple, this combination immediately creates a powerful user modeling tool.
   The main contributions of this work are the following. First, we provide a demographic description of a large sample of search engine users in the US and show that it agrees well with the distribution of the US population. Second, we describe how different segments of the population differ in their search behavior, e.g. with respect to the queries they formulate or the URLs they click. Third, we explore applications of our methodology to improve web search relevance and to provide better query suggestions.
   These results enable a wide range of applications including improving web search and advertising where, for instance, targeted advertisements for "family vacations" could be adapted to the (expected) income.
Keywords: demographic factors, web search
A user behavior model for average precision and its generalization to graded judgments BIBAKFull-Text 531-538
  Georges Dupret; Benjamin Piwowarski
We explore a set of hypothesis on user behavior that are potentially at the origin of the (Mean) Average Precision (AP) metric. This allows us to propose a more realistic version of AP where users click non-deterministically on relevant documents and where the number of relevant documents in the collection needs not be known in advance. We then depart from the assumption that a document is either relevant or irrelevant and we use instead relevance judgment similar to editorial labels used for Discounted Cumulated Gain (DCG). We assume that clicked documents provide users with a certain level of "utility" and that a user ends a search when she gathered enough utility. Based on the query logs of a commercial search engine we show how to evaluate the utility associated with a label from the record of past user interactions with the search engine and we show how the two different user models can be evaluated based on their ability to predict accurately future clicks. Finally, based on these user models, we propose a measure that captures the relative quality of two rankings.
Keywords: metrics, search engines, statistical model, user behavior, user model


The effect of assessor error on IR system evaluation BIBAKFull-Text 539-546
  Ben Carterette; Ian Soboroff
Recent efforts in test collection building have focused on scaling back the number of necessary relevance judgments and then scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow, and researchers look to crowdsourcing and Amazon's Mechanical Turk to collect relevance judgments, we are faced with issues of quality control. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while averages are robust, assessor errors can have a large effect on system rankings.
Keywords: assessor error, retrieval test collections
Reusable test collections through experimental design BIBAKFull-Text 547-554
  Ben Carterette; Evangelos Kanoulas; Virgil Pavlu; Hui Fang
Portable, reusable test collections are a vital part of research and development in information retrieval. Reusability is difficult to assess, however. The standard approach -- simulating judgment collection when groups of systems are held out, then evaluating those held-out systems -- only works when there is a large set of relevance judgments to draw on during the simulation. As test collections adapt to larger and larger corpora, it becomes less and less likely that there will be sufficient judgments for such simulation experiments. Thus we propose a methodology for information retrieval experimentation that collects evidence for or against the reusability of a test collection while judgments are being made. Using this methodology along with the appropriate statistical analyses, researchers will be able to estimate the reusability of their test collections while building them and implement "course corrections" if the collection does not seem to be achieving desired levels of reusability. We show the robustness of our design to inherent sources of variance, and provide a description of an actual implementation of the framework for creating a large test collection.
Keywords: evaluation, information retrieval, reusability, test collections
Do user preferences and evaluation measures line up? BIBAKFull-Text 555-562
  Mark Sanderson; Monica Lestari Paramita; Paul Clough; Evangelos Kanoulas
This paper presents results comparing user preference for search engine rankings with measures of effectiveness computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation is established for both "conventional web retrieval" and for retrieval that emphasizes diverse results. The nDCG measure is found to correlate best with user preferences compared to a selection of other well known measures. Unlike previous studies in this area, this examination involved a large population of users, gathered through crowd sourcing, exposed to a wide range of retrieval systems, test collections and search tasks. Reasons for user preferences were also gathered and analyzed. The work revealed a number of new results, but also showed that there is much scope for future work refining effectiveness measures to better capture user preferences.
Keywords: evaluation measures, mechanical turk, user experiment

Query analysis

Estimating advertisability of tail queries for sponsored search BIBAKFull-Text 563-570
  Sandeep Pandey; Kunal Punera; Marcus Fontoura; Vanja Josifovski
Sponsored search is one of the major sources of revenue for search engines on the World Wide Web. It has been observed that while showing ads for every query maximizes short-term revenue, irrelevant ads lead to poor user experience and less revenue in the long-term. Hence, it is in search engines' interest to place ads only for queries that are likely to attract ad-clicks. Many algorithms for estimating query advertisability exist in literature, but most of these methods have been proposed for and tested on the frequent or "head" queries. Since query frequencies on search engine are known to be distributed as a power-law, this leaves a huge fraction of the queries uncovered.
   In this paper we focus on the more challenging problem of estimating query advertisability for infrequent or "tail" queries. These require fundamentally different methods than head queries: for e.g., tail queries are almost all unique and require the estimation method to be online and inexpensive. We show that previously proposed methods do not apply to tail queries, and when modified for our scenario they do not work well. Further, we give a simple, yet effective, approach, which estimates query advertisability using only the words present in the queries. We evaluate our approach on a real-world dataset consisting of search engine queries and user clicks. Our results show that our simple approach outperforms a more complex one based on regularized regression.
Keywords: click estimation, sponsored search, tail queries
Exploring reductions for long web queries BIBAKFull-Text 571-578
  Niranjan Balasubramanian; Giridhar Kumaran; Vitor R. Carvalho
Long queries form a difficult, but increasingly important segment for web search engines. Query reduction, a technique for dropping unnecessary query terms from long queries, improves performance of ad-hoc retrieval on TREC collections. Also, it has great potential for improving long web queries (up to 25% improvement in NDCG@5). However, query reduction on the web is hampered by the lack of accurate query performance predictors and the constraints imposed by search engine architectures and ranking algorithms.
   In this paper, we present query reduction techniques for long web queries that leverage effective and efficient query performance predictors. We propose three learning formulations that combine these predictors to perform automatic query reduction. These formulations enable trading of average improvements for the number of queries impacted, and enable easy integration into the search engine's architecture for rank-time query reduction. Experiments on a large collection of long queries issued to a commercial search engine show that the proposed techniques significantly outperform baselines, with more than 12% improvement in NDCG@5 in the impacted set of queries. Extension to the formulations such as result interleaving further improves results. We find that the proposed techniques deliver consistent retrieval gains where it matters most: poorly performing long web queries.
Keywords: combining searches, learning to rank, query reformulation
Positional relevance model for pseudo-relevance feedback BIBAKFull-Text 579-586
  Yuanhua Lv; ChengXiang Zhai
Pseudo-relevance feedback is an effective technique for improving retrieval results. Traditional feedback algorithms use a whole feedback document as a unit to extract words for query expansion, which is not optimal as a document may cover several different topics and thus contain much irrelevant information. In this paper, we study how to effectively select from feedback documents those words that are focused on the query topic based on positions of terms in feedback documents. We propose a positional relevance model (PRM) to address this problem in a unified probabilistic way. The proposed PRM is an extension of the relevance model to exploit term positions and proximity so as to assign more weights to words closer to query words based on the intuition that words closer to query words are more likely to be related to the query topic. We develop two methods to estimate PRM based on different sampling processes. Experiment results on two large retrieval datasets show that the proposed PRM is effective and robust for pseudo-relevance feedback, significantly outperforming the relevance model in both document-based feedback and passage-based feedback.
Keywords: passage-based feedback, positional language model, positional relevance model, proximity, pseudo relevance feedback, query expansion

Effectiveness measures

Assessing the scenic route: measuring the value of search trails in web logs BIBAKFull-Text 587-594
  Ryen W. White; Jeff Huang
Search trails mined from browser or toolbar logs comprise queries and the post-query pages that users visit. Implicit endorsements from many trails can be useful for search result ranking, where the presence of a page on a trail increases its query relevance. Following a search trail requires user effort, yet little is known about the benefit that users obtain from this activity versus, say, sticking with the clicked search result or jumping directly to the destination page at the end of the trail. In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, including trail recommendation systems that display trails on search result pages.
Keywords: log analysis, search trails, trail following
Human performance and retrieval precision revisited BIBAKFull-Text 595-602
  Mark D. Smucker; Chandra Prakash Jethani
Several studies have found that the Cranfield approach to evaluation can report significant performance differences between retrieval systems for which little to no performance difference is found for humans completing tasks with these systems. We revisit the relationship between precision and performance by measuring human performance on tightly controlled search tasks and with user interfaces offering limited interaction. We find that human performance and retrieval precision are strongly related. We also find that users change their relevance judging behavior based on the precision of the results. This change in behavior coupled with the well-known lack of perfect inter-assessor agreement can reduce the measured performance gains predicted by increased precision.
Keywords: Cranfield, evaluation metrics, human performance, interaction, precision, user studies
Extending average precision to graded relevance judgments BIBAKFull-Text 603-610
  Stephen E. Robertson; Evangelos Kanoulas; Emine Yilmaz
Evaluation metrics play a critical role both in the context of comparative evaluation of the performance of retrieval systems and in the context of learning-to-rank (LTR) as objective functions to be optimized. Many different evaluation metrics have been proposed in the IR literature, with average precision (AP) being the dominant one due a number of desirable properties it possesses. However, most of these measures, including average precision, do not incorporate graded relevance.
   In this work, we propose a new measure of retrieval effectiveness, the Graded Average Precision (GAP). GAP generalizes average precision to the case of multi-graded relevance and inherits all the desirable characteristics of AP: it has a nice probabilistic interpretation, it approximates the area under a graded precision-recall curve and it can be justified in terms of a simple but moderately plausible user model. We then evaluate GAP in terms of its informativeness and discriminative power. Finally, we show that GAP can reliably be used as an objective metric in learning to rank by illustrating that optimizing for GAP using SoftRank and LambdaRank leads to better performing ranking functions than the ones constructed by algorithms tuned to optimize for AP or NDCG even when using AP or NDCG as the test metrics.
Keywords: average precision, effectiveness metrics, graded relevance, information retrieval, learning to rank
PRES: a score metric for evaluating recall-oriented information retrieval applications BIBAKFull-Text 611-618
  Walid Magdy; Gareth J. F. Jones
Information retrieval (IR) evaluation scores are generally designed to measure the effectiveness with which relevant documents are identified and retrieved. Many scores have been proposed for this purpose over the years. These have primarily focused on aspects of precision and recall, and while these are often discussed with equal importance, in practice most attention has been given to precision focused metrics. Even for recall-oriented IR tasks of growing importance, such as patent retrieval, these precision based scores remain the primary evaluation measures. Our study examines different evaluation measures for a recall-oriented patent retrieval task and demonstrates the limitations of the current scores in comparing different IR systems for this task. We introduce PRES, a novel evaluation metric for this type of application taking account of recall and the user's search effort. The behaviour of PRES is demonstrated on 48 runs from the CLEF-IP 2009 patent retrieval track. A full analysis of the performance of PRES shows its suitability for measuring the retrieval effectiveness of systems from a recall focused perspective taking into account the user's expected search effort.
Keywords: PRES, evaluation metric, patent retrieval, recall-oriented information retrieval

Multimedia information retrieval

Content-enriched classifier for web video classification BIBAKFull-Text 619-626
  Bin Cui; Ce Zhang; Gao Cong
With the explosive growth of online videos, automatic real-time categorization of Web videos plays a key role for organizing, browsing and retrieving the huge amount of videos on the Web. Previous work shows that, in addition to text features, content features of videos are also useful for Web video classification. Unfortunately, extracting content features is computationally prohibitive for real-time video classification. In this paper we propose a novel video classification framework that is able to exploit both content and text features for video classification while avoiding the expensive computation of extracting content features at classification time. The main idea of our approach is to utilize the content features extracted from training data to enrich the text based semantic kernels, yielding content-enriched semantic kernels. The content-enriched semantic kernels enable to utilize both content and text features for classifying new videos without extracting their content features. The experimental results show that our approach significantly outperforms the state-of-the-art video classification methods.
Keywords: classification, content, text, video, web
Robust audio identification for MP3 popular music BIBAKFull-Text 627-634
  Wei Li; Yaduo Liu; Xiangyang Xue
Audio identification via fingerprint has been an active research field with wide applications for years. Many technical papers were published and commercial software systems were also employed. However, most of these previously reported methods work on the raw audio format in spite of the fact that nowadays compressed format audio, especially MP3 music, has grown into the dominant way to store on personal computers and transmit on the Internet. It would be interesting if a compressed unknown audio fragment is able to be directly recognized from the database without the fussy and time-consuming decompression-identification-recompression procedure. So far, very few algorithms run directly in the compressed domain for music information retrieval, and most of them take advantage of MDCT coefficients or derived energy type of features. As a first attempt, we propose in this paper utilizing compressed-domain spectral entropy as the audio feature to implement a novel audio fingerprinting algorithm. The compressed songs stored in a music database and the possibly distorted compressed query excerpts are first partially decompressed to obtain the MDCT coefficients as the intermediate result. Then by grouping granules into longer blocks, remapping the MDCT coefficients into 192 new frequency lines to unify the frequency distribution of long and short windows, and defining 9 new subbands which cover the main frequency bandwidth of popular songs in accordance with the scale-factor bands of short windows, we calculate the spectral entropy of all consecutive blocks and come to the final fingerprint sequence by means of magnitude relationship modeling. Experiments show that such fingerprints exhibit strong robustness against various audio signal distortions like recompression, noise interference, echo addition, equalization, band-pass filtering, pitch shifting, and slight time-scale modification etc. For 5s-long query examples which might be severely degraded, an average top-five retrieval precision rate of more than 90% can be obtained in our test data set composed of 1822 popular songs.
Keywords: MDCT spectral entropy, audio identification, compressed domain
Effective music tagging through advanced statistical modeling BIBAKFull-Text 635-642
  Jialie Shen; Wang Meng; Shuichang Yan; HweeHwa Pang; Xiansheng Hua
Music information retrieval (MIR) holds great promise as a technology for managing large music archives. One of the key components of MIR that has been actively researched into is music tagging. While significant progress has been achieved, most of the existing systems still adopt a simple classification approach, and apply machine learning classifiers directly on low level acoustic features. Consequently, they suffer the shortcomings of (1) poor accuracy, (2) lack of comprehensive evaluation results and the associated analysis based on large scale datasets, and (3) incomplete content representation, arising from the lack of multimodal and temporal information integration.
   In this paper, we introduce a novel system called MMTagger that effectively integrates both multimodal and temporal information in the representation of music signal. The carefully designed multilayer architecture of the proposed classification framework seamlessly combines Multiple Gaussian Mixture Models (GMMs) and Support Vector Machine (SVM) into a single framework. The structure preserves more discriminative information, leading to more accurate and robust tagging. Experiment results obtained with two large music collections highlight the various advantages of our multilayer framework over state of the art techniques.
Keywords: browsing, music, recommendation, search, tagging
Properties of optimally weighted data fusion in CBMIR BIBAKFull-Text 643-650
  Peter Wilkins; Alan F. Smeaton; Paul Ferguson
Content-Based Multimedia Information Retrieval (CBMIR) systems which leverage multiple retrieval experts (En) often employ a weighting scheme when combining expert results through data fusion. Typically however a query will comprise multiple query images (Im) leading to potentially N x M weights to be assigned. Because of the large number of potential weights, existing approaches impose a hierarchy for data fusion, such as uniformly combining query image results from a single retrieval expert into a single list and then weighting the results of each expert. In this paper we will demonstrate that this approach is sub-optimal and leads to the poor state of CBMIR performance in benchmarking evaluations. We utilize an optimization method known as Coordinate Ascent to discover the optimal set of weights (|En| * |Im|) which demonstrates a dramatic difference between known results and the theoretical maximum. We find that imposing common combinatorial hierarchies for data fusion will half the optimal performance that can be achieved. By examining the optimal weight sets at the topic level, we observe that approximately 15% of the weights (from set |En| * |Im|) for any given query, are assigned 70%-82% of the total weight mass for that topic. Furthermore we discover that the ideal distribution of weights follows a log-normal distribution. We find that we can achieve up to 88% of the performance of fully optimized query using just these 15% of the weights. Our investigation was conducted on TRECVID evaluations 2003 to 2007 inclusive and ImageCLEFPhoto 2007, totalling 181 search topics optimized over a combined collection size of 661,213 images and 1,594 topic images.
Keywords: content-based, data fusion, multimedia fusion

Non-English IR & evaluation

To translate or not to translate? BIBAKFull-Text 651-658
  Chia-Jung Lee; Chin-Hui Chen; Shao-Hang Kao; Pu-Jen Cheng
Query translation is an important task in cross-language information retrieval (CLIR) aiming to translate queries into languages used in documents. The purpose of this paper is to investigate the necessity of translating query terms, which might differ from one term to another. Some untranslated terms cause irreparable performance drop while others do not. We propose an approach to estimate the translation probability of a query term, which helps decide if it should be translated or not. The approach learns regression and classification models based on a rich set of linguistic and statistical properties of the term. Experiments on NTCIR-4 and NTCIR-5 English-Chinese CLIR tasks demonstrate that the proposed approach can significantly improve CLIR performance. An in-depth analysis is also provided for discussing the impact of untranslated out-of-vocabulary (OOV) query terms and translation quality of non-OOV query terms on CLIR performance.
Keywords: cross-language information retrieval, query term performance, query translation, translation quality
Multilingual PRF: English lends a helping hand BIBAKFull-Text 659-666
  Manoj K. Chinnakotla; Karthik Raman; Pushpak Bhattacharyya
In this paper, we present a novel approach to Pseudo-Relevance Feedback (PRF) called Multilingual PRF (MultiPRF). The key idea is to harness multilinguality. Given a query in a language, we take the help of another language to ameliorate the well known problems of PRF, viz. (a) The expansion terms from PRF are primarily based on co-occurrence relationships with query terms, and thus other terms which are lexically and semantically related, such as morphological variants and synonyms, are not explicitly captured, and (b) PRF is quite sensitive to the quality of the initially retrieved top k documents and is thus not robust. In MultiPRF, given a query in language L1, it is translated into language L2 and PRF is performed on a collection in language L2 and the resultant feedback model is translated from L2 back into L1. The final feedback model is obtained by combining the translated model with the original feedback model of the query in L1.
   Experiments were performed on standard CLEF collections in languages with widely differing characteristics, viz., French, German, Finnish and Hungarian with English as the assisting language. We observe that MultiPRF outperforms PRF and is more robust with consistent and significant improvements in the above widely differing languages. A thorough analysis of the results reveal that the second language helps in obtaining both co-occurrence based conceptual terms as well as lexically and semantically related terms. Additionally, the use of the second language collection reduces the sensitivity to performance of initial retrieval, thereby making it more robust.
Keywords: language models, multilingual, pseudo-relevance feedback, query expansion
Comparing the sensitivity of information retrieval metrics BIBAKFull-Text 667-674
  Filip Radlinski; Nick Craswell
Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
   We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
Keywords: evaluation, interleaving, search

Applications II

Efficient partial-duplicate detection based on sequence matching BIBAKFull-Text 675-682
  Qi Zhang; Yue Zhang; Haomin Yu; Xuanjing Huang
With the ever-increasing growth of the Internet, numerous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since partial-duplicates only contain a small piece of text taken from other sources and most existing near-duplicate detection approaches focus on document level, partial duplicates can not be dealt with well. In this paper, we propose a novel algorithm to realize the partial-duplicate detection task. Besides the similarities between documents, our proposed algorithm can simultaneously locate the duplicated parts. The main idea is to divide the partial-duplicate detection task into two subtasks: sentence level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results appear to support that our proposed method is effectively and efficiently to detect both partial-duplicates on large web collections.
Keywords: MapReduce, partial-duplicate detection, sequence matching
Discriminative models of integrating document evidence and document-candidate associations for expert search BIBAKFull-Text 683-690
  Yi Fang; Luo Si; Aditya P. Mathur
Generative models such as statistical language modeling have been widely studied in the task of expert search to model the relationship between experts and their expertise indicated in supporting documents. On the other hand, discriminative models have received little attention in expert search research, although they have been shown to outperform generative models in many other information retrieval and machine learning applications. In this paper, we propose a principled relevance-based discriminative learning framework for expert search and derive specific discriminative models from the framework. Compared with the state-of-the-art language models for expert search, the proposed research can naturally integrate various document evidence and document-candidate associations into a single model without extra modeling assumptions or effort. An extensive set of experiments have been conducted on two TREC Enterprise track corpora (i.e., W3C and CERC) to demonstrate the effectiveness and robustness of the proposed framework.
Keywords: discriminative models, enterprise search, expert search
Vertical selection in the presence of unlabeled verticals BIBAKFull-Text 691-698
  Jaime Arguello; Fernando Diaz; Jean-François Paiement
Vertical aggregation is the task of incorporating results from specialized search engines or verticals (e.g., images, video, news) into Web search results. Vertical selection is the subtask of deciding, given a query, which verticals, if any, are relevant. State of the art approaches use machine learned models to predict which verticals are relevant to a query. When trained using a large set of labeled data, a machine learned vertical selection model outperforms baselines which require no training data. Unfortunately, whenever a new vertical is introduced, a costly new set of editorial data must be gathered. In this paper, we propose methods for reusing training data from a set of existing (source) verticals to learn a predictive model for a new (target) vertical. We study methods for learning robust, portable, and adaptive cross-vertical models. Experiments show the need to focus on different types of features when maximizing portability (the ability for a single model to make accurate predictions across multiple verticals) than when maximizing adaptability (the ability for a single model to make accurate predictions for a specific vertical). We demonstrate the efficacy of our methods through extensive experimentation for 11 verticals.
Keywords: distributed information retrieval, domain adaptation, query intent, vertical search


iCollaborate: harvesting value from enterprise web usage BIBAKFull-Text 699
  Ajinkya Kale; Thomas Burris; Bhavesh Shah; T. L. Prasanna Venkatesan; Lakshmanan Velusamy; Manish Gupta; Melania Degerattu
We are in a phase of 'Participatory Web' in which users add value' to the information on the web by publishing, tagging and sharing. The Participatory Web has enormous potential for an enterprise because unlike the users of the internet an enterprise is a community that shares common goals, assumptions, vocabulary and interest and has reliable user identification and mutual trust along with a central governance and incentives to collaborate. Everyday, the employees of an organization locate content relevant to their work on the web. Finding this information takes time, expertise and creativity, which costs an organization money. That is, the web pages employees find are knowledge assets owned by the enterprise. This investment in web-based knowledge assets is lost every time the enterprise fails to capture and reuse them. iCollaborate is tooled to capture user's web interaction, persist and analyze it, and feed that interaction back into the community -- the enterprise.
Keywords: enterprise social data, social browsing
Exploring desktop resources based on user activity analysis BIBAKFull-Text 700
  Yukun Li; Xiangyu Zhang; Xiaofeng Meng
Relocation in personal desktop resources is an interesting and promising research topic. This demonstration illustrates a new perspective in exploring desktop resources to help users re-find expected data resources more effectively. Different from existing works, our prototype OrientSpace has two features: automatically extract and maintain user tasks to support task-based exploration, and support vague search by exploiting associations between desktop resources.
Keywords: association exploration, desktop resources, task exploration
A data-parallel toolkit for information retrieval BIBKFull-Text 701
  Dennis Fetterly; Frank McSherry
Keywords: data parallel, information retrieval
Finding and filtering information for children BIBAKFull-Text 702
  Desmond Elliot; Richard Glassey; Tamara Polajnar; Leif Azzopardi
Children face several challenges when using information access systems. These include formulating queries, judging the relevance of documents, and focusing attention on interface cues, such as query suggestions, while typing queries. It has also been shown that children want a personalised Web experience and prefer content presented to them that matches their long-term entertainment and education needs. To this end, we have developed an interaction-based information filtering system to address these challenges.
Keywords: children, information filtering
Automatic content linking: speech-based just-in-time retrieval for multimedia archives BIBAKFull-Text 703
  Andrei Popescu-Belis; Jonathan Kilgour; Peter Poller; Alexandre Nanchen; Erik Boertjes; Joost de Wit
The Automatic Content Linking Device monitors a conversation and uses automatically recognized words to retrieve documents that are of potential use to the participants. The document set includes project related reports or emails, transcribed snippets of past meetings, and websites. Retrieval results are displayed at regular intervals.
Keywords: just-in-time retrieval
Si-Fi: interactive similar item finder BIBKFull-Text 704
  Inbeom Hwang; Minsuk Kahng; Sung Eun Park; Jinwook Seo; Sang-goo Lee
Keywords: hierarchical clustering, interactive search
Suggesting related topics in web search BIBAKFull-Text 705
  Santosh Raju; Shaishav Kumar; Raghavendra Udupa
Suggesting topics that are related to user's goal or interest is very important in web search. However, search engines today focus on suggesting mainly reformulations and lexical variants of the query mined from query logs. In this demonstration, we show a system that can suggest related topics for a query based on the top search results for the query. It can help users in exploring the topics related to their information need. The topic suggestion system can be integrated with any search engine or it can be easily installed on the client machine as a browser plugin.
Keywords: topic suggestions, web search
Agro-Gator: digesting experts, logs, and N-grams BIBAKFull-Text 706
  Michael Huggett
As research includes more and larger user studies, a significant problem lies in combining the many types of data files into a single table suitable for analysis by common statistical tools.
   We have developed a data-aggregation tool that combines user logs, expert scoring, and task/session attributes. The tool also integrates the n-grams derived from a given sequence of actions in the user tasks. The tool provides a GUI for quick and easy configuration.
Keywords: N-gram analysis, data aggregation, data analysis, user studies
Medical search and classification tools for recommendation BIBAKFull-Text 707
  Jimmy Xiangji Huang; Aijun An; Qinmin Hu
their patients' records from paper to computer, enormous amounts of electronic medical records (EMR) have become available for medical research. Some of the EMR data are well-structured, for which traditional database management systems can provide effective retrieval and management functions. However, most of the EMR data (such as progress notes and consultation letters) are in free text formats. How to effectively and efficiently retrieve and discover useful information from the vast amount of such semi-structured data is a challenge faced by medical professionals. Without proper tools, the rich information and knowledge buried in the medical health records are unavailable for clinical research and decision-making. The objective of our research is to develop text analytics tools that are capable of parsing clinical medical data so that predefined search subjects that correspond to a list of medical diagnoses can be extracted. In addition to this particular core functionality, it is also desired that several important assets should be present within the text-analytics tools in order to improve its overall ability to be used as recommendation tools. In this research, we work with research scientists at the Institute for Clinical Evaluative Sciences (ICES) in Toronto and examine a number of techniques for structuring and processing free text documents in order to effectively and efficiently search and analyze vast amount of medical records. We implement several powerful medical text analytics tools for clinical data searching and classification. For data classification, our tools sort through a great amount of patient records to identify the likelihood of a patient having myocardial infarction (MI) or hypertension (HTN), and classify the patients accordingly. Our tools can also identify the likelihood of a patient being a smoker, previous smoker or non-smoker based on the text data of medical records.
Keywords: EMR, classification, medical search, recommendation
Multilingual people search BIBAKFull-Text 708
  Shaishav Kumar; Raghavendra Udupa
People Search is an important search service with multiple applications (eg. looking up a friend on Facebook, finding colleagues in corporate email directories etc). With the proportion of non-English users on a steady rise, people search services are being used by users from diverse language demographics. Users may issue name search queries against these directories in languages other than the language of the directory, in which case the present monolingual name search approaches will not work. In this demo, we present a Multilingual People Search system capable of performing fast name lookups on large user directories, independent of the directory language. Our system has applications in areas like social networking, enterprise search and email address book search.
Keywords: multilingual name search, people search

Poster presentations

Closed form solution of similarity algorithms BIBAKFull-Text 709-710
  Yuanzhe Cai; Miao Zhang; Chris Ding; Sharma Chakravarthy
Algorithms defining similarities between objects of an information network are important of many IR tasks. SimRank algorithm and its variations are popularly used in many applications. Many fast algorithms are also developed. In this note, we first reformulate them as random walks on the network and express them using forward and backward transition probably in a matrix form. Second, we show that P-Rank (SimRank is only the special case of P-Rank) has a unique solution of eeT when decay factor c is equal to 1. We also show that SimFusion algorithm is a special case of P-Rank algorithm and prove that the similarity matrix of SimFusion is the product of PageRank vector. Our experiments on the web datasets show that for P-Rank the decay factor c doesn't seriously affect the similarity accuracy and accuracy of P-Rank is also higher than SimFusion and SimRank.
Keywords: linkage mining, similarity calculation
Blog snippets: a comments-biased approach BIBAKFull-Text 711-712
  Javier Parapar; Jorge López-Castro; Álvaro Barreiro
In the last years Blog Search has been a new exciting task in Information Retrieval. The presence of user generated information with valuable opinions makes this field of huge interest. In this poster we use part of this information, the readers' comments, to improve the quality of post snippets with the objective of enhancing the user access to the relevant posts in a result list. We propose a simple method for snippet generation based on sentence selection, using the comments to guide the selection process. We evaluated our approach with standard TREC methodology in the Blogs06 collection showing significant improvements up to 32% in terms of MAP over the baseline.
Keywords: blogs, comments, snippets
SIGIR: scholar vs. scholars' interpretation BIBAKFull-Text 713-714
  James Lanagan; Alan F. Smeaton
Google Scholar allows researchers to search through a free and extensive source of information on scientific publications. In this paper we show that within the limited context of SIGIR proceedings, the rankings created by Google Scholar are both significantly different and very negatively correlated with those of domain experts.
Keywords: Google scholar, citation analysis, information retrieval
Effective query expansion with the resistance distance based term similarity metric BIBAKFull-Text 715-716
  Shuguang Wang; Milos Hauskrecht
In this paper, we define a new query expansion method that relies on term similarity metric derived from the electric resistance network. This proposed metric lets us measure the mutual relevancy in between terms and between their groups. This paper shows how to define this metric automatically from the document collection, and then apply it in query expansion for document retrieval tasks. The experiments show this method can be used to find good expansion terms of search queries and improve document retrieval performance on two TREC genomic track datasets.
Keywords: information retrieval, query expansion, term similarity
A method to automatically construct a user knowledge model in a forum environment BIBAKFull-Text 717-718
  Ahmad Kardan; Mehdi Garakani; Bamdad Bahrani
Having a mechanism to validate the opinions and to identify experts in a forum could help people to favor one opinion against another. To achieve this, some solutions have already been introduced, including social network analysis techniques and reputation modeling. However, neither of these solutions considers the users' knowledge to identify an expert. In this paper, a novel method is proposed which estimates users' knowledge based on the forum itself, and identifies the possible areas of expertise associated with each user.
Keywords: expert finding, forum, information retrieval, knowledge model
Learning to rank audience for behavioral targeting BIBAKFull-Text 719-720
  Ning Liu; Jun Yan; Dou Shen; Depin Chen; Zheng Chen; Ying Li
Behavioral Targeting (BT) is a recent trend of online advertising market. However, some classical BT solutions, which predefine the user segments for BT ads delivery, are sometimes too large to numerous long-tail advertisers, who cannot afford to buy any large user segments due to budget consideration. In this extend abstract, we propose to rank users according to their probability of interest in an advertisement in a learning to rank framework. We propose to extract three types of features between user behaviors such as search queries, ad click history etc and the ad content provided by advertisers. Through this way, a long-tail advertiser can select a certain number of top ranked users as needed from the user segments for ads delivery. In the experiments, we use a 30-days' ad click-through log from a commercial search engine. The results show that using our proposed features under a learning to rank framework, we can well rank users who potentially interest in an advertisement.
Keywords: behavioral targeting, learning to rank, online advertising
Multi-modal query expansion for web video search BIBAKFull-Text 721-722
  Bailan Feng; Juan Cao; Zhineng Chen; Yongdong Zhang; Shouxun Lin
Query expansion is an effective method to improve the usability of multimedia search. Most existing multimedia search engines are able to automatically expand a list of textual query terms based on text search techniques, which can be called textual query expansion (TQE). However, the annotations (title and tag) around web videos are generally noisier for text-only query expansion and search matching. In this paper, we propose a novel multi-modal query expansion (MMQE) framework for web video search to solve the issue. Compared with traditional methods, MMQE provides a more intuitive query suggestion by transforming textual query to visual presentation based on visual clustering. Parallel to this, MMQE can enhance the process of search matching with strong pertinence of intent-specific query by joining textual, visual and social cues from both metadata and content of videos. Experimental results on real web videos from YouTube demonstrate the effectiveness of the proposed method.
Keywords: query expansion, web video search
Context aware query classification using dynamic query window and relationship net BIBAKFull-Text 723-724
  Nazli Goharian; Saket S. R. Mengle
The context of the user queries, preceding a given query, is utilized to improve the effectiveness of query classification. Earlier efforts utilize fixed number of preceding queries to derive such context information. We propose and evaluate an approach (DQW) that identifies a set of unambiguous preceding queries in a dynamically determined window to utilize in classifying an ambiguous query. Furthermore, utilizing a relationship-net (R-net) that represents relationships among known categories, we improve the classification effectiveness for those ambiguous queries whose predicted category in this relationship-net is related to the category of a query within the window. Our results indicate that the hybrid approach (DQW+R-net) statistically significantly improves the Conditional Random Field (CRF) query classification approach when static query windowing and hierarchical taxonomy are used (SQW+Tax), in terms of precision (10.8%), recall (13.2%), and F1 measure (11.9%).
Keywords: query classification
Predicting query potential for personalization, classification or regression? BIBAKFull-Text 725-726
  Chen Chen; Muyun Yang; Sheng Li; Tiejun Zhao; Haoliang Qi
The goal of predicting query potential for personalization is to determine which queries can benefit from personalization. In this paper, we investigate which kind of strategy is better for this task: classification or regression. We quantify the potential benefits of personalizing search results using two implicit click-based measures: Click entropy and Potential@N. Meanwhile, queries are characterized by query features and history features. Then we build C-SVM classification model and epsilon-SVM regression model respectively according to these two measures. The experimental results show that the classification model is a better choice for predicting query potential for personalization.
Keywords: classification, query potential for personalization, regression
The impact of collection size on relevance and diversity BIBAKFull-Text 727-728
  Marijn Koolen; Jaap Kamps
It has been observed that precision increases with collection size. One explanation could be that the redundancy of information increases, making it easier to find multiple documents conveying the same information. Arguably, a user has no interest in reading the same information over and over, but would prefer a set of diverse search results covering multiple aspects of the search topic. In this paper, we look at the impact of the collection size on the relevance and diversity of retrieval results by down-sampling the collection.
   Our main finding is that we can we can improve diversity by randomly removing the majority of the results -- this will significantly reduce the redundancy and only marginally affect the subtopic coverage.
Keywords: collection size, diversity, relevance
Spatial relationships in visual graph modeling for image categorization BIBAKFull-Text 729-730
  Trong-Ton Pham; Philippe Mulhem; Loic Maisonnasse
In this paper, a language model adapted to graph-based representation of image content is proposed and assessed. The full indexing and retrieval processes are evaluated on two different image corpora. We show that using the spatial relationships with graph model has a positive impact on the results of standard Language Model (LM) and outperforms the baseline built upon the current state-of-the-art Support Vector Machine (SVM) classification method.
Keywords: graph theory, image categorization, language model
A picture is worth a thousand search results: finding child-oriented multimedia results with collAge BIBAKFull-Text 731-732
  Karl Gyllstrom; Marie-Francine Moens
We present a simple and effective approach to complement search results for children's web queries with child-oriented multimedia results, such as coloring pages and music sheets. Our approach determines appropriate media types for a query by searching Google's database of frequent queries for co-occurrences of a query's terms (e.g., "dinosaurs") with preselected multimedia terms (e.g., "coloring pages"). We show the effectiveness of this approach through an online user evaluation.
Keywords: children, google, mechanical turk, query suggestion
Query recovery of short user queries: on query expansion with stopwords BIBAKFull-Text 733-734
  Johannes Leveling; Gareth J. F. Jones
User queries to search engines are observed to predominantly contain inflected content words but lack stopwords and capitalization. Thus, they often resemble natural language queries after case folding and stopword removal. Query recovery aims to generate a linguistically well-formed query from a given user query as input to provide natural language processing tasks and cross-language information retrieval (CLIR). The evaluation of query translation shows that translation scores (NIST and BLEU) decrease after case folding, stopword removal, and stemming. A baseline method for query recovery reconstructs capitalization and stopwords, which considerably increases translation scores and significantly increases mean average precision for a standard CLIR task.
Keywords: clir, query expansion, query reformulation
Where to start filtering redundancy?: a cluster-based approach BIBAKFull-Text 735-736
  Ronald T. Fernandez; Javier Parapar; David E. Losada; Alvaro Barreiro
Novelty detection is a difficult task, particularly at sentence level. Most of the approaches proposed in the past consist of re-ordering all sentences following their novelty scores. However, this re-ordering has usually little value. In fact, a naive baseline with no novelty detection capabilities yields often better performance than any state-of-the-art novelty detection mechanism. We argue here that this is because current methods initiate too early the novelty detection process. When few sentences have been seen, it is unlikely that the user is negatively affected by redundancy. Therefore, re-ordering the first sentences may be harmful in terms of performance. We propose here a query-dependent method based on cluster analysis to determine where we must start filtering redundancy.
Keywords: novelty detection, sentence clustering
Flickr group recommendation based on tensor decomposition BIBAKFull-Text 737-738
  Nan Zheng; Qiudan Li; Shengcai Liao; Leiming Zhang
Over the last few years, Flickr has gained massive popularity and groups in Flickr are one of the main ways for photo diffusion. However, the huge volume of groups brings troubles for users to decide which group to choose. In this paper, we propose a tensor decomposition-based group recommendation model to suggest groups to users which can help tackle this problem. The proposed model measures the latent associations between users and groups by considering both semantic tags and social relations. Experimental results show the usefulness of the proposed model.
Keywords: Flickr group, group recommendation, tensor decomposition
Robust music identification based on low-order Zernike moment in the compressed domain BIBAKFull-Text 739-740
  Wei Li; Yaduo Liu; Xiangyang Xue
In this paper, we devise a novel robust music identification algorithm utilizing compressed-domain audio Zernike moment adapted from image processing techniques as the pivotal feature. Audio fingerprint derived from this feature exhibits strong robustness against various audio signal distortions including the challenging pitch shifting and time-scale modification. Experiments show that in our test dataset composed of 1822 popular songs, a 5s music query example which might have been severely corrupted is still sufficient to identify its original near-duplicate copy, with more than 90% top five precision rate.
Keywords: music identification, robustness, Zernike moment
Estimating interference in the QPRP for subtopic retrieval BIBAKFull-Text 741-742
  Guido Zuccon; Leif Azzopardi; Claudia Hauff; C. J. Keith van Rijsbergen
The Quantum Probability Ranking Principle (QPRP) has been recently proposed, and accounts for interdependent document relevance when ranking. However, to be instantiated, the QPRP requires a method to approximate the "interference" between two documents. In this poster, we empirically evaluate a number of different methods of approximation on two TREC test collections for subtopic retrieval. It is shown that these approximations can lead to significantly better retrieval performance over the state of the art.
Keywords: diversity, interference estimation, quantum probability ranking principle
Query quality: user ratings and system predictions BIBAKFull-Text 743-744
  Claudia Hauff; Franciska de Jong; Diane Kelly; Leif Azzopardi
Numerous studies have examined the ability of query performance prediction methods to estimate a query's quality for system effectiveness measures (such as average precision). However, little work has explored the relationship between these methods and user ratings of query quality. In this poster, we report the findings from an empirical study conducted on the TREC ClueWeb09 corpus, where we compared and contrasted user ratings of query quality against a range of query performance prediction methods. Given a set of queries, it is shown that user ratings of query quality correlate to both system effectiveness measures and a number of pre-retrieval predictors.
Keywords: query performance prediction
Multi-field learning for email spam filtering BIBAKFull-Text 745-746
  Wuying Liu; Ting Wang
Through the investigation of email document structure, this paper proposes a multi-field learning (MFL) framework, which breaks the multi-field document Text Classification (TC) problem into several sub-document TC problems, and makes the final category prediction by weighted linear combination of several sub-document TC results. Many previous statistical TC algorithms can be easily rebuilt within the MFL framework via turning binary result to spamminess score, which is a real number and reflects the likelihood that the classified email is spam. The experimental results in the TREC spam track show that the performances of many TC algorithms can be improved within the MFL framework.
Keywords: multi-field learning, spam filtering, text feature selection
Language-model-based pro/con classification of political text BIBAKFull-Text 747-748
  Rawia Awadallah; Maya Ramanath; Gerhard Weikum
Given a controversial political topic, our aim is to classify documents debating the topic into pro or con. Our approach extracts topic related terms, pro/con related terms, and pairs of topic related and pro/con related terms and uses them as the basis for constructing a pro query and a con query. Following standard LM techniques, a document is classified as pro or con depending on which of the query likelihoods is higher for the document. Our experiments show that our approach is promising.
Keywords: language models, sentiment analysis, text classification
Intent boundary detection in search query logs BIBAKFull-Text 749-750
  Chieh-Jen Wang; Kevin Hsin-Yih Lin; Hsin-Hsi Chen
Identifying intent boundary in search query logs is important for learning users' behaviors and applying their experiences. Time-based, query-based, and cluster-based approaches are proposed. Experiments show that the integration of intent clusters and dynamic time model performs the best.
Keywords: intent boundary detection, intent clustering, query log analysis
Semi-supervised spam filtering using aggressive consistency learning BIBAKFull-Text 751-752
  Mona Mojdeh; Gordon V. Cormack
A graph based semi-supervised method for email spam filtering, based on the local and global consistency method, yields low error rates with very few labeled examples. The motivating application of this method is spam filters with access to very few labeled message. For example, during the initial deployment of a spam filter, only a handful of labeled examples are available but unlabeled examples are plentiful. We demonstrate the performance of our approach on TREC 2007 and CEAS 2008 email corpora. Our results compare favorably with the best-known methods, using as few as just two labeled examples: one spam and one non-spam.
Keywords: classification, email, filtering, spam
Entropy descriptor for image classification BIBAKFull-Text 753-754
  Hongyu Li; Junyu Niu; Jiachen Chen; Huibo Liu
This paper presents a novel entropy descriptor in the sense of geometric manifolds. With this descriptor, entropy cycles can be easily designed for image classification. Minimizing this entropy leads to an optimal entropy cycle where images are connected in the semantic order. During classification, the training step is to find an optimal entropy cycle in each class. In the test step, an unknown image is grouped into a class if the entropy increase as the result of inserting the image into the cycle of this class is relatively least. The proposed approach can generalize well on difficult image classification problems where images with same objects are taken in multiple views. Experimental results show that this entropy descriptor performs well in image classification and has potential in the image-based modeling retrieval.
Keywords: image classification
Has portfolio theory got any principles? BIBAKFull-Text 755-756
  Guido Zuccon; Leif Azzopardi; C. J. Keith van Rijsbergen
Recently, Portfolio Theory (PT) has been proposed for Information Retrieval. However, under non-trivial conditions PT violates the original Probability Ranking Principle (PRP). In this poster, we shall explore whether PT upholds a different ranking principle based on Quantum Theory, i.e. the Quantum Probability Ranking Principle (QPRP), and examine the relationship between this new model and the new ranking principle. We make a significant contribution to the theoretical development of PT and show that under certain circumstances PT upholds the QPRP, and thus guarantees an optimal ranking according to the QPRP. A practical implication of this finding is that the parameters of PT can be automatically estimated via the QPRP, instead of resorting to extensive parameter tuning.
Keywords: interdependent document relevance, portfolio theory for IR, quantum probability ranking principle
Re-examination on lam% in spam filtering BIBAKFull-Text 757-758
  Haoliang Qi; Muyun Yang; Xiaoning He; Sheng Li
Logistic average misclassification percentage (lam%) is a key measure for the spam filtering performance. This paper demonstrates that a spam filter can achieve a perfect 0.00% in lam%, the minimal value in theory, by simply setting a biased threshold during the classifier modeling. At the same time, the overall classification performance reaches only a low accuracy. The result suggests that the role of lam% for spam filtering evaluation should be re-examined.
Keywords: lam, measurement, spam filtering
Unsupervised estimation of dirichlet smoothing parameters BIBAKFull-Text 759-760
  Jangwon Seo; W. Bruce Croft
A standard approach for determining a Dirichlet smoothing parameter is to choose a value which maximizes a retrieval performance metric using training data consisting of queries and relevance judgments. There are, however, situations where training data does not exist or the queries and relevance judgments do not reflect typical user information needs for the application. We propose an unsupervised approach for estimating a Dirichlet smoothing parameter based on collection statistics. We show empirically that this approach can suggest a plausible Dirichlet smoothing parameter value in cases where relevance judgments cannot be used.
Keywords: Dirichlet smoothing, parameter estimation, unsupervised approach
Comparing click-through data to purchase decisions for retrieval evaluation BIBAKFull-Text 761-762
  Katja Hofmann; Bouke Huurnink; Marc Bron; Maarten de Rijke
Traditional retrieval evaluation uses explicit relevance judgments which are expensive to collect. Relevance assessments inferred from implicit feedback such as click-through data can be collected inexpensively, but may be less reliable. We compare assessments derived from click-through data to another source of implicit feedback that we assume to be highly indicative of relevance: purchase decisions. Evaluating retrieval runs based on a log of an audio-visual archive, we find agreement between system rankings and purchase decisions to be surprisingly high.
Keywords: evaluation, query log analysis
Personalize web search results with user's location BIBAKFull-Text 763-764
  Yumao Lu; Fuchun Peng; Xing Wei; Benoit Dumoulin
We build a probabilistic model to identify implicit local intent queries, and leverage user's physical location to improve Web search results for these queries. Evaluation on commercial search engine shows significant improvement on search relevance and user experience.
Keywords: personalized search, query log analysis
Using search session context for named entity recognition in query BIBAKFull-Text 765-766
  Junwu Du; Zhimin Zhang; Jun Yan; Yan Cui; Zheng Chen
Recently, the problem of Named Entity Recognition in Query (NERQ) is attracting increasingly attention in the field of information retrieval. However, the lack of context information in short queries makes some classical named entity recognition (NER) algorithms fail. In this paper, we propose to utilize the search session information before a query as its context to address this limitation. We propose to improve two classical NER solutions by utilizing the search session context, which are known as Conditional Random Field (CRF) based solution and Topic Model based solution respectively. In both approaches, the relationship between current focused query and previous queries in the same session are used to extract novel context aware features. Experimental results on real user search session data show that the NERQ algorithms using search session context performs significantly better than the algorithms using only information of the short queries.
Keywords: CRF, named entity recognition, search session, topic model
Evaluating whole-page relevance BIBAKFull-Text 767-768
  Peter Bailey; Nick Craswell; Ryen W. White; Liwei Chen; Ashwin Satyanarayana; S. M. M. Tahaghoghi
Whole page relevance defines how well the surface-level representation of all elements on a search result page and the corresponding holistic attributes of the presentation respond to users' information needs. We introduce a method for evaluating the whole-page relevance of Web search engine results pages. Our key contribution is that the method allows us to investigate aspects of component relevance that are difficult or impossible to judge in isolation. Such aspects include component-level information redundancy and cross-component coherence. The method we describe complements traditional document relevance measurement, affords comparative relevance assessment across multiple search engines, and facilitates the study of important factors such as brand presentation effects and component-level quality.
Keywords: evaluation, measurement, web search relevance
Predicting escalations of medical queries based on web page structure and content BIBAKFull-Text 769-770
  Ryen W. White; Eric Horvitz
Logs of users' searches on Web health topics can exhibit signs of escalation of medical concerns, where initial queries about common symptoms are followed by queries about serious, rare illnesses. We present an effort to predict such escalations based on the structure and content of pages encountered during medical search sessions. We construct and then characterize the performance of classifiers that predict whether an escalation will occur after the access of a page. Our findings have implications for ranking algorithms and the design of search interfaces.
Keywords: cyberchondria, medical search
Contextual video advertising system using scene information inferred from video scripts BIBAKFull-Text 771-772
  Bong-Jun Yi; Jung-Tae Lee; Hyun-Wook Woo; Hae-Chang Rim
With the rise of digital video consumptions, contextual video advertising demands have been increasing in recent years. This paper presents a novel video advertising system that selects relevant text ads for a given video scene by automatically identifying the situation of the scene. The situation information of video scenes is inferred from available video scripts. Experimental results show that the use of the situation information enhances the accuracy of ad retrieval for video scenes. The proposed system represents one of the pioneer video advertising systems using contextual information obtained from video scripts.
Keywords: contextual video advertising, scene, script, situation
Cross-language retrieval using link-based language models BIBAKFull-Text 773-774
  Benjamin Roth; Dietrich Klakow
We propose a cross-language retrieval model that is solely based on Wikipedia as a training corpus. The main contributions of our work are: 1. A translation model based on linked text in Wikipedia and a term weighting method associated with it. 2. A combination scheme to interpolate the link translation model with retrieval based on Latent Dirichlet Allocation. On the CLEF 2000 data we achieve improvement with respect to the best German-English system at the bilingual track (non-significant) and improvement against a baseline based on machine translation (significant).
Keywords: CLIR, LDA, language modeling, wikipedia
Search system requirements of patent analysts BIBAKFull-Text 775-776
  Leif Azzopardi; Wim Vanderbauwhede; Hideo Joho
Patent search tasks are difficult and challenging, often requiring expert patent analysts to spend hours, even days, sourcing relevant information. To aid them in this process, analysts use Information Retrieval systems and tools to cope with their retrieval tasks. With the growing interest in patent search, it is important to determine their requirements and expectations of the tools and systems that they employ. In this poster, we report a subset of the findings of a survey of patent analysts conducted to elicit their search requirements.
Keywords: patent analysts, patent engineers, patent search, user study
On performance of topical opinion retrieval BIBAKFull-Text 777-778
  Giambattista Amati; Giuseppe Amodeo; Valerio Capozio; Carlo Gaibisso; Giorgio Gambosi
We investigate the effectiveness of both the standard evaluation measures and the opinion component for topical opinion retrieval. We analyze how relevance is affected by opinions by perturbing relevance ranking by the outcomes of opinion-only classifiers built by Monte Carlo sampling. Topical opinion rankings are obtained by either re-ranking or filtering the documents of a first-pass retrieval of topic relevance.
   The proposed approach establishes the correlation between the accuracy and the precision of the classifier and the performance of the topical opinion retrieval. Among other results, it is possible to assess the effectiveness of the opinion component by comparing the effectiveness of the relevance baseline with the topical opinion ranking.
Keywords: classification, opinion retrieval, sentiment analysis
Improving sentence retrieval with an importance prior BIBAKFull-Text 779-780
  Leif Azzopardi; Ronald T. Fernández; David E. Losada
The retrieval of sentences is a core task within Information Retrieval. In this poster we employ a Language Model that incorporates a prior which encodes the importance of sentences within the retrieval model. Then, in a set of comprehensive experiments using the TREC Novelty Tracks, we show that including this prior substantially improves retrieval effectiveness, and significantly outperforms the current state of the art in sentence retrieval.
Keywords: language models, sentence retrieval
Focused access to sparsely and densely relevant documents BIBAKFull-Text 781-782
  Paavo Arvola; Jaana Kekäläinen; Marko Junkkari
XML retrieval provides a focused access to the relevant content of documents. However, in evaluation, full document retrieval has appeared competitive to focused XML retrieval. We analyze the density of relevance in documents, and show that in sparsely relevant documents focused retrieval performs better, whereas in densely relevant documents the performance of focused and document retrieval is equal.
Keywords: XML retrieval, focused retrieval, tolerance to irrelevance
Text document clustering with metric learning BIBAKFull-Text 783-784
  Jinlong Wang; Shunyao Wu; Huy Quan Vu; Gang Li
One reason for semi-supervised clustering fail to deliver satisfactory performance in document clustering is that the transformed optimization problem could have many candidate solutions, but existing methods provide no mechanism to select a suitable one from all those candidates. This paper alleviates this problem by posing the same task as a soft-constrained optimization problem, and introduces the salient degree measure as an information guide to control the searching of an optimal solution. Experimental results show the effectiveness of the proposed method in the improvement of the performance, especially when the amount of priori domain knowledge is limited.
Keywords: document clustering, metric learning
Predicting query performance on the web BIBAKFull-Text 785-786
  Niranjan Balasubramanian; Giridhar Kumaran; Vitor R. Carvalho
Predicting the performance of web queries is useful for several applications such as automatic query reformulation and automatic spell correction. In the web environment, accurate performance prediction is challenging because measures such as clarity that work well on homogeneous TREC-like collections, are not as effective and are often expensive to compute. We present Rank-time Performance Prediction (RAPP), an effective and efficient approach for online performance prediction on the web. RAPP uses retrieval scores, and aggregates of the rank-time features used by the document-ranking algorithm to train regressors for query performance prediction. On a set of over 12,000 queries sampled from the query logs of a major search engine, RAPP achieves a linear correlation of 0.78 with DCG@5, and 0.52 with NDCG@5. Analysis of prediction accuracy shows that hard queries are easier to identify while easy queries are harder to identify.
Keywords: performance prediction, query difficulty, web search
Hashtag retrieval in a microblogging environment BIBAKFull-Text 787-788
  Miles Efron
Microblog services let users broadcast brief textual messages to people who "follow" their activity. Often these posts contain terms called hashtags, markers of a post's meaning, audience, etc. This poster treats the following problem: given a user's stated topical interest, retrieve useful hashtags from microblog posts. Our premise is that a user interested in topic x might like to find hashtags that are often applied to posts about x. This poster proposes a language modeling approach to hashtag retrieval. The main contribution is a novel method of relevance feedback based on hashtags. The approach is tested on a corpus of data harvested from twitter.com.
Keywords: hashtag, microblog, relevance feedback, twitter
Crowdsourcing a wikipedia vandalism corpus BIBAKFull-Text 789-790
  Martin Potthast
We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified. 753 human annotators cast a total of 193022 votes on the edits, so that each edit was reviewed by at least 3 annotators, whereas the achieved level of agreement was analyzed in order to label an edit as "regular" or "vandalism." The corpus is available free of charge.
Keywords: corpus, evaluation, vandalism detection, wikipedia
MEMOSE: search engine for emotions in multimedia documents BIBAKFull-Text 791-792
  Kathrin Knautz; Tobias Siebenlist; Wolfgang G. Stock
The MEMOSE (Media Emotion Search) system is a specialized search engine for fundamental emotions in all kinds of emotional-laden documents. We apply a controlled vocabulary for basic emotions, a slide control to adjust the intensities of the emotions and the approach of broad folksonomies. The paper describes the indexing and the retrieval tool of MEMOSE and results from its evaluation.
Keywords: collaborative indexing, emotion, emotional information retrieval (EMIR), multimedia resources, slide control tagging
Hierarchical Pitman-Yor language model for information retrieval BIBAKFull-Text 793-794
  Saeedeh Momtazi; Dietrich Klakow
In this paper, we propose a new application of Bayesian language model based on Pitman-Yor process for information retrieval. This model is a generalization of the Dirichlet distribution. The Pitman-Yor process creates a power-law distribution which is one of the statistical properties of word frequency in natural language. Our experiments on Robust04 indicate that this model improves the document retrieval performance compared to the commonly used Dirichlet prior and absolute discounting smoothing techniques.
Keywords: Pitman-Yor process, information retrieval, language modeling, smoothing methods
Entity summarization of news articles BIBAKFull-Text 795-796
  Gianluca Demartini; Malik Muhammad Saad Missen; Roi Blanco; Hugo Zaragoza
In this paper we study the problem of entity retrieval for news applications and the importance of the news trail history (i.e. past related articles) to determine the relevant entities in current articles. We construct a novel entity-labeled corpus with temporal information out of the TREC 2004 Novelty collection. We develop and evaluate several features, and show that an article's history can be exploited to improve its summarization.
Keywords: entity summarization, time-aware search
The power of naive query segmentation BIBAKFull-Text 797-798
  Matthias Hagen; Martin Potthast; Benno Stein; Christof Braeutigam
We address the problem of query segmentation: given a keyword query submitted to a search engine, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve good segmentation performance on a gold standard but are fairly intricate. Our method is easy to implement and comes with a comparable accuracy.
Keywords: query segmentation
Clicked phrase document expansion for sponsored search ad retrieval BIBAKFull-Text 799-800
  Dustin Hillard; Chris Leggetter
We present a document expansion approach that uses Conditional Random Field (CRF) segmentation to automatically extract salient phrases from ad titles. We then supplement the ad document with query segments that are probable translations of the document phrases, as learned from a large commercial search engine's click logs. Our approach provides a significant improvement in DCG and interpolated precision and recall on a large set of human labeled query-ad pairs.
Keywords: document expansion
Three web-based heuristics to determine a person's or institution's country of origin BIBAKFull-Text 801-802
  Markus Schedl; Klaus Seyerlehner; Dominik Schnitzer; Gerhard Widmer; Cornelia Schiketanz
We propose three heuristics to determine the country of origin of a person or institution via text-based IE from the Web. We evaluate all methods on a collection of music artists and bands, and show that some heuristics outperform earlier work on the topic by terms of coverage, while retaining similar precision levels. We further investigate an extension using country-specific synonym lists.
Keywords: country of origin detection, evaluation, information extraction, music information research, term weighting
Exploiting click-through data for entity retrieval BIBAKFull-Text 803-804
  Bodo Billerbeck; Gianluca Demartini; Claudiu Firan; Tereza Iofciu; Ralf Krestel
We present an approach for answering Entity Retrieval queries using click-through information in query log data from a commercial Web search engine. We compare results using click graphs and session graphs and present an evaluation test set making use of Wikipedia "List of" pages.
Keywords: entity retrieval, evaluation, query log analysis, user session, wikipedia
Feature subset non-negative matrix factorization and its applications to document understanding BIBAKFull-Text 805-806
  Dingding Wang; Chris Ding; Tao Li
In this paper, we propose feature subset non-negative matrix factorization (NMF), which is an unsupervised approach to simultaneously cluster data points and select important features. We apply our proposed approach to various document understanding tasks including document clustering, summarization, and visualization. Experimental results demonstrate the effectiveness of our approach for these tasks.
Keywords: NMF, feature subset selection
Learning to rank query reformulations BIBAKFull-Text 807-808
  Van Dang; Michael Bendersky; W. Bruce Croft
Query reformulation techniques based on query logs have recently proven to be effective for web queries. However, when initial queries have reasonably good quality, these techniques are often not reliable enough to identify the helpful reformulations among the suggested queries. In this paper, we show that we can use as few as two features to rerank a list of reformulated queries, or expanded queries to be specific, generated by a log-based query reformulation technique. Our results across five TREC collections suggest that there are consistently more useful reformulations in the first two positions in the new ranked list than there were initially, which leads to statistically significant improvements in retrieval effectiveness.
Keywords: learning to rank, query expansion, query log, query performance predictor, query reformulation
Many are better than one: improving multi-document summarization via weighted consensus BIBAKFull-Text 809-810
  Dingding Wang; Tao Li
Given a collection of documents, various multi-document summarization methods have been proposed to generate a short summary. However, few studies have been reported on aggregating different summarization methods to possibly generate better summarization results. We propose a weighted consensus summarization method to combine the results from single summarization systems. Experimental results on DUC2004 data sets demonstrate the performance improvement by aggregating multiple summarization systems, and our proposed weighted consensus summarization method outperforms other combination methods.
Keywords: summarization, weighted consensus
Exploring the use of labels to shortcut search trails BIBAKFull-Text 811-812
  Ryen W. White; Raman Chandrasekar
Search trails comprising queries and Web page views are created as searchers engage in information-seeking activity online. During known-item search (where the objective may be to locate a target Web page), searchers may waste valuable time repeatedly reformulating queries as they attempt to locate an elusive page. Trail shortcuts help users bypass unnecessary queries and get them to their desired destination faster. In this poster we present a comparative oracle study of techniques to shortcut sub-optimal search trails using labels derived from social bookmarking, anchor text, query logs, and a human-computation game. We show that labels can help users reach target pages efficiently, that the label sources perform differently, and that shortcuts are potentially most useful when the target is challenging to find.
Keywords: anchor text, click graph, labels, social bookmarks, trails
Investigating the suboptimality and instability of pseudo-relevance feedback BIBAKFull-Text 813-814
  Raghavendra Udupa; Abhijit Bhole
Although Pseudo-Relevance Feedback (PRF) techniques improve average retrieval performance at the price of high variance, not much is known about their optimality and the reasons for their instability. In this work, we study more than 800 topics from several test collections including the TREC Robust Track and show that PRF techniques are highly suboptimal, i.e. they do not make the fullest utilization of pseudo-relevant documents and under-perform. A careful selection of expansion terms from the pseudo-relevant document with the help of an oracle can actually improve retrieval performance dramatically (by > 60%). Further, we show that instability in PRF techniques is mainly due to wrong selection of expansion terms from the pseudo-relevant documents. Our findings emphasize the need to revisit the problem of term selection to make a break through in PRF.
Keywords: pseudo-relevance feedback
From fusion to re-ranking: a semantic approach BIBAKFull-Text 815-816
  Annalina Caputo; Pierpaolo Basile; Giovanni Semeraro
A number of works have shown that the aggregation of several Information Retrieval (IR) systems works better than each system working individually. Nevertheless, early investigation in the context of CLEF Robust-WSD task, in which semantics is involved, showed that aggregation strategies achieve only slight improvements.
   This paper proposes a re-ranking approach which relies on inter-document similarities. The novelty of our idea is twofold: the output of a semantic based IR system is exploited to re-weigh documents and a new strategy based on Semantic Vectors is used to compute inter-document similarities.
Keywords: re-ranking, semantics, wordnet
High precision opinion retrieval using sentiment-relevance flows BIBAKFull-Text 817-818
  Seung-Wook Lee; Jung-Tae Lee; Young-In Song; Hae-Chang Rim
Opinion retrieval involves the measuring of opinion score of a document about the given topic. We propose a new method, namely sentiment-relevance flow, that naturally unifies the topic relevance and the opinionated nature of a document. Experiments conducted over a large-scaled Web corpus show that the proposed approach improves performance of opinion retrieval in terms of precision at top ranks.
Keywords: opinion retrieval, sentiment analysis, sentiment-relevance flow
Ontology-enriched multi-document summarization in disaster management BIBAKFull-Text 819-820
  Lei Li; Dingding Wang; Chao Shen; Tao Li
In this poster, we propose a novel document summarization approach named Ontology-enriched Multi-Document Summarization (OMS) for utilizing background knowledge to improve summarization results. OMS first maps the sentences of input documents onto an ontology, then links the given query to a specific node in the ontology, and finally extracts the summary from the sentences in the subtree rooted at the query node. By using the domain-related ontology, OMS can better capture the semantic relevance between the query and the sentences, and thus lead to better summarization results. As a byproduct, the final summary generated by OMS can be represented as a tree showing the hierarchical relationships of the extracted sentences. Evaluation results on the collection of press releases by Miami-Dade County Department of Emergency Management during Hurricane Wilma in 2005 demonstrate the efficacy of OMS.
Keywords: disaster management, multi-document summarization, ontology
Multi-view clustering of multilingual documents BIBAKFull-Text 821-822
  Young-Min Kim; Massih-Reza Amini; Cyril Goutte; Patrick Gallinari
We propose a new multi-view clustering method which uses clustering results obtained on each view as a voting pattern in order to construct a new set of multi-view clusters. Our experiments on a multilingual corpus of documents show that performance increases significantly over simple concatenation and another multi-view clustering technique.
Keywords: PLSA, multi-view learning, multilingual document clustering
A stack decoder approach to approximate string matching BIBAKFull-Text 823-824
  Juan M. Huerta
We present a new efficient algorithm for top-N match retrieval of sequential patterns. Our approach is based on an incremental approximation of the string edit distance using index information and a stack based search. Our approach produces hypotheses with average edit error of about 0.29 edits from the optimal SED result while using only about 5% of the CPU computation.
Keywords: A* search, stack decoder, string edit distance, string matching
Late fusion of compact composite descriptors for retrieval from heterogeneous image databases BIBAKFull-Text 825-826
  Savvas A. Chatzichristofis; Avi Arampatzis
Compact composite descriptors (CCDs) are global image features, capturing more than one types of information at the same time in a very compact representation. Their quality has so far been evaluated in retrieval from several homogeneous databases containing images of only the type that each CCD is intended for, and has been found better than other descriptors in the literature such as the MPEG-7 descriptors. In this study, we consider heterogeneous databases and investigate query-time fusion techniques for CCDs. The results show that fusion is beneficial, even with simple score normalization and combination methods due to the compatibility of the score distributions produced by the CCDs considered.
Keywords: CCD, combination, fusion, image retrieval, normalization
Inferring user intent in web search by exploiting social annotations BIBAKFull-Text 827-828
  Jose M. Conde; David Vallet; Pablo Castells
In this paper, we present a folksonomy-based approach for implicit user intent extraction during a Web search process. We present a number of result re-ranking techniques based on this representation that can be applied to any Web search engine. We perform a user experiment the results of which indicate that this type of representation is better at context extraction than using the actual textual content of the document.
Keywords: context, folksonomy, web search
Query term ranking based on dependency parsing of verbose queries BIBAKFull-Text 829-830
  Jae Hyun Park; W. Bruce Croft
Query term ranking approaches are used to select effective terms from a verbose query by ranking terms. Features used for query term ranking and selection in previous work do not consider grammatical relationships between terms. To address this issue, we use syntactic features extracted from dependency parsing results of verbose queries. We also modify the method for measuring the effectiveness of query terms for query term ranking.
Keywords: dependency parse, query reformulation, query term ranking
A ranking approach to target detection for automatic link generation BIBAKFull-Text 831-832
  Jiyin He; Maarten de Rijke
We focus on the task of target detection in automatic link generation with Wikipedia, i.e., given an N-gram in a snippet of text, find the relevant Wikipedia concepts that explain or provide background knowledge for it. We formulate the task as a ranking problem and investigate the effectiveness of learning to rank approaches and of the features that we use to rank the target concepts for a given N-gram. Our experiments show that learning to rank approaches outperform traditional binary classification approaches. Also, our proposed features are effective both in binary classification and learning to rank settings.
Keywords: disambiguation, learning to rank, link generation, wikipedia
Probabilistic latent maximal marginal relevance BIBAKFull-Text 833-834
  Shengbo Guo; Scott Sanner
Diversity has been heavily motivated in the information retrieval literature as an objective criterion for result sets in search and recommender systems. Perhaps one of the most well-known and most used algorithms for result set diversification is that of Maximal Marginal Relevance (MMR). In this paper, we show that while MMR is somewhat ad-hoc and motivated from a purely pragmatic perspective, we can derive a more principled variant via probabilistic inference in a latent variable graphical model. This novel derivation presents a formal probabilistic latent view of MMR (PLMMR) that (a) removes the need to manually balance relevance and diversity parameters, (b) shows that specific definitions of relevance and diversity metrics appropriate to MMR emerge naturally, and (c) formally derives variants of latent semantic indexing (LSI) similarity metrics for use in PLMMR. Empirically, PLMMR outperforms MMR with standard term frequency based similarity and diversity metrics since PLMMR maximizes latent diversity in the results.
Keywords: diversity, graphical models, maximal marginal relevance
Using local precision to compare search engines in consumer health information retrieval BIBAKFull-Text 835-836
  Carla Teixeira Lopes; Cristina Ribeiro
We have conducted a user study to evaluate several generalist and health-specific search engines on health information retrieval. Users evaluated the relevance of the top 30 documents of 4 search engines in two different health information needs. We introduce the concepts of local and global precision and analyze how they affect the evaluation. Results show that Google surpasses the precision of all other engines, including the health-specific ones, and that precision differs with the type of clinical question and its medical specialty.
Keywords: evaluation, health information retrieval, precision, user study
multi Searcher: can we support people to get information from text they can't read or understand? BIBAKFull-Text 837-838
  Farag Ahmed; Andreas Nürnberger
The goal of the proposed tool multi Searcher is to answer this research question: can we expect people to be able to get information from text in languages they can not read or understand? The proposed tool multi Searcher provides users with interactive contextual information that describes the translation in the user's own language so that the user has a certain degree of confidence about the translation. Therefore, the user is considered as an integral part of the retrieval process. The tool provides possibilities to interactively select relevant terms from contextual information in order to improve the translation and thus improve the cross lingual information retrieval (CLIR) process.
Keywords: Arabic, cross lingual information retrieval, word sense disambiguation
Linking wikipedia to the web BIBAKFull-Text 839-840
  Rianne Kaptein; Pavel Serdyukov; Jaap Kamps
We investigate the task of finding links from Wikipedia pages to external web pages. Such external links significantly extend the information in Wikipedia with information from the Web at large, while retaining the encyclopedic organization of Wikipedia. We use a language modeling approach to create a full-text and anchor text runs, and experiment with different document priors. In addition we explore whether social bookmarking site Delicious can be exploited to further improve our performance. We have constructed a test collection of 53 topics, which are Wikipedia pages on different entities. Our findings are that the anchor text index is a very effective method to retrieve home pages. Url class and anchor text length priors and their combination leads to the best results. Using Delicious on its own does not lead to very good results, but it does contain valuable information. Combining the best anchor text run and the Delicious run leads to further improvements.
Keywords: entity search, link detection, wikipedia
Short text classification in twitter to improve information filtering BIBAKFull-Text 841-842
  Bharath Sriram; Dave Fuhry; Engin Demir; Hakan Ferhatosmanoglu; Murat Demirbas
In microblogging services such as Twitter, the users may become overwhelmed by the raw data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such as "Bag-Of-Words" have limitations. To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text. The proposed approach effectively classifies the text to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.
Keywords: classification, feature selection, short text, twitter
A framework for BM25F-based XML retrieval BIBAKFull-Text 843-844
  Kelly Y. Itakura; Charles L. A. Clarke
We evaluate a framework for BM25F-based XML element retrieval. The framework gathers contextual information associated with each XML element into an associated field, which we call a characteristic field. The contents of the element and the contents of the characteristic field are then treated as distinct fields for BM25F weighting purposes. Evidence supporting this framework is drawn from both our own experiments and experiments reported in related work.
Keywords: BM25, BM25F, XML retrieval, book search, wikipedia
Can search systems detect users' task difficulty?: some behavioral signals BIBAKFull-Text 845-846
  Jingjing Liu; Chang Liu; Jacek Gwizdka; Nicholas J. Belkin
In this paper, we report findings on how user behaviors vary in tasks with different difficulty levels as well as of different types. Two behavioral signals: document dwell time and number of content pages viewed per query, were found to be able to help the system detect when users are working with difficult tasks.
Keywords: dwell time, first dwell time, queries, task difficulty, task type
Query log analysis in the context of information retrieval for children BIBAKFull-Text 847-848
  Sergio Duarte Torres; Djoerd Hiemstra; Pavel Serdyukov
In this paper we analyze queries and sessions intended to satisfy children's information needs using a large-scale query log. The aim of this analysis is twofold: i) To identify differences between such queries and sessions, and general queries and sessions; ii) To enhance the query log by including annotations of queries, sessions, and actions for future research on information retrieval for children. We found statistically significant differences between the set of general purpose and queries seeking for content intended for children. We show that our findings are consistent with previous studies on the physical behavior of children using Web search engines.
Keywords: query intent, query log analysis, query representation
Transitive history-based query disambiguation for query reformulation BIBAKFull-Text 849-850
  Karim Filali; Anish Nair; Chris Leggetter
We present a probabilistic model of a user's search history and a target query reformulation. We derive a simple transitive similarity algorithm for disambiguating queries and improving history-based query reformulation accuracy. We compare the merits of this approach to other methods and present results on both examples assessed by human editors and on automatically-labeled click data.
Keywords: graphical models, personalization, reformulation
Using flickr geotags to predict user travel behaviour BIBAKFull-Text 851-852
  Maarten Clements; Pavel Serdyukov; Arjen P. de Vries; Marcel J. T. Reinders
We propose a method to predict a user's favourite locations in a city, based on his Flickr geotags in other cities. We define a similarity between the geotag distributions of two users based on a Gaussian kernel convolution. The geotags of the most similar users are then combined to rerank the popular locations in the target city personalised for this user.
   We show that this method can give personalised travel recommendations for users with a clear preference for a specific type of landmark.
Keywords: flickr, geotag, recommendation
Metrics for assessing sets of subtopics BIBAKFull-Text 853-854
  Filip Radlinski; Martin Szummer; Nick Craswell
To evaluate the diversity of search results, test collections have been developed that identify multiple intents for each query. Intents are the different meanings or facets that should be covered in a search results list. This means that topic development involves proposing a set of intents for each query. We propose four measurable properties of query-to-intent mappings, allowing for more principled topic development for such test collections.
Keywords: diversity, novelty, subtopic
Learning to select rankers BIBAKFull-Text 855-856
  Niranjan Balasubramanian; James Allan
Combining evidence from multiple retrieval models has been widely studied in the context of distributed search, metasearch and rank fusion. Much of the prior work has focused on combining retrieval scores (or the rankings) assigned by different retrieval models or ranking algorithms. In this work, we focus on the problem of choosing between retrieval models using performance estimation. We propose modeling the differences in retrieval performance directly by using rank-time features -- features that are available to the ranking algorithms -- and the retrieval scores assigned by the ranking algorithms. Our experimental results show that when choosing between two rankers, our approach yields significant improvements over the best individual ranker.
Keywords: combining searches, learning to rank, metasearch
VisualSum: an interactive multi-document summarization system using visualization BIBAKFull-Text 857-858
  Yi Zhang; Dingding Wang; Tao Li
Given a collection of documents, most of existing multidocument summarization methods automatically generate a static summary for all the users. However, different users may have different opinions on the documents, thus there is a necessity for improving users' interactions in the summarization process. In this paper, we propose an interactive document summarization system using information visualization techniques.
Keywords: multi-document summarization, visualization
Web page publication time detection and its application for page rank BIBAKFull-Text 859-860
  Zhumin Chen; Jun Ma; Chaoran Cui; Hongxing Rui; Shaomang Huang
Publication Time (P-time for short) of Web pages is often required in many application areas. In this paper, we address the issue of P-time detection and its application for page rank. We first propose an approach to extract P-time for a page with explicit P-time displayed on its body. We then present a method to infer P-time for a page without P-time. We further introduce a temporal sensitive page rank model using P-time. Experiments demonstrate that our methods outperform the baseline methods significantly.
Keywords: page rank, publication time extraction, publication time inference, temporal information detection
HCC: a hierarchical co-clustering algorithm BIBAKFull-Text 861-862
  Jingxuan Li; Tao Li
In this poster, we develop a novel method, called HCC, for hierarchical co-clustering. HCC brings together two interrelated but distinct themes from clustering: hierarchical clustering and co-clustering.
   The goal of the former theme is to organize clusters into a hierarchy that facilitates browsing and navigation, while the goal of the latter theme is to cluster different types of data simultaneously by making use of the relationship information. Our initial empirical results are promising and they demonstrate that simultaneously attempting both these goals in a single model leads to improvements over models that focus on a single goal.
Keywords: co-clustering, hierarchical
Retrieval system evaluation: automatic evaluation versus incomplete judgments BIBAKFull-Text 863-864
  Claudia Hauff; Franciska de Jong
In information retrieval (IR), research aiming to reduce the cost of retrieval system evaluations has been conducted along two lines: (i) the evaluation of IR systems with reduced amounts of manual relevance assessments, and (ii) the fully automatic evaluation of IR systems, thus foregoing the need for manual assessments altogether. The proposed methods in both areas are commonly evaluated by comparing their performance estimates for a set of systems to a ground truth (provided for instance by evaluating the set of systems according to mean average precision). In contrast, in this poster we compare an automatic system evaluation approach directly to two evaluations based on incomplete manual relevance assessments. For the particular case of TREC's Million Query track, we show that the automatic evaluation leads to results which are highly correlated to those achieved by approaches relying on incomplete manual judgments.
Keywords: automatic system evaluation
Aspect presence verification conditional on other aspects BIBAKFull-Text 865-866
  Dmitri Roussinov
I have shown that the presence of difficult query aspects that are revealed only implicitly (e.g. exploration, opposition, achievements, cooperation, risks) can be improved by taking advantage of the known presence of other, easier to verify query aspects. The approach proceeds by mining a large external corpus and results in substantial improvements in re-ranking the subset of the top retrieved documents.
Keywords: external corpus., information retrieval, machine learning
The value of visual elements in web search BIBAKFull-Text 867-868
  Marilyn Ostergren; Seung-yon Yu; Efthimis N. Efthimiadis
We used eye-tracking equipment to observe 36 participants as they performed three search tasks using three graphically-enhanced web search interfaces (Kartoo, SearchMe and Viewzi). In this poster we describe findings of the study focusing on how the presentation of SERP results influences how the user scans and attends to the results, and the user satisfaction with these search engines.
Keywords: eye-tracking study, search engine evaluation, search engine results page display (SERP), user study
Diversification of search results using webgraphs BIBAKFull-Text 869-870
  Praveen Chandar; Ben Carterette
A set of words is often insufficient to express a user's information need. In order to account for various information needs associated with a query, diversification seems to be a reasonable strategy. By diversifying the result set, we increase the probability of results being relevant to the user's information needs when the given query is ambiguous. A diverse result set must contain a set of documents that cover various subtopics for a given query. We propose a graph based method which exploits the link structure of the web to return a ranked list that provides complete coverage for a query. Our method not only provides diversity to the results set, but also avoids excessive redundancy. Moreover, the probability of relevance of a document is conditioned on the documents that appear before it in the result list. We show the effectiveness of our method by comparing it with a query-likelihood model as the baseline.
Keywords: diversity, information retrieval, webgraphs
Capturing page freshness for web search BIBAKFull-Text 871-872
  Na Dai; Brian D. Davison
Freshness has been increasingly realized by commercial search engines as an important criteria for measuring the quality of search results. However, most information retrieval methods focus on the relevance of page content to given queries without considering the recency issue. In this work, we mine page freshness from web user maintenance activities and incorporate this feature into web search. We first quantify how fresh the web is over time from two distinct perspectives -- the page itself and its in-linked pages -- and then exploit a temporal correlation between two types of freshness measures to quantify the confidence of page freshness. Results demonstrate page freshness can be better quantified when combining with temporal freshness correlation. Experiments on a real-world archival web corpus show that incorporating the combined page freshness into the searching process can improve ranking performance significantly on both relevance and freshness.
Keywords: temporal correlation, web freshness, web search
S-PLASA+: adaptive sentiment analysis with application to sales performance prediction BIBAKFull-Text 873-874
  Yang Liu; Xiaohui Yu; Xiangji Huang; Aijun An
Analyzing the large volume of online reviews would produce useful knowledge that could be of economic values to vendors and other interested parties. In particular, the sentiments expressed in the online reviews have been shown to be strongly correlated with the sales performance of products. In this paper, we present an adaptive sentiment analysis model called S-PLSA+, which aims to capture the hidden sentiment factors in the reviews with the capability to be incrementally updated as more data become available. We show how S-PLSA+ can be applied to sales performance prediction using an ARSA model developed in previous literature. A case study is conducted in the movie domain, and results from preliminary experiments confirm the effectiveness of the proposed model.
Keywords: prediction, review mining, sentiment analysis
Supervised query modeling using wikipedia BIBAKFull-Text 875-876
  Edgar Meij; Maarten de Rijke
We use Wikipedia articles to semantically inform the generation of query models. To this end, we apply supervised machine learning to automatically link queries to Wikipedia articles and sample terms from the linked articles to re-estimate the query model. On a recent large web corpus, we observe substantial gains in terms of both traditional metrics and diversity measures.
Keywords: machine learning, query modeling, wikipedia
A two-stage model for blog feed search BIBAKFull-Text 877-878
  Wouter Weerkamp; Krisztian Balog; Maarten de Rijke
We consider blog feed search: identifying relevant blogs for a given topic. An individual's search behavior often involves a combination of exploratory behavior triggered by salient features of the information objects being examined plus goal-directed in-depth information seeking behavior. We present a two-stage blog feed search model that directly builds on this insight. We first rank blog posts for a given topic, and use their parent blogs as selection of blogs that we rank using a blog-based model.
Keywords: blog feed search, two-stage model
Machine learned ranking of entity facets BIBAKFull-Text 879-880
  Roelof van Zwol; Lluís Garcia Pueyo; Mridul Muralidharan; Börkur Sigurbjörnsson
The research described in this paper forms the backbone of a service that enables the faceted search experience of the Yahoo! search engine. We introduce an approach for a machine learned ranking of entity facets based on user click feedback and features extracted from three different ranking sources. The objective of the learned model is to predict the click-through rate on an entity facet. In an empirical evaluation we compare the performance of gradient boosted decision trees (GBDT) against a linear combination of features on two different click feedback models using the raw click-through rate (CTR), and click over expected clicks (COEC). The results show a significant improvement in retrieval performance, in terms of discounted cumulated gain, when ranking entity facets with GBDT trained on the COEC model. Most notably this is true when evaluated against the CTR test set.
Keywords: GBDT, click feedback, ranking entity facets
User comments for news recommendation in social media BIBAKFull-Text 881-882
  Jia Wang; Qing Li; Yuanzhu Peter Chen
Reading and Commenting online news is becoming a common user behavior in social media. Discussion in the form of comments following news postings can be effectively facilitated if the service provider can recommend articles based on not only the original news itself but also the thread of changing comments. This turns the traditional news recommendation to a "discussion moderator" that can intelligently assist online forums. In this work, we present a framework to recommend relevant information in the forum-based social media using user comments. When incorporating user comments, we consider structural and semantic information carried by them. Experiments indicate that our proposed solutions provide an effective recommendation service.
Keywords: comments, information filtering, news recommendation, social media
Incorporating global information into named entity recognition systems using relational context BIBAKFull-Text 883-884
  Yuval Merhav; Filipe Mesquita; Denilson Barbosa; Wai Gen Yee; Ophir Frieder
The state-of-the-art in Named Entity Recognition relies on a combination of local features of the text and global knowledge to determine the types of the recognized entities. This is problematic in some cases, resulting in entities being classified as belonging to the wrong type. We show that using global information about the corpus improves the accuracy of type identification. We explore the notion of a global domain frequency that relates relation identifying terms with pairs of entity types which are used in that relation. We use this to identify entities whose types are not compatible with the terms they co-occur in the text. Our results on a large corpus of social media content allows the identification of mistyped entities with 70% accuracy.
Keywords: domain frequency, named entity recognition
Achieving high accuracy retrieval using intra-document term ranking BIBAKFull-Text 885-886
  Hyun-Wook Woo; Jung-Tae Lee; Seung-Wook Lee; Young-In Song; Hae-Chang Rim
Most traditional ranking models roughly score the relevance of a given document by observing simple term statistics, such as the occurrence of query terms within the document or within the collection. Intuitively, the relative importance of query terms with regard to other individual non-query terms in a document can also be exploited to promote the ranks of documents in which the query is dedicated as the main topic. In this paper, we introduce a simple technique named intra-document term ranking, which involves ranking all the terms in a document according to their relative importance within that particular document. We demonstrate that the information regarding the rank positions of given query terms within the intra-document term ranking can be useful for enhancing the precision of top-retrieved results by traditional ranking models. Experiments are conducted on three standard TREC test collections.
Keywords: inter-document term ranking, precision at top ranks
Author interest topic model BIBAKFull-Text 887-888
  Noriaki Kawamae
This paper presents a hierarchical topic model that simultaneously captures topics and author's interests. Our proposal, the Author Interest Topic model (AIT), introduces a latent variable with a separate probability distribution over topics into each document. Experiments on a research paper corpus show that the AIT is useful as a generative model.
Keywords: latent variable modeling, topic modeling
On the relationship between effectiveness and accessibility BIBAKFull-Text 889-890
  Leif Azzopardi; Richard Bache
Typically the evaluation of Information Retrieval (IR) systems is focused upon two main system attributes: efficiency and effectiveness. However, it has been argued that it is also important to consider accessibility, i.e. the extent to which the IR system makes information easily accessible. But, it is unclear how accessibility relates to typical IR evaluation, and specifically whether there is a trade-off between accessibility and effectiveness. In this poster, we empirically explore the relationship between effectiveness and accessibility to determine whether the two objectives i.e. maximizing effectiveness and maximizing accessibility, are compatible, or not. To this aim, we empirically examine this relationship using two popular IR models and explore the trade-off between access and performance as these models are tuned.
Keywords: accessibility, evaluation, findability, information retrieval, retrievability
Visual concept-based selection of query expansions for spoken content retrieval BIBAKFull-Text 891-892
  Stevan Rudinac; Martha Larson; Alan Hanjalic
In this paper we present a novel approach to semantic-theme-based video retrieval that considers entire videos as retrieval units and exploits automatically detected visual concepts to improve the results of retrieval based on spoken content. We deploy a query prediction method that makes use of a coherence indicator calculated on top returned documents and taking into account the information about visual concepts presence in videos to make a choice between query expansion methods. The main contribution of our approach is in its ability to exploit noisy shot-level concept detection to improve semantic-theme-based video retrieval. Strikingly, improvement is possible using an extremely limited set of concepts. In the experiments performed on TRECVID 2007 and 2008 datasets our approach shows an interesting performance improvement compared to the best performing baseline.
Keywords: concept-based video indexing, query expansion, query performance prediction, semantic-theme-based video retrieval, video-level retrieval
Mining adjacent markets from a large-scale ads video collection for image advertising BIBAKFull-Text 893-894
  Guwen Feng; Xin-Jing Wang; Lei Zhang; Wei-Ying Ma
The research on image advertising is still in its infancy. Most previous approaches suggest ads by directly matching an ad to a query image, which lacks the power to identify ads from adjacent market. In this paper, we tackle the problem by mining knowledge on adjacent markets from ads videos with a novel Multi-Modal Dirichlet Process Mixture Sets model, which is a unified model of (video frames) clustering and (ads) ranking. Our approach is not only capable of discovering relevant ads (e.g. car ads for a query car image), but also suggesting ads from adjacent markets (e.g. tyre ads). Experimental results show that our proposed approach is fairly effective.
Keywords: adjacent marketing, image advertising, video retrieval
A co-learning framework for learning user search intents from rule-generated training data BIBAKFull-Text 895-896
  Jun Yan; Zeyu Zheng; Li Jiang; Yan Li; Shuicheng Yan; Zheng Chen
Learning to understand user search intents from their online behaviors is crucial for both Web search and online advertising. However, it is a challenging task to collect and label a sufficient amount of high quality training data for various user intents such as "compare products", "plan a travel", etc. Motivated by this bottleneck, we start with some user common sense, i.e. a set of rules, to generate training data for learning to predict user intents. The rule-generated training data are however hard to be used since these data are generally imperfect due to the serious data bias and possible data noises. In this paper, we introduce a Co-learning Framework (CLF) to tackle the problem of learning from biased and noisy rule-generated training data. CLF firstly generates multiple sets of possibly biased and noisy training data using different rules, and then trains the individual user search intent classifiers over different training datasets independently. The intermediate classifiers are then used to categorize the training data themselves as well as the unlabeled data. The confidently classified data by one classifier are added to other training datasets and the incorrectly classified ones are instead filtered out from the training datasets. The algorithmic performance of this iterative learning procedure is theoretically guaranteed.
Keywords: classification, search engine, user intent
Learning the click-through rate for rare/new ads from similar ads BIBAKFull-Text 897-898
  Kushal S. Dave; Vasudeva Varma
Ads on the search engine (SE) are generally ranked based on their Click-through rates (CTR). Hence, accurately predicting the CTR of an ad is of paramount importance for maximizing the SE's revenue. We present a model that inherits the click information of rare/new ads from other semantically related ads. The semantic features are derived from the query ad click-through graphs and advertisers account information. We show that the model learned using these features give a very good prediction for the CTR values.
Keywords: click-through rate prediction, ranking, sponsored search
Graphical models for text: a new paradigm for text representation and processing BIBAKFull-Text 899-900
  Charu Aggarwal; Peixiang Zhao
Almost all text applications use the well known vector-space model for text representation and analysis. While the vector-space model has proven itself to be an effective and efficient representation for mining purposes, it does not preserve information about the ordering of the words in the representation. In this paper, we will introduce the concept of distance graph representations of text data. Such representations preserve distance and ordering information between the words, and provide a much richer representation of the underlying text. This approach enables knowledge discovery from text which is not possible with the use of a pure vector-space representation, because it loses much less information about the ordering of the underlying words. Furthermore, this representation does not require the development of new mining and management techniques. This is because the technique can also be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing techniques for graph and XML data can be directly leveraged with this new representation. Thus, a much wider spectrum of algorithms is available for processing this representation.
Keywords: text representations
A survival modeling approach to biomedical search result diversification using wikipedia BIBAKFull-Text 901-902
  Xiaoshi Yin; Jimmy Xiangji Huang; Xiaofeng Zhou; Zhoujun Li
In this paper, we propose a probabilistic survival model derived from the survival analysis theory for measuring aspect novelty. The retrieved documents' query-relevance and novelty are combined at the aspect level for re-ranking. Experiments conducted on the TREC 2006 and 2007 Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking diversity for biomedical information retrieval.
Keywords: biomedical IR, diversity, survival modeling


Low cost evaluation in information retrieval BIBAKFull-Text 903
  Ben Carterette; Evangelos Kanoulas; Emine Yilmaz
Search corpora are growing larger and larger: over the last 10 years, the IR research community has moved from the several hundred thousand documents on the TREC disks to the tens of millions of U.S. government web pages of GOV2 to the one billion general-interest web pages in the new ClueWeb09 collection. But traditional means of acquiring relevance judgments and evaluating -- e.g. pooling documents to calculate average precision -- do not seem to scale well to these new large collections. They require substantially more cost in human assessments for the same reliability in evaluation; if the additional cost goes over the assessing budget, errors in evaluation are inevitable.
   Some alternatives to pooling that support low-cost and reliable evaluation have recently been proposed. A number of them have already been used in TREC and other evaluation forums (TREC Million Query, Legal, Chemical, Web, Relevance Feedback Tracks, CLEF Patent IR, INEX). Evaluation via implicit user feedback (e.g. clicks) and crowdsourcing have also recently gained attention in the community. Thus it is important that the methodologies, the analysis they support, and their strengths and weaknesses are well-understood by the IR community. Furthermore, these approaches can support small research groups attempting to start investigating new tasks on new corpora with relatively low cost. Even groups that do not participate in TREC, CLEF, or other evaluation conferences can benefit from understanding how these methods work, how to use them, and what they mean as they build test collections for tasks they are interested in.
   The goal of this tutorial is to provide attendees with a comprehensive overview of techniques to perform low cost (in terms of judgment effort) evaluation. A number of topics will be covered, including alternatives to pooling, evaluation measures robust to incomplete judgments, evaluating with no relevance judgments, statistical inference of evaluation metrics, inference of relevance judgments, query selection, techniques to test the reliability of the evaluation and reusability of the constructed collections.
   The tutorial should be of interest to a wide range of attendees. Those new to the field will come away with a solid understanding of how low cost evaluation methods can be applied to construct inexpensive test collections and evaluate new IR technology, while those with intermediate knowledge will gain deeper insights and further understand the risks and gains of low cost evaluation. Attendees should have a basic knowledge of the traditional evaluation framework (Cranfield) and metrics (such as average precision and nDCG), along with some basic knowledge on probability theory and statistics. More advanced concepts will be explained during the tutorial.
Keywords: evaluation, information retrieval, test collections
Learning to rank for information retrieval BIBAKFull-Text 904
  Tie-Yan Liu
This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.
Keywords: information retrieval, learning theory, learning to rank, ranking models
Introduction to probabilistic models in IR BIBAKFull-Text 905
  Victor P. Lavrenko
Most of today's state-of-the-art retrieval models, including BM25 and language modeling, are grounded in probabilistic principles. Having a working understanding of these principles can help researchers understand existing retrieval models better and also provide industrial practitioners with an understanding of how such models can be applied to real world problems.
   This half-day tutorial will cover the fundamentals of two dominant probabilistic frameworks for Information Retrieval: the classical probabilistic model and the language modeling approach. The elements of the classical framework will include the probability ranking principle, the binary independence model, the 2-Poisson model, and the widely used BM25 model. Within language modeling framework, we will discuss various distributional assumptions and smoothing techniques. Special attention will be devoted to the event spaces and independence assumptions underlying each approach. The tutorial will outline several techniques for modeling term dependence and addressing vocabulary mismatch. We will also survey applications of probabilistic models in the domains of cross-language and multimedia retrieval. The tutorial will conclude by suggesting a set of open problems in probabilistic models of IR.
   Attendees should have a basic familiarity with probability and statistics. A brief refresher of basic concepts, including random variables, event spaces, conditional probabilities, and independence will be given at the beginning of the tutorial. In addition to slides, some hands on exercises and examples will be used throughout the tutorial.
Keywords: estimation, language modeling, probability ranking principle, term dependence
Multimedia information retrieval BIBAKFull-Text 906
  Stefan Rueger
This tutorial is concerned with creating the best possible multimedia search experience. The intriguing bit here is that the query itself can be a multimedia excerpt: For example, when you walk around in an unknown place and stumble across an interesting landmark, would it not be great if you could just take a picture with your mobile phone and send it to a service that finds a similar picture in a database and tells you more about the building -- and about its significance for that matter?
   The ideas for this type of search have been around for a decade, but this tutorial will look at recent successes and take stock of the state-of-the-art. It examines the full matrix of a variety of query modes versus document types. How do you retrieve a music piece by humming? What if you want to find news video clips on forest fires using a still image? The tutorial discusses underlying techniques and common approaches to facilitate multimedia search engines: metadata driven search; piggy-back text search where automated processes create text surrogates for multimedia; automated image annotation; content-based search. The latter is studied in more depth looking at features and distances, and how to effectively combine them for efficient retrieval, to a point where the participants have the ingredients and recipe in their hands for building their own visual search engines.
   Supporting users in their resource discovery mission when hunting for multimedia material is not a technological indexing problem alone. We will briefly look at interactive ways of engaging with repositories through browsing and relevance feedback, roping in geographical context, and providing visual summaries for videos. The tutorial emphasises state-of-the-art research in the area of multimedia information retrieval, which gives an indication of the research and development trends and, thereby, a glimpse of the future world.
Keywords: automated image annotation, browsing and relevance feedback, content-based search, metadata driven search, multimedia information retrieval, piggy-back text search
Web retrieval: the role of users BIBAKFull-Text 907
  Ricardo Baeza-Yates; Yoelle Maarek
Web retrieval methods have evolved through three major steps in the last decade or so. They started from standard document-centric IR in the early days of the Web, then made a major step forward by leveraging the structure of the Web, using link analysis techniques in both crawling and ranking challenges. A more recent, no less important but maybe more discrete step forward, has been to enter the user in this equation in two ways: (1) implicitly, through the analysis of usage data captured by query logs, and session and click information in general, the goal being to improve ranking as well as to measure user's happiness and engagement; (2) explicitly, by offering novel interactive features; the goal here being to better answer users' needs. In this tutorial, we will cover the user-related challenges associated with the implicit and explicit role of users in Web retrieval. We will review and discuss challenges associated with two types of activities, namely: usage data analysis and metrics and user interaction. The goal of this tutorial is to teach the key principles and technologies behind the activities and challenges briefly outlined above, bring new understanding and insights to the attendees, and hopefully foster future research.
Keywords: user interaction, web retrieval
Information retrieval challenges in computational advertising BIBAKFull-Text 908
  Andrei Broder; Evgeniy Gabrilovich; Vanja Josifovski
Computational advertising is an emerging scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central challenge of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The aim of this tutorial is to present the state of the art in Computational Advertising, in particular in its IR-related aspects, and to expose the participants to the current research challenges in this field. The tutorial does not assume any prior knowledge of Web advertising, and will begin with a comprehensive background survey. Going deeper, our focus will be on using a textual representation of the user context to retrieve relevant ads. At first approximation, this process can be reduced to a conventional setup by constructing a query that describes the user context and executing the query against a large inverted index of ads. We show how to augment this approach using query expansion and text classification techniques tuned for the ad-retrieval problem. In particular, we show how to use the Web as a repository of query-specific knowledge and use the Web search results retrieved by the query as a form of a relevance feedback and query expansion. We also present solutions that go beyond the conventional bag of words indexing by constructing additional features using a large external taxonomy and a lexicon of named entities obtained by analyzing the entire Web as a corpus. The last part of the tutorial will be devoted to a potpourri of recent research results and open problems inspired by Computational Advertising challenges in text summarization, natural language generation, named entity recognition, computer-human interaction, and other SIGIR-relevant areas.
Keywords: content match, online advertising, sponsored search
Extraction of open-domain class attributes from text: building blocks for faceted search BIBAKFull-Text 909
  Marius Pasca
Knowledge automatically extracted from text captures instances, classes of instances and relations among them. In particular, the acquisition of class attributes (e.g., "top speed", "body style" and "number of cylinders" for the class of "sports cars") from text is a particularly appealing task and has received much attention recently, given its natural fit as a building block towards the far-reaching goal of constructing knowledge bases from text. This tutorial provides an overview of extraction methods developed in the area of Web-based information extraction, with the purpose of acquiring attributes of open-domain classes. The attributes are extracted for classes organized either as a flat set or hierarchically. The extraction methods operate over unstructured or semi-structured text available within collections of Web documents, or over relatively more intriguing data sources consisting of anonymized search queries. The methods take advantage of weak supervision provided in the form of seed examples or small amounts of annotated data, or draw upon knowledge already encoded within human-compiled resources (e.g., Wikipedia). The more ambitious methods, aiming at acquiring as many accurate attributes from text as possible for hundreds or thousands of classes covering a wide range of domains of interest, need to be designed to scale to Web collections. This restriction has significant consequences on the overall complexity and choice of underlying tools, in order for the extracted attributes to ultimately aid information retrieval in general and Web search in particular, by producing relevant attributes for open-domain classes, along with other types of relations among instances or among classes.
Keywords: attribute extraction, class attributes, information extraction, knowledge acquisition
From federated to aggregated search BIBAKFull-Text 910
  Fernando Diaz; Mounia Lalmas; Milad Shokouhi
Federated search refers to the brokered retrieval of content from a set of auxiliary retrieval systems instead of from a single, centralized retrieval system. Federated search tasks occur in, for example, digital libraries (where documents from several retrieval systems must be seamlessly merged) or peer-to-peer information retrieval (where documents distributed across a network of local indexes must be retrieved).
   In the context of web search, aggregated search refers to the integration of non-web content (e.g. images, videos, news articles, maps, tweets) into a web search result page. This is in contrast with classic web search where users are presented with a ranked list consisting exclusively of general web documents. As in other federated search situations, the non-web content is often retrieved from auxiliary retrieval systems (e.g. image or video databases, news indexes).
   Although aggregated search can be seen as an instance of federated search, several aspects make aggregated search a unique and compelling research topic. These include large sources of evidence (e.g. click logs) for deciding what non-web items to return, constrained interfaces (e.g. mobile screens), and a very heterogeneous set of available auxiliary resources (e.g. images, videos, maps, news articles). Each of these aspects introduces problems and opportunities not addressed in the federated search literature.
   Aggregated search is an important future research direction for information retrieval. All major search engines now provide aggregated search results. As the number of available auxiliary resources grows, deciding how to effectively surface content from each will become increasingly important.
   The goal of this tutorial is to provide an overview of federated search and aggregated search techniques for an intermediate information retrieval researcher. At the same time, the content will be valuable for practitioners in industry. We will take the audience through the most influential work in these areas and describe how they relate to real world aggregated search systems. We will also list some of the new challenges confronted in aggregated search and discuss directions for future work.
Keywords: aggregated search, distributed information retrieval, federated search, metasearch, universal search, vertical search
Estimating the query difficulty for information retrieval BIBAKFull-Text 911
  David Carmel; Elad Yom-Tov
Many information retrieval (IR) systems suffer from a radical variance in performance when responding to users' queries. Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor. Thus, it is desirable that IR systems will be able to identify "difficult" queries in order to handle them properly. Understanding why some queries are inherently more difficult than others is essential for IR, and a good answer to this important question will help search engines to reduce the variance in performance, hence better servicing their customer needs.
   The high variability in query performance has driven a new research direction in the IR field on estimating the expected quality of the search results, i.e. the query difficulty, when no relevance feedback is given. Estimating the query difficulty is a significant challenge due to the numerous factors that impact retrieval performance. Many prediction methods have been proposed recently. However, as many researchers observed, the prediction quality of state-of-the-art predictors is still too low to be widely used by IR applications. The low prediction quality is due to the complexity of the task, which involves factors such as query ambiguity, missing content, and vocabulary mismatch.
   The goal of this tutorial is to expose participants to the current research on query performance prediction (also known as query difficulty estimation). Participants will become familiar with states-of-the-art performance prediction methods, and with common evaluation methodologies for prediction quality. We will discuss the reasons that cause search engines to fail for some of the queries, and provide an overview of several approaches for estimating query difficulty. We then describe common methodologies for evaluating the prediction quality of those estimators, and some experiments conducted recently with their prediction quality, as measured over several TREC benchmarks. We will cover a few potential applications that can utilize query difficulty estimators by handling each query individually and selectively based on its estimated difficulty. Finally we will summarize with a discussion on open issues and challenges in the field.
Keywords: performance prediction, query difficulty estimation, retrieval robustness
Search and browse log mining for web information retrieval: challenges, methods, and applications BIBAKFull-Text 912
  Daxin Jiang; Jian Pei; Hang Li
Huge amounts of search log data have been accumulated in various search engines. Currently, a commercial search engine receives billions of queries and collects tera-bytes of log data on any single day. Other than search log data, browse logs can be collected by client-side browser plug-ins, which record the browse information if users' permissions are granted. Such massive amounts of search/browse log data, on the one hand, provide great opportunities to mine the wisdom of crowds and improve search results as well as online advertisement. On the other hand, designing effective and efficient methods to clean, model, and process large scale log data also presents great challenges.
   In this tutorial, we focus on mining search and browse log data for Web information retrieval. We consider a Web information retrieval system consisting of four components, namely, query understanding, document understanding, query-document matching, and user understanding. Accordingly, we organize the tutorial materials along these four aspects. For each aspect, we will survey the major tasks, challenges, fundamental principles, and state-of-the-art methods.
   The goal of this tutorial is to provide a systematic survey on large-scale search/browse log mining to the IR community. It will help IR researchers to get familiar with the core challenges and promising directions in log mining. At the same time, this tutorial may also serve the developers of Web information retrieval systems as a comprehensive and in-depth reference to the advanced log mining techniques.
Keywords: log data mining, search and browse logs
Information retrieval for e-discovery BIBAKFull-Text 913
  David D. Lewis
Discovery, the process under which parties to legal cases must reveal documents relevant to the disputed issues is a core aspect of trials in the United States, and a lesser but important factor in other countries. Discovery on documents stored in computerized systems (known variously as electronic discovery, e-discovery, e-disco, EDD, and ED) is increasingly the major factor in discovery, and has become a multi-billion dollar industry.
   I will discuss the basics of e-discovery, the scale and diversity of the materials involved, and the economics of identifying and reviewing potentially responsive material. I will then focus on three major IR areas of interest: search, supervised machine learning (including text classification and relevance feedback), and interface support for manual relevance assessment. For each, I will discuss technologies currently used in e-discovery, the evaluation methods applicable to measuring effectiveness, and existing research results not yet seeing commercial practice.
   I will also outline research directions that, if successfully pursued, would potentially be of great interest in e-discovery applications. A particular focus will be on areas where researchers can make progress without access to operational e-discovery environments or "realistic" test collections. Connections will be drawn with the use of IR in related tasks, such as enterprise search, criminal investigations, intelligence analysis, historical research, truth and reconciliation commissions, and freedom of information (open records or sunshine law) requests.
Keywords: OCR, backups, computer forensics, document formats, duplicate detection, e-mail, electronic mail, text mining

Doctoral consortium

On the mono- and cross-language detection of text reuse and plagiarism BIBAKFull-Text 914
  Alberto Barrón-Cedeño
Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of texts readily available. For instance, recent studies claim that nowadays a high rate of student reports include plagiarism, making manual plagiarism detection practically infeasible.
   Automatic plagiarism detection tools assist experts to analyse documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented accurate comparing models, making differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as words substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for accurate comparison of models. Regarding difficulties (i) and (ii) , we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided in two steps. First, a small subset of potential source documents D* in D is retrieved. The documents d in D* are the most related to dq and, therefore, the most likely to include the source of the plagiarised fragments in it. We performed this stage on the basis of the Kullback-Leibler distance, over a subsample of document's vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d in D* in order to identify potential cases of plagiarism and their source. This comparison was made on the basis of word n-grams, by considering n = {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user to take the final decision. Further experiments were done in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient and diverse fingerprinting and probabilistic models.
   One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is of high relevance, as the most of the information published is written in English, and authors in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and a comparable corpora, show that models of "standard" cross-language information retrieval are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or technical vocabulary), a simple comparison based on character n-grams seems to be the option. However, in those cases where the relation between the implied languages is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. In order to do that, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
Keywords: cross-language plagiarism detection, plagiarism detection, text similarity
User interface designs to support the social transfer of web search expertise BIBAKFull-Text 915
  Neema Moraveji
While there are many ways to develop search expertise, I maintain that most members of the general public do so in an inefficient manner. One reason is that, with current tools, is difficult to observe experts as a means of acquiring search expertise in a scalable fashion. This calls for a redesign of computer-mediated communication tools to make individual search strategies visible to other users. I present a research agenda to investigate this claim, which draws upon theories of social learning. I use design-based research to build novel systems that enable imitation-based learning of search expertise.
Keywords: expertise, information retrieval, learning, social, transfer
Leveraging user interaction and collaboration for improving multilingual information access in digital libraries BIBAKFull-Text 916
  Juliane Stiller
The goal of interactive cross-lingual information retrieval systems is to support users in formulating effective queries and selecting the documents which satisfy their information needs regardless of the language of these documents. This dissertation aims at harnessing user-system interaction, extracting the added value and integrating it back into the system to improve cross-lingual information retrieval for successive users. To achieve this, user input at different interaction points will be evaluated. This will, among others, include interaction during user-assisted query translations, implicit and explicit relevance feedback and social tags. To leverage this input, explorative studies need to be conducted to determine beneficial user input and the methods of extracting it.
Keywords: digital libraries, interactive clir, social tags
Entity information management in complex networks BIBAKFull-Text 917
  Yi Fang
Entity information management (EIM) deals with organizing, processing and delivering information about entities. Its emergence is a result of satisfying more sophisticated information needs that go beyond document search. In the recent years, entity retrieval has attracted much attention in the IR community. INEX has started the XML Entity Ranking track since 2007 and TREC has launched the Entity track since 2009 to investigate the problem of related entity finding. Some EIM problems go beyond retrieval and ranking such as: 1) entity profiling, which is about characterizing a specific entity, and 2) entity distillation, which is about discovering the trend about an entity. These problems have received less attention while they have many important applications.
   On the other hand, the entities in the real world or in the Web environment are usually not isolated. They are connected or related with each other in one way or another. For example, the coauthorship makes the authors with similar research interests be connected. The emergence of social media such as Facebook, Twitter and YouTube has further interweaved the related entities in a much larger scale. Millions of users in these sites can become friends, fans or followers of others, or taggers or commenters of different types of entities (e.g., bookmarks, photos and videos). These networks are complex in the sense that they are heterogeneous with multiple types of entities and of interactions, they are large-scale, they are multi-lingual, and they are dynamic. These features of the complex networks go beyond traditional social network analysis and require further research.
   In this proposed research, I investigate entity information management in the environment of complex networks. The main research question is: how can the EIM tasks be facilitated by modeling the content and structure of complex networks? The research is in the intersection of content based information retrieval and complex network analysis, which deals with both unstructured text data and structured networks. The specific targeting EIM tasks are entity retrieval, entity profiling and entity distillation. In addition to the main research question, the following questions are considered: How can we accomplish a EIM task involving diverse entity and interaction types? How to model the evolution of entity profiles as well as the underlying complex networks? How can the existing cross-language IR work be leveraged to build entity profiles with multi-lingual evidence?
   I propose to use probabilistic models and discriminative models in particular to address the above research questions. In my research, I have developed discriminative models for expert search to integrate arbitrary document features [3] and to learn flexible combination strategies to rank experts in heterogeneous information sources [1]. Discriminative graphical models are proposed to jointly discover homepages by inference on the homepage dependence network [2]. The dependence of table elements is exploited to collectively perform the entity retrieval task [4]. These works have shown the power of discriminative models for entity search and the benefits of utilizing the dependencies among related entities. What I would like to do next is to develop a unified probabilistic framework to investigate the research questions raised in this proposal.
Keywords: entity profiling, entity retrieval, social network analysis
Finding people and their utterances in social media BIBAKFull-Text 918
  Wouter Weerkamp
Since its introduction, social media, "a group of internet-based applications that (...) allow the creation and exchange of user generated content" [1], has attracted more and more users. Over the years, many platforms have arisen that allow users to publish information, communicate with others, connect to like-minded, and share anything a users wants to share. Text-centric examples are mailing lists, forums, blogs, community question answering, collaborative knowledge sources, social networks, and microblogs, with new platforms starting all the time. Given the volume of information available in social media, ways of accessing this information intelligently are needed; this is the scope of my research.
   Why should we care about information in social media? Here are three examples that motivate my interest. (A) Viewpoint research; someone wants to take note of the viewpoints on a particular issue. (B) Answers to problems; many problems have been encountered before, and people have shared solutions. (C) Product development; gaining insight into how people use a product and what features they wish for, eases the development of new products. Looking at these examples of information need in social media, we observe that they revolve not just around relevance in the traditional sense (i.e., objects relevant to a given topic), but also around criteria like credibility, authority, viewpoints, expertise, and experiences. However, these additional aspects are typically conditioned on the topical relevance of information objects.
   In social media, "information objects" come in several types but many are utterances created by people (blog posts, emails, questions, answers, tweets). People and their utterances offer two natural entry points to information contained in social media: utterances that are relevant and people that are of interest. I focus on three tasks in which the interaction between the two is key.
Keywords: information retrieval, social media
Leveraging user-generated content for news search BIBAKFull-Text 919
  Richard M. C. McCreadie
Over the last few years both availability and accessibility of current news stories on the Web have dramatically improved. In particular, users can now access news from a variety of sources hosted on the Web, from newswire presences such as the New York Times, to integrated news search within Web search engines. However, of central interest is the emerging impact that user-generated content (UGC) is having on this online news landscape. Indeed, the emergence of Web 2.0 has turned a static news consumer base into a dynamic news machine, where news stories are summarised and commented upon. In summary, value is being added to each news story in terms of additional content.
   Importantly, however, while there has been movement in commercial circles to exploit this extra value to enrich online news, there has been little research from the academic community on how can be achieved. Indeed, the main purpose of this thesis is to research practical techniques for the integration of UGC to improve the news search component of the most ubiquitous of Web tools, i.e the Web search engine.
Keywords: blogs, news, social media
User centered story tracking BIBAKFull-Text 920
  Ilija Subasic
Using data collections available on the Internet has for many people became the main medium for staying informed about the world. Many of these collections are in nature dynamic, evolving as the subjects they describe change. The goal of different research areas is to identify and highlight these changes to better enable readers to track stories. In this work we restrict ourselves to news collections and investigate "real-life" effectiveness and usability of temporal text mining (TTM) story tracking methods. We propose a new story tracking method and build a tool to support it. Additionally, we investigate the effectiveness and usability of story tracking methods and define a new frameworks for automatic and user oriented evaluation. We built methods and tools which allow for understanding, discovery, and search through user interaction. Although there are many TTM methods developed there is a lack of common evaluation procedure. Therefore, we propose an evaluation framework for measuring how different TTM methods discover novel "facts". Apart from the automatic evaluation we are interested in how can users interact with pattens and learn about the underlying subjects of the story they track. For this purpose we propose a user testing environment that measures speed and accuracy in which users can use story tracking methods to discover predefined sets of ground-truth sentences.
Keywords: evaluation, temporal text mining, visualization
Reverse annotation based retrieval from large document image collections BIBAKFull-Text 921
  A Pramod Sankar K.
A number of projects are dedicated to creating digital libraries from scanned books, such as Google Books, UDL, Digital Library of India (DLI), etc. The ability to search in the content of document images is essential for the usability and popularity of these DLs. In this work, we aim toward building a retrieval system over 120K document images coming from 1000 scanned books of Telugu literature. This is a challenge because: i) OCRs are not robust enough for Indian languages, especially the Telugu script, ii) the document images contain large number of degradations and artifacts, iii) scalability to large collections is hard. Moreover, users expect that the search system accept text queries and retrieve relevant results in interactive times.
   We propose a Reverse Annotation framework [1], that labels word-images by their equivalent text label in the offline phase. Reverse Annotation applies a retrieval based approach to recognition. Unlike traditional annotation/recognition that identifies keywords for data, Reverse Annotation identifies data that corresponds to a given keyword. It first selects a set of keywords which are considered useful for labeling and retrieval, such as those that repeat often, and ignoring stopwords and rare-words. Exemplars are obtained for each word from a crude OCR or human annotations. The labels are then propagated across the rest of the collection by matching words in the image-feature space. Since such a matching is computationally expensive, scalability is achieved using a fast approximate nearest neighbor technique based on Hierarchical K-Means. Once text labels are assigned, each document image is considered a bag-of-words over the labeled keywords. A standard search engine is used to build a search index for quick online retrieval. An example query and the retrieved results are shown in Figure 1. We are unaware of any conventional OCRs which can retrieve such images for the given query.
   There are three major contributions of our work: i) recognizing the entire document collection together, instead of one-at-a-time; this means that the repetition of words in the test set is effectively used for improving accuracy, ii) speeding up recognition by clustering multiple instances of a given word, iii) recognising at the word-level, avoiding the pitfalls of character segmentation and recognition. Other OCR techniques that use word-level context still rely on inaccurate component-level classification.
   Using the techniques developed from this work, we were able to successfully build a retrieval system over our challenging dataset. To the best of our knowledge, this is the largest collection of document images that has been made searchable for any Indian language. Our algorithm is easily scalable to larger collections, and directly applicable to documents from other language scripts.
   The first issue to discuss, is the fraction of word-images that remain unrecognized at the end of the Reverse Annotation phase. Rare-words, nouns etc. are not labeled in the test set. It is important to estimate the cost of not being able to answer such queries. If this cost is indeed high, we need to explore methods to label such infrequently occurring words in the collection. Needless to say, such methods should be computationally efficient without compromising on accuracy.
   The other major issue to discuss is the evaluation of retrieval results. The true recall of the retrieval system cannot be computed, since it is impossible to identify every occurrence of the given query in such large data. Questions to be considered include: whether precision alone is a sufficient indicator of retrieval performance; whether there is some better document-level effectiveness assessment possible; and how best to estimate the relative satisfaction of the user's information need.
Keywords: document images, recognition-free, scalability
Learning hidden variable models for blog retrieval BIBAKFull-Text 922
  Mengqiu Wang
We describe probabilistic models that leverage individual blog post evidence to improve blog seed retrieval performances. Our model offers a intuitive and principled method to combine multiple posts in scoring a whole blog site by treating individual posts as hidden variables. When applied to the seed retrieval task, our model yields state-of-the-art results on the TREC 2007 Blog Distillation Task dataset.
Keywords: blog retrieval, learning to rank, passage retrieval
Investigation on smoothing and aggregation methods in blog retrieval BIBAKFull-Text 923
  Mostafa Keikha
Recently, user generated data is growing rapidly and becoming one of the most important source of information in the web. Blogosphere (the collection of blogs on the web) is one of the main source of information in this category. In my work for my PhD, I mainly focussed on the blog distillation task which is: given a user query find the blogs that are most related to the query topic [3].
   There are some properties of blogs that make blog analysis different from usual text analysis. One of these properties is related to the time stamp assigned to each post; it is possible that the topics of a blog change over the time and this can affect blog relevance to the query. Also each post in a blog can have viewer generated comments that can change the relevance of the blog to the query if these are considered as part of the content of the blog. Another property is related to the meaning of the links between blogs which are different than links between websites. Finally, blog distillation is different from traditional ad-hoc search since the retrieval unit is a blog (a collection of posts), instead of a single document. With this view, blog distillation is similar to the task of resource selection in federated search [1].
   Researchers have applied different methods from similar problems to blog distillation like ad-hoc search methods, expert search algorithms or methods from resource selection in distributed information retrieval.
   Based on our preliminary experiments, I decided to divide the blog distillation problem into two sub-problems. First of all, I want to use mentioned properties of blogs to retrieve the most relevant posts for a given query. This part is very similar to the ad hoc retrieval. After that, I want to aggregate relevance of posts in each blog and calculate relevance of the blog. This part requires the development of a cross-modal aggregation model that combines the different blog relevance clues found in the blogosphere.
   We use structure based smoothing methods for improving posts retrieval. The idea behind these smoothing methods is to change the score of a document based on the score of its similar or related documents. We model the blogosphere as a single graph that represents relations between posts and terms [2]. The idea is that in accordance with the Clustering Hypothesis, related documents should have similar scores for the same query. To model the relatedness between posts, we define a new measure which takes into account both content similarity and temporal distance.
   In more recent work, in the aggregation part of the problem, we model each post as evidence about relevance of a blog to the query, and use aggregation methods like Ordered Weighted Averaging operators to combine the evidence. The ordered weighted averaging operator, commonly called OWA operator, was introduced by Yager [4]. OWA provides a parametrized class of mean type aggregation operators, that can generate OR operator (Max), AND operator (Min) and any other aggregation operator between them.
   For the next steps, I'm thinking about capturing the temporal properties of the blogs. Bloggers can change their interests over the time or write about different topics periodically. Capturing these changes and using them in the retrieval is one the future woks that I'm interested in. Also, studying the relations between blogs and news and their effect on each other is an interesting problem.
Keywords: blog search, temporal analysis, user generated data
Aiming for user experience in information retrieval: towards user-centered relevance (UCR) BIBKFull-Text 924
  Frans van der Sluis; Betsy. van Dijk; Egon L. van den Broek
Keywords: positive affect, relevance, user experience