| Using Intent Information to Model User Behavior in Diversified Search | | BIBA | Full-Text | 1-13 | |
| Aleksandr Chuklin; Pavel Serdyukov; Maarten de Rijke | |||
| A result page of a modern commercial search engine often contains documents of different types targeted to satisfy different user intents (news, blogs, multimedia). When evaluating system performance and making design decisions we need to better understand user behavior on such result pages. To address this problem various click models have previously been proposed. In this paper we focus on result pages containing fresh results and propose a way to model user intent distribution and bias due to different document presentation types. To the best of our knowledge this is the first work that successfully uses intent and layout information to improve existing click models. | |||
| Understanding Relevance: An fMRI Study | | BIBA | Full-Text | 14-25 | |
| Yashar Moshfeghi; Luisa R. Pinto; Frank E. Pollick; Joemon M. Jose | |||
| Relevance is one of the key concepts in Information Retrieval (IR). A huge body of research exists that attempts to understand this concept so as to operationalize it for IR systems. Despite advances in the past few decades, answering the question "How does relevance happen?" is still a big challenge. In this paper, we investigate the connection between relevance and brain activity. Using functional Magnetic Resonance Imaging (fMRI), we measured the brain activity of eighteen participants while they performed four topical relevance assessment tasks on relevant and non-relevant images. The results of this experiment revealed three brain regions in the frontal, parietal and temporal cortex where brain activity differed between processing relevant and non-relevant documents. This is an important step in unravelling the nature of relevance and therefore better utilising it for effective retrieval. | |||
| An Exploratory Study of Sensemaking in Collaborative Information Seeking | | BIBA | Full-Text | 26-37 | |
| Yihan Tao; Anastasios Tombros | |||
| With the ubiquity of current information retrieval systems, users move beyond individual searching to performing complex information seeking tasks together with collaborators for social, leisure or professional purposes. In this paper, we investigate the sensemaking behaviour of online searchers in terms of sensemaking strategies, sharing of information, construction of a shared representation and sharing of task progress and status. We also looked into the support provided to them by search systems in the collaborative information seeking process. We report the results of an observational user study where 24 participants, in groups of 3, completed a travel planning task. Our results show that current tools do not sufficiently support searchers in most aspects of the collaborative sensemaking process. Our findings have implications for the design of collaborative information seeking systems. | |||
| Exploiting User Comments for Audio-Visual Content Indexing and Retrieval | | BIBA | Full-Text | 38-49 | |
| Carsten Eickhoff; Wen Li; Arjen P. de Vries | |||
| State-of-the-art content sharing platforms often require users to assign tags to pieces of media in order to make them easily retrievable. Since this task is sometimes perceived as tedious or boring, annotations can be sparse. Commenting on the other hand is a frequently used means of expressing user opinion towards shared media items. This work makes use of time series analyses in order to infer potential tags and indexing terms for audio-visual content from user comments. In this way, we mitigate the vocabulary gap between queries and document descriptors. Additionally, we show how large-scale encyclopaedias such as Wikipedia can aid the task of tag prediction by serving as surrogates for high-coverage natural language vocabulary lists. Our evaluation is conducted on a corpus of several million real-world user comments from the popular video sharing platform YouTube, and demonstrates significant improvements in retrieval performance. | |||
| An Evaluation of Labelling-Game Data for Video Retrieval | | BIBA | Full-Text | 50-61 | |
| Riste Gligorov; Michiel Hildebrand; Jacco van Ossenbruggen; Lora Aroyo; Guus Schreiber | |||
| Games with a purpose (GWAPs) are increasingly used in audio-visual collections as a mechanism for annotating videos through tagging. This trend is driven by the assumption that user tags will improve video search. In this paper we study whether this is indeed the case. To this end, we create an evaluation dataset that consists of: (i) a set of videos tagged by users via a video labelling game, (ii) a set of queries derived from real-life query logs, and (iii) relevance judgements. Besides user tags from the labelling game, we exploit the existing metadata associated with the videos (textual descriptions and curated in-house tags) and closed captions. Our findings show that search based on user tags alone outperforms search based on all other metadata types. Combining user tags with the other types of metadata yields an increase in search performance of 33%. We also find that the search performance of user tags steadily increases as more tags are collected. | |||
| Multimodal Re-ranking of Product Image Search Results | | BIBAK | Full-Text | 62-73 | |
| Joyce M. dos Santos; João M. B. Cavalcanti; Patricia C. Saraiva; Edleno S. de Moura | |||
| In this article we address the problem of searching for products using an
image as query, instead of the more popular approach of searching by textual
keywords. With the fast development of the Internet, the popularization of
mobile devices and e-commerce systems, searching specific products by image has
become an interesting research topic. In this context, Content-Based Image
Retrieval (CBIR) techniques have been used to support and enhance the customer
shopping experience. We propose an image re-ranking strategy based on
multimedia information available on product databases. Our re-ranking strategy
relies on category and textual information associated with the top-k images of an
initial ranking computed purely with CBIR techniques. Experiments were carried
out with users' relevance judgments on two image datasets collected from
e-commerce Web sites. Our results show that our re-ranking strategy outperforms
baselines that use only CBIR techniques. Keywords: Image re-ranking; Product visual search; E-commerce | |||
| Predicting Information Diffusion in Social Networks Using Content and User's Profiles | | BIBA | Full-Text | 74-85 | |
| Cédric Lagnier; Ludovic Denoyer; Eric Gaussier; Patrick Gallinari | |||
| Predicting the diffusion of information on social networks is a key problem for applications like Opinion Leader Detection, Buzz Detection or Viral Marketing. Many recent diffusion models are direct extensions of the Cascade and Threshold models, initially proposed for epidemiology and social studies. In such models, the diffusion process is based on the dynamics of interactions between neighbor nodes in the network (the social pressure), and largely ignores important dimensions such as the content of the piece of information diffused. We propose here a new family of probabilistic models that aims at predicting how content diffuses in a network by making use of additional dimensions: the content of the piece of information diffused, the user's profile and willingness to diffuse. These models are illustrated and compared with other approaches on two blog datasets. The experimental results obtained on these datasets show that taking into account the content of the piece of information diffused is important to accurately model the diffusion process. | |||
| How Tagging Pragmatics Influence Tag Sense Discovery in Social Annotation Systems | | BIBA | Full-Text | 86-97 | |
| Thomas Niebler; Philipp Singer; Dominik Benz; Christian Körner; Markus Strohmaier; Andreas Hotho | |||
| The presence of emergent semantics in social annotation systems has been reported in numerous studies. Two important problems in this context are the induction of semantic relations among tags and the discovery of different senses of a given tag. While a number of approaches for discovering tag senses exist, little is known about which factors influence the discovery process. In this paper, we analyze the influence of user pragmatic factors. We divide taggers into different pragmatic distinctions. Based on these distinctions, we identify subsets of users whose annotations allow for a more precise and complete discovery of tag senses. Our results provide evidence for a link between tagging pragmatics and semantics and provide another argument for including pragmatic factors in semantic extraction methods. Our work is relevant for improving search, retrieval and browsing in social annotation systems, as well as for optimizing ontology learning algorithms based on tagging data. | |||
| A Unified Framework for Monolingual and Cross-Lingual Relevance Modeling Based on Probabilistic Topic Models | | BIBAK | Full-Text | 98-109 | |
| Ivan Vulić; Marie-Francine Moens | |||
| We explore the potential of probabilistic topic modeling within the
relevance modeling framework for both monolingual and cross-lingual ad-hoc
retrieval. Multilingual topic models provide a way to represent documents in a
structured and coherent way, regardless of their actual language, by means of
language-independent concepts, that is, cross-lingual topics. We show how to
integrate the topical knowledge into a unified relevance modeling framework in
order to build quality retrieval models in monolingual and cross-lingual
contexts. The proposed modeling framework processes all documents uniformly and
does not make any conceptual distinction between monolingual and cross-lingual
modeling. Our results obtained from the experiments conducted on the standard
CLEF test collections reveal that fusing the topical knowledge and relevance
modeling leads to building monolingual and cross-lingual retrieval models that
outperform several strong baselines. We show that the topical knowledge
coming from a general Web-generated corpus boosts retrieval scores.
Additionally, we show that within this framework the estimation of
cross-lingual relevance models may be performed by exploiting only a general
non-parallel corpus. Keywords: Cross-lingual information retrieval; relevance models; multilingual topic
models; probabilistic retrieval models; comparable multilingual corpora | |||
| Semantic Search Log k-Anonymization with Generalized k-Cores of Query Concept Graph | | BIBA | Full-Text | 110-121 | |
| Claudio Carpineto; Giovanni Romano | |||
| Search log k-anonymization is based on the elimination of infrequent queries under exact (or nearly exact) matching conditions, which usually results in a large data loss and impaired utility. We present a more flexible, semantic approach to k-anonymity that consists of three steps: query concept mining, automatic query expansion, and affinity assessment of expanded queries. Based on the observation that many infrequent queries can be seen as refinements of a more general frequent query, we first model query concepts as probabilistically weighted n-grams and extract them from the search log data. Then, after expanding the original log queries with their weighted concepts, we find all the k-affine expanded queries under a given affinity threshold Θ, modeled as a generalized k-core of the graph of Θ-affine queries. Experimenting with the AOL data set, we show that this approach achieves levels of privacy comparable to those of plain k-anonymity while greatly reducing data loss. | |||
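The generalized k-core used in the final step can be illustrated with the standard k-core computation: vertices (here, queries) of degree below k are iteratively peeled from the affinity graph until every remaining query has at least k Θ-affine neighbours. A minimal sketch, assuming a plain adjacency-dict representation (the paper's generalized variant and affinity computation are not reproduced here):

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph given
    as an adjacency dict {vertex: set_of_neighbours}, by repeatedly
    peeling vertices of degree < k until none remain."""
    # Work on a copy so the input graph is left untouched.
    adj = {v: set(ns) for v, ns in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                # Remove v and all edges incident to it.
                for n in adj[v]:
                    adj[n].discard(v)
                del adj[v]
                changed = True
    return set(adj)
```

In the anonymization setting, each surviving vertex of the k-core is an expanded query with at least k Θ-affine companions, so releasing it does not single out any individual query.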
| A Joint Classification Method to Integrate Scientific and Social Networks | | BIBA | Full-Text | 122-133 | |
| Mahmood Neshati; Ehsaneddin Asgari; Djoerd Hiemstra; Hamid Beigy | |||
| In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks. Utilizing several name similarity patterns and contextual properties of these networks, we design a focused crawler to find highly probable matching pairs; the problem of name disambiguation is then reduced to predicting the label of each candidate pair as a true or false match. By defining a matching dependency graph, we propose a joint label prediction model to determine the labels of all candidate pairs simultaneously. An extensive set of experiments has been conducted on six test collections obtained from the DBLP and Twitter networks to show the effectiveness of the proposed joint label prediction model. | |||
| Using Document-Quality Measures to Predict Web-Search Effectiveness | | BIBAK | Full-Text | 134-145 | |
| Fiana Raiber; Oren Kurland | |||
| The query-performance prediction task is estimating retrieval effectiveness
in the absence of relevance judgments. The task becomes highly challenging over
the Web due to, among other reasons, the effect of low quality (e.g., spam)
documents on retrieval performance. To address this challenge, we present a
novel prediction approach that utilizes query-independent document-quality
measures. While using these measures was shown to improve Web-retrieval
effectiveness, this is the first study demonstrating the clear merits of using
them for query-performance prediction. Evaluation performed with large scale
Web collections shows that our methods post prediction quality that often
surpasses that of state-of-the-art predictors, including those devised
specifically for Web retrieval. Keywords: query-performance prediction; Web retrieval | |||
| Training Efficient Tree-Based Models for Document Ranking | | BIBA | Full-Text | 146-157 | |
| Nima Asadi; Jimmy Lin | |||
| Gradient-boosted regression trees (GBRTs) have proven to be an effective solution to the learning-to-rank problem. This work proposes and evaluates techniques for training GBRTs that have efficient runtime characteristics. Our approach is based on the simple idea that compact, shallow, and balanced trees yield faster predictions: thus, it makes sense to incorporate some notion of execution cost during training to "encourage" trees with these topological characteristics. We propose two strategies for accomplishing this: the first, by directly modifying the node splitting criterion during tree induction, and the second, by stagewise tree pruning. Experiments on a standard learning-to-rank dataset show that the pruning approach is superior; one balanced setting yields an approximately 40% decrease in prediction latency with minimal reduction in output quality as measured by NDCG. | |||
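The first strategy, modifying the node-splitting criterion to "encourage" compact and balanced trees, can be sketched as a regularized split gain. The exact penalty form used by the authors is not given in this abstract, so the depth and imbalance terms below, and the weights `alpha` and `beta`, are illustrative assumptions:

```python
def penalized_gain(gain, depth, left_n, right_n, alpha=0.1, beta=0.1):
    """Split gain discounted by execution-cost proxies: node depth
    (deeper splits lengthen prediction paths) and imbalance between
    the two children (balanced trees are cheaper to traverse on
    average). alpha and beta are illustrative trade-off weights."""
    imbalance = abs(left_n - right_n) / (left_n + right_n)
    return gain - alpha * depth - beta * imbalance
```

During tree induction, candidate splits would be compared on `penalized_gain` instead of the raw impurity reduction, so a slightly worse but shallower or better-balanced split can win.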
| DTD Based Costs for Tree-Edit Distance in Structured Information Retrieval | | BIBA | Full-Text | 158-170 | |
| Cyril Laitang; Karen Pinel-Sauvagnat; Mohand Boughanem | |||
| In this paper we present a Structured Information Retrieval (SIR) model based on graph matching. Our approach combines content propagation, which handles sibling relationships, with a document-query structure matching process. The latter is based on Tree-Edit Distance (TED), which is the minimum-cost set of insert, delete, and replace operations needed to turn one tree into another. To our knowledge this algorithm has never been used in ad-hoc SIR. As the effectiveness of TED relies on both the input tree and the edit costs, we first present a focused subtree extraction technique which selects the most representative elements of the document w.r.t. the query. We then describe our TED cost settings based on the Document Type Definition (DTD). Finally we discuss our results according to the type of the collection (data-oriented or text-oriented). Experiments are conducted on two INEX test sets: the 2010 Datacentric collection and the 2005 Ad-hoc one. | |||
| Ranked Accuracy and Unstructured Distributed Search | | BIBAK | Full-Text | 171-182 | |
| Sami Richardson; Ingemar J. Cox | |||
| Non-uniformly distributing documents in an unstructured peer-to-peer (P2P)
network has been shown to improve both the expected search length and search
accuracy, where accuracy is defined as the size of the intersection of the
documents retrieved by a constrained, probabilistic search and the documents
that would have been retrieved by an exhaustive search, normalized by the size
of the latter. However, neither metric considers the relative ranking of the
documents in the retrieved sets. We therefore introduce a new performance
metric, rank-accuracy, that is a rank weighted score of the top-k documents
retrieved. By replicating documents across nodes based on their retrieval rate
(a function of query frequency) and rank, we show that average rank-accuracy
can be improved. The practical performance of rank-aware search is demonstrated
using a simulated network of 10,000 nodes and queries drawn from a Yahoo! web
search log. Keywords: Unstructured P2P Network; Probabilistic Retrieval | |||
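The accuracy measure defined in this abstract, and a rank-weighted variant of it, can be sketched as follows. The reciprocal-rank weighting is an illustrative assumption, since the paper's exact weighting scheme is not given here:

```python
def accuracy(retrieved, exhaustive):
    """Size of the intersection of the probabilistic and exhaustive
    result sets, normalized by the size of the exhaustive set."""
    exhaustive = list(exhaustive)
    hit = len(set(retrieved) & set(exhaustive))
    return hit / len(exhaustive)

def rank_accuracy(retrieved, exhaustive, k=10):
    """Rank-weighted score over the top-k documents: a retrieved
    document contributes the weight of its rank in the exhaustive
    ranking, so missing a top-ranked document costs more than missing
    a low-ranked one. Reciprocal-rank weights are an assumption."""
    weights = {doc: 1.0 / (rank + 1)
               for rank, doc in enumerate(list(exhaustive)[:k])}
    total = sum(weights.values())
    got = sum(weights.get(doc, 0.0) for doc in retrieved[:k])
    return got / total
```

Under this weighting, retrieving only the second-ranked of two documents scores 1/3 rather than the 1/2 that plain accuracy would report.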
| Learning to Rank from Structures in Hierarchical Text Classification | | BIBA | Full-Text | 183-194 | |
| Qi Ju; Alessandro Moschitti; Richard Johansson | |||
| In this paper, we model learning to rank algorithms based on structural dependencies in hierarchical multi-label text categorization (TC). Our method uses the classification probabilities of the binary classifiers of a standard top-down approach to generate k-best hypotheses. The latter are generated according to their global probability while at the same time satisfying the structural constraints between parent and child nodes. The ranking is then refined using Support Vector Machines and tree kernels applied to a structural representation of hypotheses, i.e., a hierarchy tree in which the outcome of binary one-vs-all classifiers is directly marked in its nodes. Our extensive experiments on the whole Reuters Corpus Volume 1 show that our models significantly improve over the state of the art in TC, thanks to the use of structural dependencies. | |||
| Folktale Classification Using Learning to Rank | | BIBA | Full-Text | 195-206 | |
| Dong Nguyen; Dolf Trieschnigg; Mariët Theune | |||
| We present a learning to rank approach to classify folktales, such as fairy tales and urban legends, according to their story type, a concept that is widely used by folktale researchers to organize and classify folktales. A story type represents a collection of similar stories often with recurring plot and themes. Our work is guided by two frequently used story type classification schemes. Contrary to most information retrieval problems, the text similarity in this problem goes beyond topical similarity. We experiment with approaches inspired by distributed information retrieval and features that compare subject-verb-object triplets. Our system was found to be highly effective compared with a baseline system. | |||
| Open-Set Classification for Automated Genre Identification | | BIBAK | Full-Text | 207-217 | |
| Dimitrios A. Pritsos; Efstathios Stamatatos | |||
| Automated Genre Identification (AGI) of web pages is a problem of increasing
importance since web genre (e.g. blog, news, e-shops, etc.) information can
enhance modern Information Retrieval (IR) systems. The state-of-the-art in this
field considers AGI as a closed-set classification problem where a variety of
web page representation and machine learning models have been intensively studied.
In this paper, we study AGI as an open-set classification problem which better
formulates the real world conditions of exploiting AGI in practice. Focusing on
the use of content information, different text representation methods (words
and character n-grams) are tested. Moreover, two classification methods are
examined, one-class SVM learners, used as a baseline, and an ensemble of
classifiers based on random feature subspacing, originally proposed for author
identification. It is demonstrated that very high precision can be achieved in
open-set AGI while recall remains relatively high. Keywords: Automated Genre Identification; Classifier Ensembles; One Class SVM | |||
| Semantic Tagging of Places Based on User Interest Profiles from Online Social Networks | | BIBAK | Full-Text | 218-229 | |
| Vinod Hegde; Josiane Xavier Parreira; Manfred Hauswirth | |||
| In recent years, location based services (LBS) have become very popular. The
performance of LBS depends on a number of factors, including how well the places
are described. Though LBS enable users to tag places, users rarely do so. On
the other hand, users express their interests via online social networks. The
common interests of a group of people that has visited a particular place can
potentially provide further description for that place. In this work we present
an approach that automatically assigns tags to places, based on interest
profiles and visits or check-ins of users at places. We have evaluated our
approach with real world datasets from popular social network services against
a set of manually assigned tags. Experimental results show that we are able to
derive meaningful tags for different places and that sets of tags assigned to
places are expected to stabilise as more unique users visit places. Keywords: place tagging; recommendation systems; data mining; online social networks;
location based services | |||
| Sponsored Search Ad Selection by Keyword Structure Analysis | | BIBAK | Full-Text | 230-241 | |
| Kai Hui; Bin Gao; Ben He; Tie-jian Luo | |||
| In sponsored search, the ad selection algorithm is used to pick out the best
candidate ads for ranking, the bid keywords of which are best matched to the
user queries. Existing ad selection methods mainly focus on the relevance
between user query and selected ads, and consequently the monetization ability
of the results is not necessarily maximized. To address this, instead of making
selections based on keywords as a whole, our work takes advantage of the
different impacts, as revealed in our data study, of different components
inside the keywords on both relevance and monetization ability. In particular,
we select keyword components and then maximize the relevance and revenue on the
component level. Finally, we combine the selected components to generate the
bid keywords. The experiments reveal that our method can significantly
outperform two baseline algorithms on the metrics including recall, precision
and the monetization ability. Keywords: ad selection; entity relationship; sponsored search | |||
| Intent-Based Browse Activity Segmentation | | BIBA | Full-Text | 242-253 | |
| Yury Ustinovskiy; Anna Mazur; Pavel Serdyukov | |||
| Users' search and browse activity, mined with browser toolbars, is known to provide diverse and valuable information for a search engine. In particular, it helps to understand the information need of a searcher, her personal preferences, and the context of the topic she is currently interested in. Most previous studies on the topic either considered the whole user activity for a fixed period of time or divided it using some predefined inactivity time-out, which helps to identify groups of web sites visited with the same information need. This paper addresses the problem of automatically segmenting users' browsing logs into logical segments. We propose a method for automatically dividing their daily activity into intent-related parts, advancing the commonly used approaches. We propose several methods for browsing log partitioning and provide a detailed study of their performance. We evaluate all algorithms and analyse the contributions of various types of features. | |||
| Extracting Event-Related Information from Article Updates in Wikipedia | | BIBA | Full-Text | 254-266 | |
| Mihai Georgescu; Nattiya Kanhabua; Daniel Krause; Wolfgang Nejdl; Stefan Siersdorfer | |||
| Wikipedia is widely considered the largest and most up-to-date online encyclopedia, with its content being continuously maintained by a supporting community. In many cases, real-life events like new scientific findings, resignations, deaths, or catastrophes serve as triggers for collaborative editing of articles about affected entities such as persons or countries. In this paper, we conduct an in-depth analysis of event-related updates in Wikipedia by examining different indicators for events including language, meta annotations, and update bursts. We then study how these indicators can be employed for automatically detecting event-related updates. Our experiments on event extraction, clustering, and summarization show promising results towards generating entity-specific news tickers and timelines. | |||
| Using WordNet Hypernyms and Dependency Features for Phrasal-Level Event Recognition and Type Classification | | BIBAK | Full-Text | 267-278 | |
| Yoonjae Jeong; Sung-Hyon Myaeng | |||
| The goal of this research is to devise a method for recognizing and
classifying TimeML events in a more effective way. TimeML is the most recent
annotation scheme for processing the event and temporal expressions in natural
language processing fields. In this paper, we argue and demonstrate that unit
feature dependency information and deep-level WordNet hypernyms are useful for
event recognition and type classification. The proposed method utilizes various
features including lexical semantic and dependency-based combined features. The
experimental results show that our proposed method outperforms a
state-of-the-art approach, mainly due to these new strategies. In particular,
the performance on noun and adjective events, which have been largely ignored
despite their significance, is markedly improved. Keywords: Event Recognition; Event Type Classification; TimeML; TimeBank; WordNet;
Combined Features | |||
| Aggregating Evidence from Hospital Departments to Improve Medical Records Search | | BIBAK | Full-Text | 279-291 | |
| Nut Limsopatham; Craig Macdonald; Iadh Ounis | |||
| Searching medical records is challenging due to their inherent implicit
knowledge -- such knowledge may be known by medical practitioners, but it is
hidden from an information retrieval (IR) system. For example, it is intuitive
for a medical practitioner to assert that patients with heart disease are
likely to have records from the hospital's cardiology department. Hence, we
hypothesise that this implicit knowledge can be used to enhance a medical
records search system that ranks patients based on the relevance of their
medical records to a query. In this paper, we propose to group aggregates of
medical records from individual hospital departments, which we refer to as
department-level evidence, to capture some of the implicit knowledge. In
particular, each department-level aggregate consists of all of the medical
records created by a particular hospital department, which is then exploited to
enhance retrieval effectiveness. Specifically, we propose two approaches to
build the department-level evidence based on a federated search and a voting
paradigm, respectively. In addition, we introduce an extended voting technique
that could leverage this department-level evidence while ranking. We evaluate
the retrieval effectiveness of our approaches in the context of the TREC 2011
Medical Records track. Our results show that modelling department-level
evidence of records in medical records search improves retrieval effectiveness.
In particular, our proposed approach to leverage department-level evidence
built using a voting technique obtains results comparable to the best submitted
TREC 2011 Medical Records track systems without requiring any external
resources that are exploited in those systems. Keywords: Medical Records Search; Corpus Structure | |||
| An N-Gram Topic Model for Time-Stamped Documents | | BIBAK | Full-Text | 292-304 | |
| Shoaib Jameel; Wai Lam | |||
| This paper presents a topic model that captures the temporal dynamics in the
text data along with topical phrases. Previous approaches have relied upon the
bag-of-words assumption to model this property in a corpus, which has resulted
in inferior performance and less interpretable topics. Our topic model not
only captures the way a topic's structure changes over time but
also maintains important contextual information in the text data. Finding
topical n-grams, when possible based on context, instead of always presenting
unigrams in topics does away with many ambiguities that individual words may
carry. We derive a collapsed Gibbs sampler for posterior inference. Our
experimental results show an improvement over the current state-of-the-art
topics over time model. Keywords: Topic model; Bayesian inference; Collapsed Gibbs sampling; N-gram words;
topics over time; temporal data | |||
| Influence of Timeline and Named-Entity Components on User Engagement | | BIBA | Full-Text | 305-317 | |
| Yashar Moshfeghi; Michael Matthews; Roi Blanco; Joemon M. Jose | |||
| Nowadays, successful applications are those which contain features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies where the importance of these components was studied from a retrieval effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate whether we can predict user-centred metrics from the user's interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings can inform the design of systems that provide a more personalised user experience, tailored to the user's preferences. | |||
| Cognitive Temporal Document Priors | | BIBA | Full-Text | 318-330 | |
| Maria-Hendrike Peetz; Maarten de Rijke | |||
| Temporal information retrieval exploits temporal features of document collections and queries. Temporal document priors are used to adjust the score of a document based on its publication time. We consider a class of temporal document priors that is inspired by retention functions considered in cognitive psychology that are used to model the decay of memory. Many such functions used as a temporal document prior have a positive effect on overall retrieval performance. We examine the stability of this effect across news and microblog collections and discover interesting differences between retention functions. We also study the problem of optimizing parameters of the retention functions as temporal document priors; some retention functions display consistent good performance across large regions of the parameter space. A retention function based on a Weibull distribution is the preferred choice for a temporal document prior. | |||
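The Weibull-based retention function favoured in this abstract can serve as a temporal document prior as sketched below; the parameter values and the log-space combination with the query score are illustrative assumptions, not the paper's tuned settings:

```python
import math

def weibull_prior(age_days, scale=30.0, shape=0.7):
    """Weibull retention function S(t) = exp(-(t/scale)**shape), used
    as a temporal document prior: a document published now gets a
    prior of 1, and the prior decays towards 0 with age. The scale
    and shape values here are illustrative, not tuned."""
    return math.exp(-((age_days / scale) ** shape))

def temporal_score(query_log_score, age_days):
    """Adjust a document's log retrieval score by adding the log of
    the temporal prior -- the usual language-modelling combination
    of a query likelihood with a document prior."""
    return query_log_score + math.log(weibull_prior(age_days))
```

With shape below 1 the decay is steep at first and flattens out, which matches the memory-retention curves from cognitive psychology that motivate the prior.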
| Combining Recency and Topic-Dependent Temporal Variation for Microblog Search | | BIBA | Full-Text | 331-343 | |
| Taiki Miyanishi; Kazuhiro Seki; Kuniaki Uehara | |||
| The appearance of microblogging services has led to many short documents being issued by crowds of people. To retrieve useful information from among such a huge quantity of messages, query expansion (QE) is usually used to enrich a user query. Some QE methods for microblog search utilize temporal properties (e.g., recency and temporal variation) derived from the real-time characteristic that many messages are posted by users when an interesting event has recently occurred. Our approach leverages temporal properties for QE and combines them according to the temporal variation of a given topic. Experimental results show that this QE method using automatically combined temporal properties is effective at improving retrieval performance. | |||
| Subjectivity Annotation of the Microblog 2011 Realtime Adhoc Relevance Judgments | | BIBA | Full-Text | 344-355 | |
| Georgios Paltoglou; Kevan Buckley | |||
| In this work, we extend the Microblog dataset with subjectivity annotations. Our aim is twofold; first, we want to provide a high-quality, multiply-annotated gold standard of subjectivity annotations for the relevance assessments of the real-time adhoc task. Second, we randomly sample the rest of the dataset and annotate it for subjectivity once, in order to create a complementary annotated dataset that is at least an order of magnitude larger than the gold standard. As a result we have 2,389 tweets that have been annotated by multiple humans and 75,761 tweets that have been annotated by one annotator. We discuss issues like inter-annotator agreement, the time that it took annotators to classify tweets in correlation to their subjective content and lastly, the distribution of subjective tweets in relation to topic categorization. The annotated datasets and all relevant anonymised information are freely available for research purposes. | |||
| Geo-spatial Event Detection in the Twitter Stream | | BIBAK | Full-Text | 356-367 | |
| Maximilian Walther; Michael Kaisser | |||
| The rise of Social Media services in recent years has created huge streams of information that can be very valuable in a variety of scenarios. What precisely these scenarios are and how the data streams can efficiently be analyzed for each scenario is still largely unclear at this point in time and has therefore created significant interest in industry and academia. In this paper, we describe a novel algorithm for geo-spatial event detection on Social Media streams. We monitor all posts on Twitter issued in a given geographic region and identify places that show a high amount of activity. In a second processing step, we analyze the resulting spatio-temporal clusters of posts with a Machine Learning component in order to detect whether they constitute real-world events or not. We show that this can be done with high precision and recall. The detected events are finally displayed to a user on a map, at the location where they happen and while they happen. Keywords: Social Media Analytics; Event Detection; Twitter | |||
| A Versatile Tool for Privacy-Enhanced Web Search | | BIBA | Full-Text | 368-379 | |
| Avi Arampatzis; George Drosatos; Pavlos S. Efraimidis | |||
| We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the original user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search and investigate the effectiveness of the proposed solution with a set of real queries on a large web collection. Experiments show great improvements in retrieval effectiveness over a previously reported baseline in the literature. Furthermore, the methods are more versatile, predictably-behaved, applicable to a wider range of information needs, and the privacy they provide is more comprehensible to the end-user. | |||
| Exploiting Novelty and Diversity in Tag Recommendation | | BIBAK | Full-Text | 380-391 | |
| Fabiano Belém; Eder Martins; Jussara Almeida; Marcos Gonçalves | |||
| The design and evaluation of tag recommendation methods have focused only on relevance. However, other aspects such as novelty and diversity may be as important to evaluate the usefulness of the recommendations. In this work, we define these two aspects in the context of tag recommendation and propose a novel recommendation strategy that considers them jointly with relevance. This strategy extends a state-of-the-art method based on Genetic Programming to include novelty and diversity metrics both as attributes and as part of the objective function. We evaluate the proposed strategy using data collected from 3 popular Web 2.0 applications: LastFM, YouTube and YahooVideo. Our experiments show that our strategy outperforms the state-of-the-art alternative in terms of novelty and diversity, without harming relevance. Keywords: Tag Recommendation; Relevance; Novelty; Diversity | |||
| Example Based Entity Search in the Web of Data | | BIBA | Full-Text | 392-403 | |
| Marc Bron; Krisztian Balog; Maarten de Rijke | |||
| The scale of today's Web of Data motivates the use of keyword search-based approaches to entity-oriented search tasks in addition to traditional structure-based approaches, which require users to have knowledge of the underlying schema. We propose an alternative structure-based approach that makes use of example entities and compare its effectiveness with a text-based approach in the context of an entity list completion task. We find that both the text and structure-based approaches are effective in retrieving relevant entities, but that they find different sets of entities. Additionally, we find that the performance of the structure-based approach is dependent on the quality and number of example entities given. We experiment with a number of hybrid techniques that balance between the two approaches and find that a method that uses the example entities to determine the weights of approaches in the combination on a per query basis is most effective. | |||
| A Fast Generative Spell Corrector Based on Edit Distance | | BIBA | Full-Text | 404-410 | |
| Ishan Chattopadhyaya; Kannappan Sirchabesan; Krishanu Seal | |||
| One of the main challenges in the implementation of web-scale online search systems is the disambiguation of the user input when portions of the input queries are possibly misspelt. Spell correctors that must be integrated with such systems have very stringent restrictions imposed on them; primarily, they must possess the ability to handle a large volume of concurrent queries and generate relevant spelling suggestions at a very high speed. Often, these systems consist of high-end server machines with ample memory and processing power, and the requirement from such spell correctors is to reduce the latency of generating suggestions to a bare minimum.
In this paper, we present a spell corrector that we developed to cater to high volume incoming queries for an online search service. It consists of a fast, per-token candidate generator which generates spell suggestions within a distance of two edit operations of an input token. We compare its performance against an n-gram based spell corrector and show that the presented spell candidate generation approach has lower response times. | |||
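The per-token candidate generation this entry describes can be sketched along the following lines. This is an illustrative Norvig-style generator, not the authors' implementation; the function names and the lexicon-filtering step are assumptions:

```python
# Sketch of a per-token candidate generator covering suggestions within
# two edit operations of an input token (illustrative, not the paper's code).
import string

def edits1(token):
    """All strings one insert/delete/replace/transpose away from `token`."""
    letters = string.ascii_lowercase
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def candidates(token, lexicon):
    """Spell suggestions within two edits, filtered against a known lexicon."""
    if token in lexicon:
        return {token}
    near = edits1(token) & lexicon          # distance-1 candidates first
    if near:
        return near
    # fall back to distance-2 candidates only when nothing closer exists
    return {e2 for e1 in edits1(token) for e2 in edits1(e1)} & lexicon
```

In a production setting like the one the abstract targets, the distance-2 enumeration would typically be precomputed or pruned (e.g. deletion-only indexing) to meet the latency requirements mentioned above.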
| Being Confident about the Quality of the Predictions in Recommender Systems | | BIBA | Full-Text | 411-422 | |
| Sergio Cleger-Tamayo; Juan M. Fernández-Luna; Juan F. Huete; Nava Tintarev | |||
| Recommender systems suggest new items to users to try or buy based on their previous preferences or behavior. Many times the information used to recommend these items is limited. An explanation such as "I believe you will like this item, but I do not have enough information to be fully confident about it." may mitigate the issue, but can also damage user trust because it alerts users to the fact that the system might be wrong. The findings in this paper suggest that there is a way of modelling recommendation confidence that is related to accuracy (MAE, RMSE and NDCG) and user rating behaviour (rated vs unrated items). In particular, it was found that unrated items have lower confidence compared to the entire item set -- highlighting the importance of explanations for novel but risky recommendations. | |||
| Two-Stage Learning to Rank for Information Retrieval | | BIBA | Full-Text | 423-434 | |
| Van Dang; Michael Bendersky; W. Bruce Croft | |||
| Current learning to rank approaches commonly focus on learning the best possible ranking function given a small fixed set of documents. This document set is often retrieved from the collection using a simple unsupervised bag-of-words method, e.g. BM25. This can potentially lead to learning a sub-optimal ranking, since many relevant documents may be excluded from the initially retrieved set. In this paper we propose a novel two-stage learning framework to address this problem. We first learn a ranking function over the entire retrieval collection using a limited set of textual features including weighted phrases, proximities and expansion terms. This function is then used to retrieve the best possible subset of documents over which the final model is trained using a larger set of query- and document-dependent features. Empirical evaluation using two web collections unequivocally demonstrates that our proposed two-stage framework, being able to learn its model from more relevant documents, outperforms current learning to rank approaches. | |||
| Hybrid Query Scheduling for a Replicated Search Engine | | BIBAK | Full-Text | 435-446 | |
| Ana Freire; Craig Macdonald; Nicola Tonellotto; Iadh Ounis; Fidel Cacheda | |||
| Search engines use replication and distribution of large indices across many query servers to achieve efficient retrieval. Under high query load, queries can be scheduled to replicas that are expected to be idle soonest, facilitated by the use of predicted query response times. However, the overhead of making response time predictions can hinder the usefulness of query scheduling under low query load. In this paper, we propose a hybrid scheduling approach that combines the scheduling methods appropriate for both low and high load conditions, and can adapt in response to changing conditions. We deploy a simulation framework, which is prepared with actual and predicted response times for real Web search queries for one full day. Our experiments using different numbers of shards and replicas of the 50 million document ClueWeb09 corpus show that hybrid scheduling can reduce the average waiting times of one day of queries by 68% under high load conditions and by 7% under low load conditions w.r.t. traditional scheduling methods. Keywords: Query Efficiency Prediction; Query Scheduling; Distributed Search Engines | |||
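The switch between load regimes that this entry describes can be illustrated with a minimal sketch. This is not the authors' algorithm; the class, the backlog threshold, and the round-robin fallback are assumptions chosen to show the idea of avoiding prediction overhead at low load:

```python
# Illustrative hybrid scheduler: cost-aware assignment under high load,
# cheap round-robin under low load (a sketch, not the paper's method).
import itertools

class HybridScheduler:
    def __init__(self, busy_until, high_load_threshold):
        self.busy_until = busy_until          # per-replica busy-until times
        self.threshold = high_load_threshold  # queued work counting as "high load"
        self._rr = itertools.cycle(range(len(busy_until)))

    def assign(self, now, predicted_cost):
        backlog = sum(max(0.0, t - now) for t in self.busy_until)
        if backlog >= self.threshold:
            # high load: send the query to the replica expected idle soonest
            r = min(range(len(self.busy_until)), key=lambda i: self.busy_until[i])
        else:
            # low load: skip prediction-based selection, just rotate replicas
            r = next(self._rr)
        start = max(now, self.busy_until[r])
        self.busy_until[r] = start + predicted_cost
        return r
```

The design point mirrors the abstract: response-time prediction pays off only when queries actually queue, so the scheduler falls back to a prediction-free policy when the backlog is small.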
| Latent Factor BlockModel for Modelling Relational Data | | BIBA | Full-Text | 447-458 | |
| Sheng Gao; Ludovic Denoyer; Patrick Gallinari; Jun Guo | |||
| In this paper we address the problem of modelling relational data, which has appeared in many applications such as social network analysis, recommender systems and bioinformatics. Previous studies either consider latent feature based models for link prediction in relational data while disregarding local structure in the network, or focus exclusively on capturing the network structure of objects based on latent blockmodels without coupling them with latent characteristics of objects to avoid redundant information. To combine the benefits of the previous work, we model the relational data as a function of both latent feature factors and latent cluster memberships of objects via our proposed Latent Factor BlockModel (LFBM) to collectively discover globally predictive intrinsic properties of objects and capture the latent block structure. We also develop an optimization transfer algorithm to learn the latent factors. Extensive experiments on the synthetic data and several real world datasets suggest that our proposed LFBM model outperforms the state-of-the-art approaches for modelling the relational data. | |||
| Estimation of the Collection Parameter of Information Models for IR | | BIBAK | Full-Text | 459-470 | |
| Parantapa Goswami; Eric Gaussier | |||
| In this paper we explore various methods to estimate the collection parameter of the information-based models for ad hoc information retrieval. In previous studies, this parameter was set to the average number of documents where the word under consideration appears. We introduce here a fully formalized estimation method for both the log-logistic and the smoothed power law models that leads to improved versions of these models in IR. Furthermore, we show that the previous setting of the collection parameter of the log-logistic model is a special case of the estimated value proposed here. Keywords: IR Theory; Information Models; Estimation of Parameters | |||
| Increasing Stability of Result Organization for Session Search | | BIBA | Full-Text | 471-482 | |
| Dongyi Guan; Hui Yang | |||
| Search result clustering (SRC) organizes search results into labeled hierarchical structures as an "information lay-of-land", providing users an overview and helping them quickly locate relevant information from piles of search results. Hierarchies built by this process are usually sensitive to query changes. For search sessions with multiple queries, this could be undesirable since it may leave users with a seemingly random overview and partly diminish the benefits that SRC intends to offer. We propose to integrate external knowledge from Wikipedia when building concept hierarchies to boost their stability for session queries. Our evaluations on both the TREC 2010 and 2011 Session tracks demonstrate that the proposed approaches outperform the state-of-the-art hierarchy construction algorithms in the stability of search result organization. | |||
| Updating Users about Time Critical Events | | BIBA | Full-Text | 483-494 | |
| Qi Guo; Fernando Diaz; Elad Yom-Tov | |||
| During unexpected events such as natural disasters, individuals rely on the information generated by news outlets to form their understanding of these events. This information, while often voluminous, is frequently degraded by the inclusion of unimportant, duplicate, or wrong information. It is important to be able to present users with only the novel, important information about these events as they develop. We present the problem of updating users about time critical news events, and focus on the task of deciding which information to select for updating users as an event develops. We propose a solution to this problem which incorporates techniques from information retrieval and multi-document summarization and evaluate this approach on a set of historic events using a large stream of news documents. We also introduce an evaluation method which is significantly less expensive than traditional approaches to temporal summarization. | |||
| Comparing Crowd-Based, Game-Based, and Machine-Based Approaches in Initial Query and Query Refinement Tasks | | BIBA | Full-Text | 495-506 | |
| Christopher G. Harris; Padmini Srinivasan | |||
| Human computation techniques have demonstrated their ability to accomplish portions of tasks that machine-based techniques find difficult. Query refinement is a task that may benefit from human involvement. We conduct an experiment that evaluates the contributions of two user types: student participants and crowdworkers hired from an online labor market. Human participants are assigned to use one of two query interfaces: a traditional web-based interface or a game-based interface. We ask each group to manually construct queries to respond to TREC information needs and calculate their resulting recall and precision. Traditional web interface users are provided feedback on their initial queries and asked to use this information to reformulate their original queries. Game interface users are provided with instant scoring and asked to refine their queries based on their scores. We measure the resulting feedback-based improvement for each group and compare the results from human computation techniques to machine-based algorithms. | |||
| Reducing the Uncertainty in Resource Selection | | BIBA | Full-Text | 507-519 | |
| Ilya Markov; Leif Azzopardi; Fabio Crestani | |||
| The distributed retrieval process is plagued by uncertainty. Sampling, selection, merging and ranking are all based on very limited information compared to centralized retrieval. In this paper, we focus our attention on reducing the uncertainty within the resource selection phase by obtaining a number of estimates, rather than relying upon only one point estimate. We propose three methods for reducing uncertainty which are compared against state-of-the-art baselines across three distributed retrieval testbeds. Our results show that the proposed methods significantly improve baselines, reduce the uncertainty and improve robustness of resource selection. | |||
| Exploiting Time in Automatic Image Tagging | | BIBAK | Full-Text | 520-531 | |
| Philip J. McParlane; Joemon M. Jose | |||
| Existing automatic image annotation (AIA) models that depend solely on low-level image features often produce poor results, particularly when annotating real-life collections. Tag co-occurrence has been shown to improve image annotation by identifying additional keywords associated with user-provided keywords. However, existing approaches have treated tag co-occurrence as a static measure over time, thereby ignoring the temporal trends of many tags. The temporal distribution of tags, caused by events, seasons, memes, etc., however, provides a strong source of evidence beyond keywords for AIA. In this paper we propose a temporal tag co-occurrence approach to improve upon the current state-of-the-art automatic image annotation model. By replacing the annotated tags with more temporally significant tags, we achieve statistically significant increases in annotation accuracy on a real-life timestamped image collection from Flickr. Keywords: automatic image annotation; temporal | |||
| Using Text-Based Web Image Search Results Clustering to Minimize Mobile Devices Wasted Space-Interface | | BIBA | Full-Text | 532-544 | |
| Jose G. Moreno; Gaël Dias | |||
| The recent shift in human-computer interaction from desktop to mobile computing fosters the need for new interfaces for exploring web image search results. In order to ease users' efforts, we present a set of state-of-the-art ephemeral clustering algorithms, which allow us to summarize web image search results into meaningful clusters. This way of presenting visual information on mobile devices is exhaustively evaluated based on two main criteria: clustering accuracy, which must be maximized, and wasted space-interface, which must be minimized. For the first case, we use a broad set of metrics to evaluate ephemeral clustering over a public gold standard data set of web images. For the second case, we propose a new metric to evaluate the mismatch of the used space-interface between the ground truth and the cluster distribution obtained by ephemeral clustering. The results show that high divergences exist between clustering accuracy and used space maximization. As a consequence, the trade-off of cluster-based exploration of web image search results on mobile devices is difficult to define, although our study shows some clear positive results. | |||
| Discovery and Analysis of Evolving Topical Social Discussions on Unstructured Microblogs | | BIBA | Full-Text | 545-556 | |
| Kanika Narang; Seema Nagar; Sameep Mehta; L. V. Subramaniam; Kuntal Dey | |||
| Social networks have emerged as hubs of user-generated content. Online social conversations can be used to retrieve users' interests towards given topics and trends. Microblogging platforms like Twitter are primary examples of social networks with significant volumes of topical message exchanges between users. However, unlike traditional online discussion forums, blogs and social networking sites, explicit discussion threads are absent from microblogging networks like Twitter. This inherent absence of any conversation framework makes it challenging to distinguish conversations from mere topical interests. In this work, we explore semantic, social and temporal relationships of topical clusters formed in Twitter to identify conversations. We devise an algorithm comprising a sequence of steps such as text clustering, topical similarity detection using TF-IDF and WordNet, and intersecting social, semantic and temporal graphs to discover social conversations around topics. We further qualitatively show the presence of social localization of discussion threads. Our results suggest that discussion threads evolve significantly over social networks on Twitter. Our algorithm to find social discussion threads can be used in settings such as social information spreading applications and information diffusion analyses on microblog networks. | |||
| Web Credibility: Features Exploration and Credibility Prediction | | BIBAK | Full-Text | 557-568 | |
| Alexandra Olteanu; Stanislav Peshterliev; Xin Liu; Karl Aberer | |||
| The open nature of the World Wide Web makes evaluating webpage credibility challenging for users. In this paper, we aim to automatically assess web credibility by investigating various characteristics of webpages. Specifically, we first identify features from textual content, link structure and webpage design, as well as their social popularity learned from popular social media sites (e.g., Facebook, Twitter). A set of statistical analysis methods is applied to select the most informative features, which are then used to infer webpage credibility by employing supervised learning algorithms. Real dataset-based experiments under two application settings show that we attain an accuracy of 75% for classification, and an improvement of 53% in the mean absolute error (MAE), with respect to the random baseline approach, for regression. Keywords: web credibility; feature analysis; classification; regression | |||
| Query Suggestions for Textual Problem Solution Repositories | | BIBA | Full-Text | 569-581 | |
| Deepak P.; Sutanu Chakraborti; Deepak Khemani | |||
| Textual problem-solution repositories are available today in various forms, most commonly as problem-solution pairs from community question answering systems. Modern search engines that operate on the web can suggest possible completions in real-time for users as they type in queries. We study the problem of generating intelligent query suggestions for users of customized search systems that enable querying over problem-solution repositories. Due to the small scale and specialized nature of such systems, we often do not have the luxury of depending on query logs for finding query suggestions. We propose a retrieval model for generating query suggestions for search on a set of problem solution pairs. We harness the problem solution partition inherent in such repositories to improve upon traditional query suggestion mechanisms designed for systems that search over general textual corpora. We evaluate our technique over real problem-solution datasets and illustrate that our technique provides large and statistically significant improvements over the state-of-the-art technique in query suggestion. | |||
| Improving ESA with Document Similarity | | BIBA | Full-Text | 582-593 | |
| Tamara Polajnar; Nitish Aggarwal; Kartik Asooja; Paul Buitelaar | |||
| Explicit semantic analysis (ESA) is a technique for computing semantic relatedness between natural language texts. It is a document-based distributional model similar to latent semantic analysis (LSA), which is often built on the Wikipedia database when it is required for general English usage. Unlike LSA, however, ESA does not use dimensionality reduction, and therefore it is sometimes unable to account for similarity between words that do not co-occur with same concepts, even if their concepts themselves cover similar subjects. In the Wikipedia implementation ESA concepts are Wikipedia articles, and the Wikilinks between the articles are used to overcome the concept-similarity problem. In this paper, we provide two general solutions for integration of concept-concept similarities into the ESA model, ones that do not rely on a particular corpus structure and do not alter the explicit concept-mapping properties that distinguish ESA from models like LSA and latent Dirichlet allocation (LDA). | |||
| Ontology-Based Word Sense Disambiguation for Scientific Literature | | BIBA | Full-Text | 594-605 | |
| Roman Prokofyev; Gianluca Demartini; Alexey Boyarsky; Oleg Ruchayskiy; Philippe Cudré-Mauroux | |||
| Scientific documents often adopt a well-defined vocabulary and avoid the use of ambiguous terms. However, as soon as documents from different research sub-communities are considered in combination, many scientific terms become ambiguous, as the same term can refer to different concepts from different sub-communities. The ability to correctly identify the right sense of a given term can considerably improve the effectiveness of retrieval models, and can also support additional features such as search diversification. This is even more critical when applied to explorative search systems within the scientific domain.
In this paper, we propose novel semi-supervised methods for term disambiguation that leverage the structure of a community-based ontology of scientific concepts. Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense that was originally picked by the authors of a scientific publication. Experimental evidence over two different test collections from the physics and biomedical domains shows that the proposed method is effective and outperforms state-of-the-art approaches based on feature vectors constructed out of term co-occurrences as well as standard supervised approaches. | |||
| A Language Modeling Approach for Extracting Translation Knowledge from Comparable Corpora | | BIBAK | Full-Text | 606-617 | |
| Razieh Rahimi; Azadeh Shakery | |||
| A main challenge in cross-language information retrieval is to estimate a translation language model, as its quality directly affects the retrieval performance. The translation language model is built using translation resources such as bilingual dictionaries, parallel corpora, or comparable corpora. In general, high quality resources may not be available for scarce-resource languages. For these languages, efficient exploitation of commonly available resources such as comparable corpora is considered more crucial. In this paper, we focus on using only comparable corpora to extract translation information more efficiently. We propose a language modeling approach for estimating the translation language model. The proposed method is based on probability distribution estimation, and can be tuned more easily than previous, heuristically adjusted approaches. Experimental results show a significant improvement in the translation quality and CLIR performance compared to the previous approaches. Keywords: Cross-language Information Retrieval; Translation Language Models; Comparable Corpora | |||
| Content-Based Re-ranking of Text-Based Image Search Results | | BIBAK | Full-Text | 618-629 | |
| Franck Thollard; Georges Quénot | |||
| This article presents a method for re-ranking images retrieved by a classical keyword-based search engine. This method uses the visual content of the images and is based on the idea that the relevant images should be similar to each other, while the non-relevant images should be different from each other and from relevant images. This idea has been implemented by ranking the images according to their average distances to their nearest neighbors. This query-dependent re-ranking is completed by a query-independent re-ranking taking into account the fact that some types of images are non-relevant for almost all queries. This idea is implemented by training a classifier on results from all queries in the training set. The re-ranking is successfully evaluated on classical datasets built with Exalead™ and Google Images™ search engines. Keywords: Image retrieval; re-ranking | |||
| Encoding Local Binary Descriptors by Bag-of-Features with Hamming Distance for Visual Object Categorization | | BIBA | Full-Text | 630-641 | |
| Yu Zhang; Chao Zhu; Stephane Bres; Liming Chen | |||
| This paper presents a novel method for encoding local binary descriptors for Visual Object Categorization (VOC). Nowadays, local binary descriptors, e.g. LBP and BRIEF, have become very popular in image matching tasks because of their fast computation and matching using binary bitstrings. However, the bottleneck of applying them in the domain of VOC lies in the high-dimensional histograms produced by encoding these binary bitstrings into decimal codes. To solve this problem, we propose to encode local binary bitstrings directly by the Bag-of-Features (BoF) model with Hamming distance. The advantages of this approach are two-fold: (1) It solves the high dimensionality issue of the traditional binary bitstring encoding methods, making local binary descriptors more feasible for the task of VOC, especially when more bits are considered; (2) It is computationally efficient because the Hamming distance, which is very suitable for comparing bitstrings, is based on bitwise XOR operations that can be computed quickly on modern CPUs. The proposed method is validated by applying it to the LBP feature for the purpose of VOC. The experimental results on the PASCAL VOC 2007 benchmark show that our approach effectively improves the recognition accuracy compared to the traditional LBP feature. | |||
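The XOR-based Hamming comparison this entry relies on is small enough to sketch. The function names and the codeword-assignment step are illustrative assumptions, not the paper's code:

```python
# Hamming distance between binary descriptors via XOR + popcount,
# and its use in a Bag-of-Features assignment step (illustrative sketch).

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two equal-length bitstrings."""
    return bin(a ^ b).count("1")

def nearest_codeword(descriptor: int, codebook) -> int:
    """Assign a binary descriptor to its closest visual word by Hamming distance."""
    return min(range(len(codebook)), key=lambda i: hamming(descriptor, codebook[i]))
```

Because `a ^ b` flags exactly the differing bits, counting them gives the Hamming distance in a handful of CPU instructions, which is the efficiency argument the abstract makes for comparing bitstrings directly instead of expanding them into high-dimensional decimal histograms.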
| Recommending High Utility Query via Session-Flow Graph | | BIBAK | Full-Text | 642-655 | |
| Xiaofei Zhu; Jiafeng Guo; Xueqi Cheng; Yanyan Lan; Wolfgang Nejdl | |||
| Query recommendation is an integral part of modern search engines that helps users fulfill their information needs. Traditional query recommendation methods usually focus on recommending relevant queries to users, i.e., alternative queries whose search intent is close to that of the original query. However, the ultimate goal of query recommendation is to assist users in accomplishing their search tasks successfully, not merely to find relevant queries, even though such queries can sometimes return useful search results. To better achieve this goal, a more reasonable way is to recommend high utility queries to users, i.e., queries that can return more useful information. In this paper, we propose a novel utility query recommendation approach based on an absorbing random walk on the session-flow graph, which can learn queries' utility by simultaneously modeling both users' reformulation behaviors and click behaviors. Extensive experiments were conducted on real query logs, and the results show that our method significantly outperforms the state-of-the-art methods under the evaluation metrics QRR and MRD. Keywords: Query Recommendation; Absorbing Random Walk; Session-Flow Graph | |||
| URL Redirection Accounting for Improving Link-Based Ranking Methods | | BIBAK | Full-Text | 656-667 | |
| Maksim Zhukovskii; Gleb Gusev; Pavel Serdyukov | |||
| Traditional link-based web ranking algorithms are applied to web snapshots in the form of webgraphs consisting of pages as vertices and links as edges. When constructing the webgraph, researchers pay little attention to the particular method by which links are taken into account, while certain details may significantly affect the contribution of link-based factors to ranking. Furthermore, researchers use small subgraphs of the webgraph for more efficient evaluation of new algorithms. They usually consider a graph induced by pages, for example, of a certain first-level domain. In this paper we reveal a significant dependence of PageRank on the method of accounting for redirects while constructing the webgraph. We evaluate several natural ways of accounting for redirects on a large-scale domain and find an optimal one, which turns out to be non-trivial. Moreover, we experimentally compare different ways of extracting a small subgraph for multiple evaluations and reveal some essential shortcomings of traditional approaches. Keywords: Redirects; PageRank; sample of the web; webgraph | |||
| Lo mejor de dos idiomas -- Cross-Lingual Linkage of Geotagged Wikipedia Articles | | BIBAK | Full-Text | 668-671 | |
| Dirk Ahlers | |||
| Different language versions of Wikipedia contain articles referencing the same place. However, an article available in one language is not necessarily available in, or linked to from, another language version. This paper examines geotagged articles describing places in Honduras in both the Spanish and the English language versions. It demonstrates that a method based on simple features can reliably identify article pairs describing the same semantic place concept, and evaluates it against the existing interlinks as well as a manual assessment. Keywords: Geospatial Web Search; Data fusion; Cross-lingual Information Retrieval; Record Linkage; Entity Resolution; Wikipedia; Honduras | |||
| A Pilot Study on Using Profile-Based Summarisation for Interactive Search Assistance | | BIBA | Full-Text | 672-675 | |
| Azhar Alhindi; Udo Kruschwitz; Chris Fox | |||
| Text summarisation is the process of distilling the most important information from a source to produce an abridged version for a particular user or task. This poster investigates the use of profile-based summarisation to provide contextualisation and interactive support for enterprise searches. We employ log analysis to acquire continuously updated profiles to provide profile-based summarisations of search results. These profiles could be capturing an individual's interests or (as discussed here) those of a group of users. Here we report on a first pilot study. | |||
| Exploring Patent Passage Retrieval Using Nouns Phrases | | BIBAK | Full-Text | 676-679 | |
| Linda Andersson; Parvaz Mahdabi; Allan Hanbury; Andreas Rauber | |||
| This paper presents experiments which were initially carried out for the
Patent Passage Retrieval track of CLEF-IP 2012. The Passage Retrieval module
was implemented independently of the Document Retrieval system. In the Passage
Retrieval module we make use of Natural Language Processing applications
(WordNet and the Stanford Part-of-Speech tagger) for lemmatization and phrase
(multi-word unit) retrieval. We show that by applying simple rule-based
modifications and targeting only specific language instances (noun phrases),
the use of general NLP tools for phrase retrieval increases the performance of
a Patent Passage Information Extraction system. Keywords: Passage Retrieval; Patent Search; Natural Language Processing | |||
| Characterizing Health-Related Community Question Answering | | BIBAK | Full-Text | 680-683 | |
| Alexander Beloborodov; Artem Kuznetsov; Pavel Braslavski | |||
| Our ongoing project is aimed at improving information access to
narrow-domain collections of questions and answers. This poster demonstrates
how out-of-the-box tools and domain dictionaries can be applied to community
question answering (CQA) content in the health domain. This approach can be
used to improve user interfaces and search over CQA data, as well as to
evaluate content quality. The study is a first-time use of a sizable dataset
from the Russian CQA site Otvety@Mail.Ru. Keywords: community question answering; CQA; consumer health information; content
analysis; latent Dirichlet allocation; LDA; Otvety@Mail.Ru | |||
| Topic Models Can Improve Domain Term Extraction | | BIBAK | Full-Text | 684-687 | |
| Elena Bolshakova; Natalia Loukachevitch; Michael Nokel | |||
| The paper describes the results of an experimental study of topic models
applied to the task of single-word term extraction. The experiments encompass
several probabilistic and non-probabilistic topic models and demonstrate that
topic information improves the quality of term extraction, and that NMF with
KL-divergence minimization is the best among the models under study. Keywords: Topic Models; Clustering; Single-Word Term Extraction | |||
| A Topic Person Multi-polarization Method Using Friendship Network Analysis | | BIBAK | Full-Text | 688-692 | |
| Zhong-Yong Chen; Chien Chin Chen | |||
| In this paper, we identify competing viewpoints among the persons mentioned
in a set of topic documents. We propose a method to construct a friendship
network of the persons and present a graph-partition-based multi-polarization
algorithm to group the persons into clusters with competing viewpoints. Keywords: Person Multi-polarization; Graph Partition; Social Network Analysis | |||
| Improving Cyberbullying Detection with User Context | | BIBA | Full-Text | 693-696 | |
| Maral Dadvar; Dolf Trieschnigg; Roeland Ordelman; Franciska de Jong | |||
| The negative consequences of cyberbullying are becoming more alarming every day and technical solutions that allow for taking appropriate action by means of automated detection are still very limited. Up until now, studies on cyberbullying detection have focused on individual comments only, disregarding context such as users' characteristics and profile information. In this paper we show that taking user context into account improves the detection of cyberbullying. | |||
| Snippet-Based Relevance Predictions for Federated Web Search | | BIBAK | Full-Text | 697-700 | |
| Thomas Demeester; Dong Nguyen; Dolf Trieschnigg; Chris Develder; Djoerd Hiemstra | |||
| How well can the relevance of a page be predicted, purely based on snippets?
This would be highly useful in a Federated Web Search setting where caching
large amounts of result snippets is more feasible than caching entire pages.
The experiments reported in this paper make use of result snippets and pages
from a diverse set of actual Web search engines. A linear classifier is trained
to predict the snippet-based user estimate of page relevance, but also, to
predict the actual page relevance, again based on snippets alone. The presented
results confirm the validity of the proposed approach and provide promising
insights into future result merging strategies for a Federated Web Search
setting. Keywords: Federated Web search; snippets; classification; relevance judgments | |||
| Designing Human-Readable User Profiles for Search Evaluation | | BIBA | Full-Text | 701-705 | |
| Carsten Eickhoff; Kevyn Collins-Thompson; Paul Bennett; Susan Dumais | |||
| Forming an accurate mental model of a user is crucial for the qualitative design and evaluation steps of many information-centric applications such as web search, content recommendation, or advertising. This process can often be time-consuming as search and interaction histories become verbose. In this work, we present and analyze the usefulness of concise human-readable user profiles in order to enhance system tuning and evaluation by means of user studies. | |||
| Sentiment Classification Based on Phonetic Characteristics | | BIBAK | Full-Text | 706-709 | |
| Sergei Ermakov; Liana Ermakova | |||
| The majority of sentiment classifiers are based on dictionaries or require
large amounts of training data. Unfortunately, dictionaries contain only
limited data, and machine-learning classifiers using word-based features do
not consider the subword parts of words, which makes them domain-specific,
less effective, and not robust to orthographic mistakes. We attempt to
overcome these drawbacks by developing a context-independent approach. Our
main idea is to determine phonetic features of words that could affect their
sentiment polarity. These features are applicable to all words, which
eliminates the need for continuous manual dictionary renewal. Our experiments
are based on a sentiment dictionary for the Russian language. We apply
phonetic features to predict word sentiment based on machine learning. Keywords: sentiment analysis; machine learning; phonosemantics; n-grams | |||
| Cross-Language Plagiarism Detection Using a Multilingual Semantic Network | | BIBA | Full-Text | 710-713 | |
| Marc Franco-Salvador; Parth Gupta; Paolo Rosso | |||
| Cross-language plagiarism refers to the type of plagiarism where the source and suspicious documents are in different languages. Plagiarism detection across languages is still in its infancy. In this article, we propose a new graph-based approach that uses a multilingual semantic network to compare document paragraphs in different languages. In order to investigate the proposed approach, we used the German-English and Spanish-English cross-language plagiarism cases of the PAN-PC'11 corpus. We compare the obtained results with two state-of-the-art models. Experimental results indicate that our graph-based approach is a good alternative for cross-language plagiarism detection. | |||
| Classification of Opinion Questions | | BIBA | Full-Text | 714-717 | |
| Hongping Fu; Zhendong Niu; Chunxia Zhang; Lu Wang; Peng Jiang; Ji Zhang | |||
| With the increasing growth of opinions on news, services and so on, automatic opinion question answering aims at answering questions involving the views of persons, and plays an important role in the fields of sentiment analysis and information recommendation. One challenge is that opinion questions may contain different types of question focuses that affect answer extraction, such as holders, comparison and location. In this paper, we build a taxonomy of opinion questions and propose a hierarchical classification technique to classify opinion questions according to our constructed taxonomy. This technique first uses a Bayesian classifier and then employs an approach leveraging semantic similarities between questions. Experimental results show that our approach significantly improves performance over the baseline and other related work. | |||
| Tempo of Search Actions to Modeling Successful Sessions | | BIBA | Full-Text | 718-721 | |
| Kazuya Fujikawa; Hideo Joho; Shin-ichi Nakayama | |||
| Considering the search process in the evaluation of interactive information retrieval (IIR) is a challenging issue. This paper explores the tempo of search actions (query, click, and judgement) to measure people's search process and performance. When we analysed how people consume their search resource (i.e., the total number of search actions taken to complete a task) over time, we observed different patterns in successful and unsuccessful sessions. Successful sessions tend to have a regular tempo of search actions, while poor sessions tend to have an uneven distribution of resource usage. The resource consumption graph also allows us to observe which parts of the search process were affected by experimental conditions. Therefore, this paper suggests that the tempo of search actions can be exploited to model successful search sessions. | |||
| Near-Duplicate Detection for Online-Shops Owners: An FCA-Based Approach | | BIBAK | Full-Text | 722-725 | |
| Dmitry I. Ignatov; Andrey V. Konstantiov; Yana Chubis | |||
| We propose a prototype of a near-duplicate detection system for web-shop
owners. It is typical for such online businesses to buy descriptions of their
goods from so-called copywriters. A copywriter may cheat from time to time and
provide the owner with almost identical descriptions for different items. In
this paper we demonstrate how FCA can be used for fast clustering and for
revealing such duplicates in a real online perfume shop's dataset. Keywords: Near duplicate detection; Formal Concept Analysis; E-commerce | |||
| Incremental Reranking for Hierarchical Text Classification | | BIBA | Full-Text | 726-729 | |
| Qi Ju; Alessandro Moschitti | |||
| The top-down method is efficient and commonly used in hierarchical text classification. Its main drawback is the error propagation from the higher to the lower nodes. To address this issue we propose an efficient incremental reranking model of the top-down classifier decisions. We build a multiclassifier for each hierarchy node, constituted by the latter and its children. Then we generate several classification hypotheses with such classifiers and rerank them to select the best one. Our rerankers exploit category dependencies, which allow them to recover from the multiclassifier errors, whereas their application in top-down fashion results in high efficiency. The experimentation on Reuters Corpus Volume 1 (RCV1) shows that our incremental reranking is as accurate as global rerankers but at least one order of magnitude faster. | |||
| Topic Model for User Reviews with Adaptive Windows | | BIBA | Full-Text | 730-733 | |
| Takuya Konishi; Fuminori Kimura; Akira Maeda | |||
| We discuss the problem of applying topic models to user reviews. Unlike ordinary documents, reviews in the same category are similar to each other, which makes it difficult to estimate meaningful topics from them. In this paper, we develop a new model for this problem using the distance-dependent Chinese restaurant process. It does not require choosing a window size and can consider neighboring sentences adaptively. We compare this model to the previously proposed Multi-grain latent Dirichlet allocation and show that our model achieves better results in terms of perplexity. | |||
| Time Based Feedback and Query Expansion for Twitter Search | | BIBA | Full-Text | 734-737 | |
| Naveen Kumar; Benjamin Carterette | |||
| Twitter is an accepted platform among users for expressing views in a short text called a "tweet". The application of search models to platforms like Twitter is still an open-ended question, though the creation of the TREC Microblog track in 2011 aims to help resolve it. In this paper, we propose a modified language search model by extending a traditional query-likelihood language model with time-based feedback and query expansion. The proposed method makes use of two types of feedback: time feedback, by evaluating the time distribution of top retrieved tweets, and query expansion, by using highly frequent terms in top tweets as expansion terms. Our results suggest that using both types of feedback we obtain better results than with a standard language model, and that the time-based feedback uniformly improves results whether or not query expansion is used. | |||
| Is Intent-Aware Expected Reciprocal Rank Sufficient to Evaluate Diversity? | | BIBA | Full-Text | 738-742 | |
| Teerapong Leelanupab; Guido Zuccon; Joemon M. Jose | |||
| In this paper we define two models of users that require diversity in search results; these models are theoretically grounded in the notions of intrinsic and extrinsic diversity. We then examine Intent-Aware Expected Reciprocal Rank (ERR-IA), one of the official measures used to assess diversity in TREC 2011-12, with respect to the proposed user models. By analyzing ranking preferences as expressed by the user models and those estimated by ERR-IA, we investigate whether ERR-IA assesses document rankings according to the requirements of the diversity retrieval task expressed by the two models. Empirical results demonstrate that ERR-IA neglects query-intent coverage by attributing excessive importance to redundant relevant documents. ERR-IA's behavior is contrary to the user models, which require measures to first assess diversity through the coverage of intents, and then assess the redundancy of relevant intents. Furthermore, diversity should be considered separately from document relevance and the documents' positions in the ranking. | |||
| Late Data Fusion for Microblog Search | | BIBA | Full-Text | 743-746 | |
| Shangsong Liang; Maarten de Rijke; Manos Tsagkias | |||
| The character of microblog environments raises challenges for microblog search because relevancy becomes one of the many aspects for ranking documents. We concentrate on merging multiple ranking strategies at post-retrieval time for the TREC Microblog task. We compare several state-of-the-art late data fusion methods, and present a new semi-supervised variant that accounts for microblog characteristics. Our experiments show the utility of late data fusion in microblog search, and that our method helps boost retrieval effectiveness. | |||
| A Task-Specific Query and Document Representation for Medical Records Search | | BIBA | Full-Text | 747-751 | |
| Nut Limsopatham; Craig Macdonald; Iadh Ounis | |||
| One of the challenges of searching in the medical domain is to deal with the complexity and ambiguity of medical terminology. Concept-based representation approaches using terminology from domain-specific resources have been developed to handle such a challenge. However, it has been shown that these techniques are effective only when combined with a traditional term-based representation approach. In this paper, we propose a novel technique to represent medical records and queries by focusing only on medical concepts essential for the information need of a medical search task. Such a representation could enhance retrieval effectiveness since only the medical concepts crucial to the information need are taken into account. We evaluate the retrieval effectiveness of our proposed approach in the context of the TREC 2011 Medical Records track. The results demonstrate the effectiveness of our approach, as it significantly outperforms a baseline where all concepts are represented, and markedly outperforms a traditional term-based representation baseline. Moreover, when combining the relevance scores obtained from our technique and a term-based representation approach, the achieved performance is comparable to the best TREC 2011 systems. | |||
| On CORI Results Merging | | BIBA | Full-Text | 752-755 | |
| Ilya Markov; Avi Arampatzis; Fabio Crestani | |||
| Score normalization and results merging are important components of many IR applications. Recently MinMax -- an unsupervised linear score normalization method -- was shown to perform quite well across various distributed retrieval testbeds, although based on strong assumptions. The CORI results merging method relaxes these assumptions to some extent and significantly improves the performance of MinMax. We parameterize CORI and evaluate its performance across a range of parameter settings. Experimental results on three distributed retrieval testbeds show that CORI significantly outperforms state-of-the-art results merging and score normalization methods when its parameter goes to infinity. | |||
| Detecting Friday Night Party Photos: Semantics for Tag Recommendation | | BIBA | Full-Text | 756-759 | |
| Philip J. McParlane; Yelena Mejova; Ingmar Weber | |||
| Multimedia annotation is central to its organization and retrieval -- a task which tag recommendation systems attempt to simplify. We propose a photo tag recommendation system which automatically extracts semantics from visual and meta-data features to complement existing tags. Compared to standard content/tag-based models, these automatic tags provide a richer description of the image and especially improve performance in the case of the "cold start problem". | |||
| Optimizing nDCG Gains by Minimizing Effect of Label Inconsistency | | BIBA | Full-Text | 760-763 | |
| Pavel Metrikov; Virgil Pavlu; Javed A. Aslam | |||
| We focus on the nDCG choice of gains, and in particular on the fracture between the large differences in exponential gains of high relevance labels and the not-so-small confusion, or inconsistency, between these labels in data. We show that better gains can be derived from data by measuring the label inconsistency, to the point that virtually indistinguishable labels correspond to equal gains. Our derived optimal gains make a better nDCG objective for training Learning to Rank algorithms. | |||
| Least Square Consensus Clustering: Criteria, Methods, Experiments | | BIBAK | Full-Text | 764-767 | |
| Boris G. Mirkin; Andrey Shestakov | |||
| We build on a consensus clustering framework developed three decades ago in
Russia and experimentally demonstrate that our least squares consensus
clustering algorithm consistently outperforms several recent consensus
clustering methods. Keywords: consensus clustering; ensemble clustering; least squares | |||
| Domain Adaptation of Statistical Machine Translation Models with Monolingual Data for Cross Lingual Information Retrieval | | BIBA | Full-Text | 768-771 | |
| Vassilina Nikoulina; Stéphane Clinchant | |||
| Statistical Machine Translation (SMT) is often used as a black box in CLIR tasks. We propose an adaptation method for an SMT model relying on the monolingual statistics that can be extracted from the document collection (both source and target, if available). We evaluate our approach on the CLEF Domain Specific task (German-English and English-German) and show that very simple document collection statistics integrated into the SMT translation model yield good gains both in terms of IR metrics (MAP, P10) and MT evaluation metrics (BLEU, TER). | |||
| Text Summarization while Maximizing Multiple Objectives with Lagrangian Relaxation | | BIBA | Full-Text | 772-775 | |
| Masaaki Nishino; Norihito Yasuda; Tsutomu Hirao; Jun Suzuki; Masaaki Nagata | |||
| We present an extractive text summarization method that solves an optimization problem involving the maximization of multiple objectives. Though we can obtain high-quality summaries if we solve the problem exactly with our formulation, it is NP-hard and cannot scale to large problem sizes. Our solution is an efficient, high-quality approximation method based on Lagrangian relaxation (LR) techniques. In experiments on the DUC'04 dataset, our LR-based method matches the performance of state-of-the-art methods. | |||
| Towards Detection of Child Sexual Abuse Media: Categorization of the Associated Filenames | | BIBAK | Full-Text | 776-779 | |
| Alexander Panchenko; Richard Beaufort; Hubert Naets; Cédrick Fairon | |||
| This paper approaches the problem of automatic pedophile content
identification. We present a system for filename categorization, which is
trained to identify suspicious files on P2P networks. In our initial
experiments, we used regular pornography data as a substitute for child
pornography. Our system separates filenames of pornographic media from the
others with an accuracy that reaches 91-97%. Keywords: short text categorization; P2P networks; child pornography | |||
| Leveraging Latent Concepts for Retrieving Relevant Ads for Short Text | | BIBAK | Full-Text | 780-783 | |
| Ankit Patil; Kushal Dave; Vasudeva Varma | |||
| Microblogging platforms are increasingly becoming a lucrative prospect for
advertisers to attract customers. The challenge with advertising on such
platforms is that there is very little content from which to retrieve relevant
ads. As the microblogging content is short and noisy, and the ads are short
too, there is a high degree of lexical/vocabulary mismatch between a micropost
and the ads. To bridge this vocabulary mismatch, we propose a conceptual
approach that transforms the content into a conceptual space representing its
latent concepts. We empirically show that the conceptual model performs better
than various state-of-the-art techniques; the performance gains obtained are
substantial and significant. Keywords: Content Targeted Advertising; Semantic Match | |||
| Robust PLSA Performs Better Than LDA | | BIBAK | Full-Text | 784-787 | |
| Anna Potapenko; Konstantin Vorontsov | |||
| In this paper we introduce a generalized learning algorithm for
probabilistic topic models (PTM). Many known and new algorithms for the PLSA,
LDA, and SWB models can be obtained as its special cases by choosing a subset
of the following "options": regularization, sampling, update frequency,
sparsing and robustness. We show that a robust topic model, which
distinguishes specific, background and topic terms, does not need Dirichlet
regularization and provides a controllably sparse solution. Keywords: topic modeling; Gibbs sampling; perplexity; robustness | |||
| WANTED: Focused Queries for Focused Retrieval | | BIBAK | Full-Text | 788-791 | |
| Georgina Ramírez | |||
| Focused retrieval tasks such as XML or passage retrieval strive to provide
direct access to the relevant content of a document. In these scenarios users
can pose focused queries, i.e., queries that restrict the type of output the
user wants to see. We first analyze several characteristics of this type of
request and show that they differ substantially from unfocused ones. We also
show that typical XML retrieval systems tend to perform poorly on focused
queries and that system rankings differ considerably when processing each of
the types. Finally, we argue that the unbalanced number of focused queries in
the INEX benchmark topic set might lead to misleading interpretations of the
evaluation results. To get better insight into the systems' ability to perform
focused search, more focused queries are needed. Keywords: INEX; Focused search; XML retrieval; Evaluation | |||
| Exploiting Click Logs for Adaptive Intranet Navigation | | BIBA | Full-Text | 792-795 | |
| Sharhida Zawani Saad; Udo Kruschwitz | |||
| Web sites and intranets can be difficult to navigate as they tend to be rather static, and a new user might have no idea which documents are most relevant to his or her need. Our aim is to capture the navigational behaviour of existing users (as recorded in the click logs) so that we can assist future users by proposing the most relevant pages as they navigate the site, without changing the actual Web site. We do this adaptively, so that a continuous learning cycle is employed. In this paper we explore three different algorithms that can be employed to learn such suggestions from navigation logs. We find that users managed to conduct the tasks significantly more quickly than with the (purely frequency-based) baseline when ant colony optimisation or random walk approaches were applied to the log data to build a suggestion model. | |||
| Leveraging Microblogs for Spatiotemporal Music Information Retrieval | | BIBA | Full-Text | 796-799 | |
| Markus Schedl | |||
| We present results of text data mining experiments for music retrieval, analyzing microblogs gathered from November 2011 to September 2012 to infer music listening patterns around the world. We assess relationships between particular music preferences and spatiotemporal properties, such as month, weekday, and country, as well as the temporal stability of listening activities. The findings of our study will help improve music retrieval and recommendation systems by allowing geospatial and cultural information to be incorporated into models for music retrieval, which has not been explored before. | |||
| Topic-Focused Summarization of Chat Conversations | | BIBA | Full-Text | 800-803 | |
| Arpit Sood; Thanvir P. Mohamed; Vasudeva Varma | |||
| In this paper, we propose a novel approach to address the problem of chat summarization. We summarize real-time chat conversations which contain multiple users with frequent shifts in topic. Our approach consists of two phases. In the first phase, we leverage topic modeling using web documents to find the primary topic of discussion in the chat. Then, in the summary generation phase, we build a semantic word space to score sentences based on their association with the primary topic. Experimental results show that our method significantly outperforms the baseline systems on ROUGE F-scores. | |||
| Risk Ranking from Financial Reports | | BIBAK | Full-Text | 804-807 | |
| Ming-Feng Tsai; Chuan-Ju Wang | |||
| This paper attempts to use soft information in finance to rank the risk
levels of a set of companies. Specifically, we deal with a ranking problem
over a collection of financial reports, in which each report is associated
with a company. Using the text information in the reports, the so-called soft
information, we apply learning-to-rank techniques to rank a set of companies
in line with their relative risk levels. In our experiments, a collection of
financial reports published annually by publicly-traded companies is employed
to evaluate our ranking approach; moreover, a regression-based approach is
also carried out for comparison. The experimental results show that our
ranking approach not only significantly outperforms the regression-based one,
but also identifies some interesting relations between financial terms. Keywords: Ranking; Soft Information; Volatility; Financial Report | |||
| An Initial Investigation on the Relationship between Usage and Findability | | BIBA | Full-Text | 808-811 | |
| Colin Wilkie; Leif Azzopardi | |||
| Ensuring that information within a website is findable is particularly important, because visitors who cannot find what they are looking for are likely to leave the site, or become very frustrated and switch to a competing site. While findability has been touted as important in web design, we wonder to what degree measures of findability are correlated with usage. To this end, we have conducted a preliminary study on three sub-domains across a number of measures of findability. | |||
| Sub-sentence Extraction Based on Combinatorial Optimization | | BIBA | Full-Text | 812-815 | |
| Norihito Yasuda; Masaaki Nishino; Tsutomu Hirao; Masaaki Nagata | |||
| This paper describes the prospect of word extraction for text summarization based on combinatorial optimization. Instead of the commonly used sentence-based approach, word-based approaches are preferable when highly compressed summaries are required. However, naively applying conventional methods for word extraction yields excessively fragmented summaries. We avoid this by restricting the number of fragments selected from each sentence to at most one when formulating the maximum coverage problem. Consequently, the method only chooses sub-sentences as fragments. Experiments show that our method matches the ROUGE scores of state-of-the-art systems without requiring any training or special parameters. | |||
| ADRTrace: Detecting Expected and Unexpected Adverse Drug Reactions from User Reviews on Social Media Sites | | BIBA | Full-Text | 816-819 | |
| Andrew Yates; Nazli Goharian | |||
| We automatically extract adverse drug reactions (ADRs) from consumer reviews provided on various drug social media sites to identify adverse reactions not reported by the United States Food and Drug Administration (FDA) but touted by consumers. We utilize various lexicons, identify patterns, and generate a synonym set that includes variations of medical terms. We identify "expected" and "unexpected" ADRs. Background (drug) language is utilized to evaluate the strength of the detected unexpected ADRs. Evaluation results for our synonym set and ADR extraction are promising. | |||
| The Impact of Temporal Intent Variability on Diversity Evaluation | | BIBA | Full-Text | 820-823 | |
| Ke Zhou; Stewart Whiting; Joemon M. Jose; Mounia Lalmas | |||
| To cope with the uncertainty involved with ambiguous or underspecified queries, search engines often diversify results to return documents that cover multiple interpretations, e.g. the car brand, animal or operating system for the query 'jaguar'. Current diversity evaluation measures take the popularity of the subtopics into account and aim to favour systems that promote most popular subtopics earliest in the result ranking. However, this subtopic popularity is assumed to be static over time. In this paper, we hypothesise that temporal subtopic popularity change is common for many topics and argue this characteristic should be considered when evaluating diversity. Firstly, to support our hypothesis we analyse temporal subtopic popularity changes for ambiguous queries through historic Wikipedia article viewing statistics. Further, by simulation, we demonstrate the impact of this temporal intent variability on diversity evaluation. | |||
| Re-leashed! The PuppyIR Framework for Developing Information Services for Children, Adults and Dogs | | BIBA | Full-Text | 824-827 | |
| Doug Dowie; Leif Azzopardi | |||
| Children are active information seekers, but research has suggested that services, designed with adults in mind, are a poor fit to their needs [1-3]. The goal of the PuppyIR project is to design, develop and deliver an open source framework for building information services specifically for children, which incorporates the current understanding of children's information seeking needs. This paper describes the framework's architecture, highlights two of its novel information processing components, and marks the release of the framework to the wider Interactive Information Retrieval community. PuppyIR provides an open and common framework for the rapid prototyping, development and evaluation of information services specifically for children. | |||
| A Web Mining Tool for Assistance with Creative Writing | | BIBAK | Full-Text | 828-831 | |
| Boris A. Galitsky; Sergei O. Kuznetsov | |||
| We develop a web mining tool for assistance with creative writing. The
relevance of web mining is achieved by computing similarities of parse trees
for queries and retrieved snippets. To assure a plausible flow of mental
states of the agents involved, a multi-agent behavior simulator is included in
the content generation algorithm. Keywords: content generation; web mining; simulating mental states | |||
| DS4: A Distributed Social and Semantic Search System | | BIBA | Full-Text | 832-836 | |
| Dionisis Kontominas; Paraskevi Raftopoulou; Christos Tryfonopoulos; Euripides G. M. Petrakis | |||
| We present DS4, a Distributed Social and Semantic Search System that allows users to share content among friends and clusters of users. In DS4 nodes that are semantically, thematically, or socially similar are automatically discovered and logically organised. Content retrieval is then performed by routing the query towards social friends and clusters of nodes that are likely to answer it. In this way, search receives two facets: the social facet, addressing friends, and the semantic facet, addressing nodes that are semantically close to the query. DS4 is scalable (requires no centralised component), privacy-aware (users maintain ownership and control over their content), automatic (requires no intervention by the user), general (works for any type of content), and adaptive (adjusts to changes of user content or interests). In this work, we aim to design the next generation of social networks that will offer open and adaptive design, and privacy-aware content management. | |||
| Serelex: Search and Visualization of Semantically Related Words | | BIBAK | Full-Text | 837-840 | |
| Alexander Panchenko; Pavel Romanov; Olga Morozova; Hubert Naets; Andrey Philippovich; Alexey Romanov; Cédrick Fairon | |||
| We present Serelex, a system that provides, given a query in English, a list of semantically related words. The terms are ranked according to an original semantic similarity measure learnt from a huge corpus. The system performs comparably to dictionary-based baselines, but does not require any semantic resource such as WordNet. Our study shows that users are completely satisfied with 70% of the query results. Keywords: semantic similarity measure; visualization; extraction | |||
| SIAM: Social Interaction Analysis for Multimedia | | BIBA | Full-Text | 841-844 | |
| Jérôme Picault; Myriam Ribière | |||
| This paper describes the SIAM demonstrator, a system that illustrates the usefulness of indexing multimedia segments thanks to associated microblog posts. From a socialized multimedia content (i.e. video and associated microblog posts on Twitter), the system applies text mining techniques and derives a topic model to index socialized multimedia segments. That result may then be used inside many multimedia applications, such as in-media social navigation, multimedia summarization or composition, or exploration of multimedia collections according to various socially-based viewpoints. | |||
| Exploratory Search on Social Media | | BIBAK | Full-Text | 845-848 | |
| Aaron Russ; Michael Kaisser | |||
| The rise of Social Media creates a wealth of information that can be very valuable for private and professional users alike. But many challenges surrounding this relatively new kind of information remain unsolved. This is true for algorithms that efficiently and intelligently process such data, but also for methods of how users can conveniently access it and how results are displayed. In this paper we present a tool that lets users perform exploratory search on several Social Media sites in parallel. It gives users the opportunity to explore a topic space, and to better understand facets of current discussions. Keywords: Social Media; Exploratory Search | |||
| VisNavi: Citation Context Visualization and Navigation | | BIBAK | Full-Text | 849-852 | |
| Farag Saad; Brigitte Mathiak | |||
| The process of retrieving information for literature review purposes differs from traditional web information retrieval. Literature reviews weigh the importance of retrieved data segments differently. For example, citations and their accompanying information, such as the cited author, the citation context, etc., are a very important consideration when searching for relevant information in the literature. However, this information is embedded in a scientific paper, in rich interrelationships, making it very complicated for standard search systems to present and track it efficiently. In this paper, we demonstrate VisNavi, a system that uses a visualized star-centered approach to present the rich citation interrelationships to searchers in an effective and navigable form. Keywords: digital libraries; citation context; visualization; navigation; information retrieval; text extraction | |||
| Face-Based People Searching in Videos | | BIBA | Full-Text | 853-856 | |
| Jan Sedmidubsky; Michal Batko; Pavel Zezula | |||
| We propose a system for retrieving people by their faces in unannotated video streams. The system processes input videos to extract key-frames on which faces are detected. The detected faces are automatically grouped together to create clusters containing snapshots of the same person. The system also facilitates annotation and manual manipulation of the created clusters. On the processed videos, the system offers three distinct person-search operations applicable to various scenarios. The system is demonstrated online by indexing five high-quality video streams with a total length of nearly five hours. | |||
| Political Hashtag Trends | | BIBA | Full-Text | 857-860 | |
| Ingmar Weber; Venkata Rama Kiran Garimella; Asmelash Teka | |||
| Political Hashtag Trends (PHT) is an analysis tool for the political left-vs.-right polarization of Twitter hashtags. PHT computes a leaning for trending political hashtags in a given week, giving insight into polarizing U.S. issues on Twitter. The leaning of a hashtag is derived in two steps. First, users retweeting a set of "seed users" with a known political leaning, such as Barack Obama or Mitt Romney, are identified, and the corresponding leaning is assigned to the retweeters. Second, a hashtag is assigned a fractional leaning based on which retweeting users used it. Non-political hashtags are removed by requiring certain hashtag co-occurrence patterns. PHT also offers functionality to put the results into context. For example, it shows example tweets from different leanings, displays historic information, and links to the New York Times archives to explore a topic in depth. In this paper, we describe the underlying methodology and the functionality of the demo. | |||
| OPARS: Objective Photo Aesthetics Ranking System | | BIBA | Full-Text | 861-864 | |
| Huang Xiao; Han Xiao; Claudia Eckert | |||
| As the perception of beauty is subjective across individuals, evaluating the objective aesthetic value of an image is a challenging task for image retrieval systems. Unlike current online photo sharing services that take the average rating as the aesthetic score, our system integrates various ratings from different users by jointly modeling images and users' expertise in a regression framework. In the front-end, users are asked to rate images selected by an active learning process. A multi-observer regression model is employed in the back-end to integrate these ratings for predicting the aesthetic value of images. Moreover, the system can be incorporated into current photo sharing services as a complement, providing more accurate ratings. | |||
| Distributed Information Retrieval and Applications | | BIBA | Full-Text | 865-868 | |
| Fabio Crestani; Ilya Markov | |||
| Distributed Information Retrieval (DIR) is a generic area of research that brings together techniques, such as resource selection and results aggregation, dealing with data that, for organizational or technical reasons, cannot be managed centrally. Existing and potential applications of DIR methods vary from blog retrieval to aggregated search and from multimedia and multilingual retrieval to distributed Web search. In this tutorial we briefly discuss the main DIR phases, namely resource description, resource selection, results merging and results presentation. The main focus is on applications of DIR techniques: blog, expert and desktop search, aggregated search and personal meta-search, multimedia and multilingual retrieval. We also discuss a number of potential applications of DIR techniques, such as distributed Web search, enterprise search and aggregated mobile search. | |||
| Searching the Web of Data | | BIBAK | Full-Text | 869-873 | |
| Gerard de Melo; Katja Hose | |||
| Search is currently undergoing a major paradigm shift away from the traditional document-centric "10 blue links" towards more explicit and actionable information. Recent advances in this area are Google's Knowledge Graph, Virtual Personal Assistants such as Siri and Google Now, as well as the now ubiquitous entity-oriented vertical search results for places, products, etc. Apart from novel query understanding methods, these developments are largely driven by structured data that is blended into the Web Search experience. We discuss efficient indexing and query processing techniques to work with large amounts of structured data. Finally, we present query interpretation and understanding methods to map user queries to these structured data sources. Keywords: information retrieval; structured data; Web of Data | |||
| Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval | | BIBAK | Full-Text | 874-877 | |
| Marie-Francine Moens; Ivan Vulić | |||
| Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for their easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models. Keywords: Probabilistic topic models; Cross-lingual retrieval; Ranking models; Cross-lingual text mining | |||
| Practical Online Retrieval Evaluation | | BIBAK | Full-Text | 878-881 | |
| Filip Radlinski; Katja Hofmann | |||
| Online evaluation allows the assessment of information retrieval (IR) techniques based on how real users respond to them. Because this technique is directly based on observed user behavior, it is a promising alternative to traditional offline evaluation, which is based on manual relevance assessments. In particular, online evaluation can enable comparisons in settings where reliable assessments are difficult to obtain (e.g., personalized search) or expensive (e.g., for search by trained experts in specialized collections). Despite its advantages, and its successful use in commercial settings, online evaluation is rarely employed outside of large commercial search engines due to a perception that it is impractical at small scales. The goal of this tutorial is to show how online evaluations can be conducted in such settings, demonstrate software to facilitate its use, and promote further research in the area. We will also contrast online evaluation with standard offline evaluation, and provide an overview of online approaches. Keywords: Interleaving; Clicks; Search Engine; Online Evaluation | |||
| Integrating IR Technologies for Professional Search | | BIBA | Full-Text | 882-885 | |
| Michail Salampasis; Norbert Fuhr; Allan Hanbury; Mihai Lupu; Birger Larsen; Henrik Strindberg | |||
| Professional search in specific domains (e.g. patent, medical, scientific literature, media) usually requires an exploratory type of search which, in comparison to the fact finding and question answering of web search, is more often characterized by recall-oriented information needs and by uncertainty and evolution or change of the information need. Additionally, the complexity of the tasks performed by professional searchers, which usually include not only retrieval but also information analysis and monitoring tasks, requires association, pipelining and possibly integration of information, as well as synchronization and coordination of multiple and potentially concurrent search views produced from different datasets, search tools and UIs. Many facets of IR technology (e.g. exploratory search, aggregated search, federated search, task-based search, IR over query sessions, cognitive IR approaches, Human-Computer Information Retrieval) aim to at least partially address these demands. This workshop aims to stimulate exploratory research, bring together various facets of IR research and promote discussion between researchers towards the development of a generalised framework facilitating the integration of IR technologies and search tools into next generation professional search systems. This envisioned framework should be supported by new protocols or extensions of existing ones, and may influence the design of next generation professional search systems. | |||
| From Republicans to Teenagers -- Group Membership and Search (GRUMPS) | | BIBAK | Full-Text | 886-889 | |
| Ingmar Weber; Djoerd Hiemstra; Pavel Serdyukov | |||
| In the early years of information retrieval, the focus of research was on systems aspects such as crawling, indexing, and relevancy ranking. Over the years, more and more user-related information such as click information or search history has entered the equation, creating increasingly personalized search experiences, though still within the scope of the same overall system. Though fully personalized search is probably desirable, this individualistic perspective does not exploit the fact that a lot of a user's behavior can be explained through their group membership. Children, despite individual differences, share many challenges and needs; as do men, Republicans, Chinese or any user group. This workshop takes a group-centric approach to IR and invites contributions that either (i) propose and evaluate IR systems for a particular user group or (ii) describe how the search behavior of specific groups differs, potentially requiring a different way of addressing their needs. Keywords: information retrieval; user groups; user modeling | |||
| Doctoral Consortium at ECIR 2013 | | BIBAK | Full-Text | 890 | |
| Hideo Joho; Dmitry I. Ignatov | |||
| This is a short description of Doctoral Consortium at ECIR 2013. Keywords: Information Retrieval; Doctoral Consortium | |||