
Proceedings of the 2013 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Editors:Qi He; Arun Iyengar; Wolfgang Nejdl; Jian Pei; Rajeev Rastogi
Location:San Francisco, California
Dates:2013-Oct-27 to 2013-Nov-01
Publisher:ACM
Standard No:ISBN: 978-1-4503-2263-8; ACM DL: Table of Contents; hcibib: CIKM13
Papers:382
Pages:2566
  1. Keynote address
  2. DB track: search
  3. IR track: retrieval models
  4. IR track: entities
  5. KM track: social networks (1)
  6. KM track: mining topics
  7. KM track: pattern mining and applications
  8. DB track: data streams and probabilistic queries
  9. IR track: search engines
  10. IR track: networks
  11. KM track: social networks (2)
  12. KM track: mining big data
  13. KM track: ontologies
  14. KM track: mobile and event mining
  15. IR track: evaluation
  16. IR track
  17. DB track: data streams and ranking
  18. KM track: graphs and networks
  19. KM track: clusters, topics and similarity
  20. DB track: graphs and social networks
  21. IR track: data classification
  22. KM track: networks
  23. KM track: mining reviews and Wiki
  24. IR track: applications I
  25. Poster session: DB+IR track
  26. Industry session
  27. IR track: ranking
  28. KM track: learning and applications (1)
  29. KM track: similarity, clustering, and outlier mining
  30. IR track: applications II
  31. Poster Session: KM track
  32. Industry session
  33. DB track: graphs and storage systems
  34. KM track: social networks and media
  35. KM track: text
  36. IR track
  37. Poster session: IR track
  38. Industry session
  39. DB track: miscellaneous
  40. IR track: users
  41. KM track: extraction and text mining
  42. KM track: community and web mining
  43. KM track: learning and applications (2)
  44. Industry session
  45. DB track: query processing and privacy
  46. IR Track
  47. KM track: entities, tags, and time series
  48. KM track: mining and learning
  49. Demo session
  50. Panel discussion
  51. Co-located workshop summaries

Keynote address

Scholarly big data: information extraction and data mining 1-2
  C. Lee Giles
Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often have many millions of publications, authors, citations, equations, figures, etc., and large scale related data and structures such as social networks, slides, data sets, etc. We discuss scholarly big data challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems that use information extraction and data mining in various areas such as computer science, chemistry, archaeology, acknowledgements, reference recommendation, collaboration recommendation, and others.
Applying theory to practice 3-4
  Ronald Fagin
We discuss the art of applying theory to practice. In particular, we discuss in detail our interactions with two research projects at IBM Almaden: the Garlic project, which built a multimedia database system on top of various existing systems, and the Clio project, which developed tools for converting data from one format to another. We discuss the problems we resolved, and the impact this had both on the Garlic and Clio systems and on the broader scientific community. We draw morals from these interactions, including why theoreticians do better theory by working with system builders, and why system builders build better systems by working with theoreticians. We present the remarkably simple Threshold Algorithm, which is optimal in an extremely strong sense: optimal not just in the worst case, or in the average case, but in every case! The Threshold Algorithm and its variants have applications to numerous areas, including information retrieval, fuzzy and uncertain databases, group recommendation systems, and the semantic web.
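The abstract names the Threshold Algorithm without spelling it out. As a rough illustration only, the Python sketch below follows the commonly described formulation: read several score-sorted lists in parallel, look up each newly seen object's remaining scores by random access, and stop once the k best totals reach the threshold formed from the last scores read. The function name, the dict-based lists, and the assumption that every object appears in every list are choices made for this example, not details taken from the talk.

```python
import heapq


def threshold_algorithm(lists, k, aggregate=sum):
    """Return the top-k (object, total) pairs under a monotone aggregate.

    lists: one dict per attribute, mapping object -> score.
    Assumption for this sketch: every object appears in every dict.
    """
    # Sorted access: each list is read in descending score order.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = set()
    top = []  # min-heap of (total, object), holding at most k entries
    depth = max(len(s) for s in sorted_lists)
    for row in range(depth):
        last_scores = []
        for s in sorted_lists:
            obj, score = s[row]
            last_scores.append(score)
            if obj not in seen:
                seen.add(obj)
                # Random access: fetch this object's score in every list.
                total = aggregate(l[obj] for l in lists)
                heapq.heappush(top, (total, obj))
                if len(top) > k:
                    heapq.heappop(top)
        # No unseen object can have a total above this threshold.
        threshold = aggregate(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(obj, total) for total, obj in sorted(top, reverse=True)]


if __name__ == "__main__":
    by_price = {"a": 9, "b": 8, "c": 1}
    by_rating = {"a": 2, "b": 7, "c": 9}
    print(threshold_algorithm([by_price, by_rating], k=1))
```

In this toy input the scan stops after two rows, because the best total seen so far already matches the threshold.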
Usability in machine learning at scale with GraphLab 5-6
  Carlos Guestrin
Today, machine learning (ML) methods play a central role in industry and science. The growth of the Web and improvements in sensor data collection technology have been rapidly increasing the magnitude and complexity of the ML tasks we must solve. This growth is driving the need for scalable, parallel ML algorithms that can handle "Big Data."
   In this talk, we will focus on: examining common algorithmic patterns in distributed ML methods; qualifying the challenges of implementing these algorithms in real distributed systems; describing computational frameworks for implementing these algorithms at scale; and addressing a significant core challenge to large-scale ML -- enabling the widespread adoption of machine learning beyond experts.
   In the latter part, we will focus mainly on the GraphLab framework, which naturally expresses asynchronous, dynamic graph computations that are key for state-of-the-art ML algorithms. When these algorithms are expressed in our higher-level abstraction, GraphLab will effectively address many of the underlying parallelism challenges, including data distribution, optimized communication, and guaranteeing sequential consistency, a property that is surprisingly important for many ML algorithms. On a variety of large-scale tasks, GraphLab provides 20-100x performance improvements over Hadoop. In recent months, GraphLab has received many tens of thousands of downloads, and is being actively used by a number of startups, companies, research labs and universities.
Structured data in web search 7-8
  Alon Halevy
For the first time since the emergence of the Web, structured data is playing a key role in search engines and is therefore being collected via a concerted effort. Much of this data is being extracted from the Web, which contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that encourages publishing more data sets from governments and other public organizations. The Web also supports new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets.
   I will describe some of the efforts we are conducting at Google to collect structured data, filter the high-quality content, and serve it to our users. These efforts include providing Google Fusion Tables, a service for easily ingesting, visualizing and integrating data; mining the Web for high-quality HTML tables; and contributing these data assets to Google's other services.

DB track: search

One size does not fit all: multi-granularity search of web forums 9-18
  Gayatree Ganu; Amélie Marian
Users rely increasingly on online forums, blogs, and mailing lists to exchange information, practical tips, and stories. Although this type of social interaction has become central to our daily lives and decision-making processes, forums are surprisingly technologically poor: often there is no choice but to browse through massive numbers of posts while looking for specific information. A critical challenge for forum search, then, is to provide results that are as complete as possible, missing no relevant information, without being too broad. In this paper, we address the problem of presenting textual search results in a concise manner to answer user needs. Specifically, we propose a new search approach over free-form text in forums that allows for the search results to be returned at varying granularity levels. We implement a novel hierarchical representation and scoring technique for objects at multiple granularities, taking into account the inherent containment relationship provided by the hierarchy. We also present a score optimization algorithm that efficiently chooses the best k-sized result set while ensuring no overlap between the results. We evaluate the effectiveness of multi-granularity search by conducting extensive user studies and show that a mixed granularity set of results is more relevant to users than standard post-only approaches.
Spatial search for K diverse-near neighbors 19-28
  Gregory Ference; Wang-Chien Lee; Hui-Ju Jung; De-Nian Yang
For many location-based service applications that prefer diverse results, finding locations that are spatially diverse and close in proximity to a query point (e.g., the current location of a user) can be more useful than finding the k nearest neighbors/locations. In this paper, we investigate the problem of searching for the k Diverse-Near Neighbors (kDNNs) in spatial space that is based upon the spatial diversity and proximity of candidate locations to the query point. While employing a conventional distance measure for proximity, we develop a new and intuitive diversity metric based upon the variance of the angles among the candidate locations with respect to the query point. Accordingly, we create a dynamic programming algorithm that finds the optimal kDNNs. Unfortunately, the dynamic programming algorithm, with a time complexity of O(kn^3), incurs excessive computational cost. Therefore, we further propose two heuristic algorithms, namely, Distance-based Browsing (DistBrow) and Diversity-based Browsing (DivBrow) that provide high effectiveness while being efficient by exploring the search space prioritized upon the proximity to the query point and spatial diversity, respectively. Using real and synthetic datasets, we conduct a comprehensive performance evaluation. The results show that DistBrow and DivBrow have superior effectiveness compared to state-of-the-art algorithms while maintaining high efficiency.
Mining a search engine's corpus without a query pool 29-38
  Mingyang Zhang; Nan Zhang; Gautam Das
Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no direct way of accessing it other than using the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner -- specifically, by issuing a small number of search queries through the web interface.
   Almost all existing techniques require a pre-constructed query pool -- i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge (e.g., size, topic, special terms used, etc.) of the corpus, essentially leading to a chicken-and-egg problem. In this paper, we develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with the state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform the latter when the query pool is a mismatch.
G-tree: an efficient index for KNN search on road networks 39-48
  Ruicheng Zhong; Guoliang Li; Kian-Lee Tan; Lizhu Zhou
In this paper we study the problem of kNN search on road networks. Given a query location and a set of candidate objects in a road network, the kNN search finds the k nearest objects to the query location. To address this problem, we propose a balanced search tree index, called G-tree. The G-tree of a road network is constructed by recursively partitioning the road network into sub-networks and each G-tree node corresponds to a sub-network. Inspired by classical kNN search on metric space, we introduce a best-first search algorithm on road networks, and propose an elaborately-designed assembly-based method to efficiently compute the minimum distance from a G-tree node to the query location. G-tree only takes O(|V|log|V|) space, where |V| is the number of vertices in a network, and thus can easily scale up to large road networks with more than 20 million vertices. Experimental results on eight real-world datasets show that our method significantly outperforms state-of-the-art methods, even by 2-3 orders of magnitude.
Efficient parsing-based search over structured data 49-58
  Aditya Parameswaran; Raghav Kaushik; Arvind Arasu
Parsing-based search, i.e., parsing keyword search queries using grammars, is often used to override the traditional "bag-of-words" semantics in web search and enterprise search scenarios. Compared to the "bag-of-words" semantics, the parsing-based semantics is richer and more customizable. While a formalism for parsing-based semantics for keyword search has been proposed in prior work and ad-hoc implementations exist, the problem of designing efficient algorithms to support the semantics is largely unstudied. In this paper, we present a suite of efficient algorithms and auxiliary indexes for this problem. Our algorithms work for a broad class of grammars used in practice, and cover a variety of database matching functions (set- and substring-containment, approximate and exact equality) and scoring functions (to filter and rank different parses). We formally analyze the time complexity of our algorithms and provide an empirical evaluation over real-world data to show that our algorithms scale well with the size of the database and grammar.

IR track: retrieval models

Graph-of-word and TW-IDF: new approach to ad hoc IR 59-68
  François Rousseau; Michalis Vazirgiannis
In this paper, we introduce a novel document representation (graph-of-word) and retrieval model (TW-IDF) for ad hoc IR. Questioning the term independence assumption behind the traditional bag-of-word model, we propose a different representation of a document that captures the relationships between the terms using an unweighted directed graph of terms. From this graph, we extract at indexing time meaningful term weights (TW) that replace traditional term frequencies (TF) and from which we define a novel scoring function, namely TW-IDF, by analogy with TF-IDF. This approach leads to a retrieval model that consistently and significantly outperforms BM25 and in some cases its extension BM25+ on various standard TREC datasets. In particular, experiments show that counting the number of different contexts in which a term occurs inside a document is more effective and relevant to search than considering an overall concave term frequency in the context of ad hoc IR.
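To make the idea of a term weight taken from a graph concrete, here is a minimal Python sketch. It rests on several assumptions the abstract does not state: edges run from each term to the terms that follow it inside a sliding window (the `window` parameter and its default of 4 are hypothetical), the term weight is the in-degree of the term's node, the IDF factor is `log((N + 1) / df)`, and document-length normalization is left out. It should be read as an illustration of the general approach, not as the authors' exact model.

```python
import math
from collections import defaultdict


def term_weights(tokens, window=4):
    """In-degree of each term in an unweighted directed graph-of-word.

    Assumption: a term points to every distinct term that follows it
    within `window` positions; repeated edges are stored only once.
    """
    edges = set()
    for i, src in enumerate(tokens):
        for dst in tokens[i + 1 : i + window]:
            if dst != src:
                edges.add((src, dst))
    indegree = defaultdict(int)
    for _, dst in edges:
        indegree[dst] += 1
    return dict(indegree)


def tw_idf(query, doc_tokens, doc_freq, n_docs, window=4):
    """Score a document by summing TW * IDF over the query terms."""
    tw = term_weights(doc_tokens, window)
    return sum(
        tw.get(term, 0) * math.log((n_docs + 1) / doc_freq[term])
        for term in query
        if term in doc_freq
    )


if __name__ == "__main__":
    doc = ["a", "b", "a", "c"]
    print(term_weights(doc, window=3))
    print(tw_idf(["c"], doc, doc_freq={"c": 1}, n_docs=1, window=3))
```

In the toy document, "a" occurs twice but is preceded by only one distinct term, so its weight is 1, while "c" occurs once and is reached from two distinct terms, so its weight is 2. That is the sense in which the weight counts contexts rather than occurrences.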
Map search via a factor graph model 69-78
  Qi Zhang; Jihua Kang; Yeyun Gong; Huan Chen; Yaqian Zhou; Xuanjing Huang
Map search has received considerable attention in recent years. With map search, users can specify target locations with textual queries. However, these queries do not always include well-formed addresses or place names. They may contain transpositions, misspellings, fragments and so on. Queries may significantly differ from items stored in the spatial database. In this paper, we propose to connect this task to the semi-structured retrieval problem. A novel factor graph-based semi-structured retrieval framework is introduced to incorporate concept weighting, attribute selection, and word-based similarity metrics together. We randomly sampled a number of queries from logs of a commercial map search engine and manually labeled their categories and relevant results for analysis and evaluation. The results of several experimental comparisons demonstrate that our method outperforms both state-of-the-art semi-structured retrieval methods and some commercial systems in retrieving freeform location queries.
A phased ranking model for question answering 79-88
  Rui Liu; Eric Nyberg
We describe a general result ranking approach for multi-phase, multi-strategy information systems, which has been applied to the task of question answering (QA). Many information systems incorporate multiple steps and each step or phase may incorporate multiple component algorithms to achieve acceptable robustness and overall performance. Such systems may produce and rank a large number of candidate results. Prior work includes many models that rank a particular type of information object (e.g. a retrieved document, a factoid answer) using features specific to that information type, without attempting to make use of other non-local features (e.g. features of the upstream information source). We propose an approach that allows each phase in a system to leverage information propagated from preceding phases to inform the ranking decision. This is accomplished by a system object graph which represents all of the objects created during system execution, object dependencies (e.g. provenance), and ranking feature values extracted for a specific object. We evaluate the effectiveness of the proposed ranking approach in a multi-phase question answering system built by recombining pre-existing software modules. Experimental results show that our proposed approach significantly outperforms comparable answer ranking models.
CRF framework for supervised preference aggregation 89-98
  Maksims N. Volkovs; Richard S. Zemel
We develop a flexible Conditional Random Field framework for supervised preference aggregation, which combines preferences from multiple experts over items to form a distribution over rankings. The distribution is based on an energy comprised of unary and pairwise potentials allowing us to effectively capture correlations between both items and experts. We describe procedures for learning in this model, and demonstrate that inference can be done much more efficiently than in analogous models. Experiments on benchmark tasks demonstrate significant performance gains over existing rank aggregation methods.
CQArank: jointly model topics and expertise in community question answering 99-108
  Liu Yang; Minghui Qiu; Swapna Gottipati; Feida Zhu; Jing Jiang; Huiping Sun; Zhong Chen
Community Question Answering (CQA) websites, where people share expertise on open platforms, have become large repositories of valuable knowledge. To bring the best value out of these knowledge repositories, it is critically important for CQA services to know how to find the right experts, retrieve archived similar questions and recommend best answers to new questions. To tackle this cluster of closely related problems in a principled way, we propose the Topic Expertise Model (TEM), a novel probabilistic generative model with GMM hybrid, to jointly model topics and expertise by integrating a textual content model and link structure analysis. Based on TEM results, we propose CQARank to measure user interests and expertise scores under different topics. Leveraging the question answering history based on long-term community reviews and voting, our method can find experts with both similar topical preference and high topical expertise. Experiments carried out on Stack Overflow data, the largest CQA focused on computer programming, show that our method achieves significant improvement over existing methods on multiple metrics.

IR track: entities

Penguins in sweaters, or serendipitous entity search on user-generated content 109-118
  Ilaria Bordino; Yelena Mejova; Mounia Lalmas
In many cases, when browsing the Web, users are searching for specific information or answers to concrete questions. Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further. What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted from two sources of user-generated content -- Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers, a more unconstrained question/answering forum -- in promoting serendipitous search. In this work, the content of each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and the desirable metadata attributes in serendipitous search.
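For readers unfamiliar with the technique the abstract mentions, the following Python sketch shows a generic lazy random walk with restart on a small graph. It is not the authors' algorithm: the parameter names `restart` and `laziness`, their default values, the fixed iteration count, and the choice to keep the walker in place at nodes without neighbors are all assumptions made for illustration.

```python
def lazy_rwr(adj, seed, restart=0.15, laziness=0.5, iters=100):
    """Approximate visiting probabilities of a lazy walk restarting at `seed`.

    adj: dict mapping each node to a list of neighbor nodes.
    At every step the walker jumps back to the seed with probability
    `restart`; otherwise it stays put with probability `laziness` or
    moves to a uniformly chosen neighbor.
    """
    nodes = list(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            mass = (1.0 - restart) * p[u]
            nxt[u] += laziness * mass  # lazy step: stay where we are
            neighbors = adj[u]
            if neighbors:
                share = (1.0 - laziness) * mass / len(neighbors)
                for v in neighbors:
                    nxt[v] += share
            else:
                # Assumption: a node with no neighbors keeps its mass.
                nxt[u] += (1.0 - laziness) * mass
        nxt[seed] += restart  # restart mass, since p sums to 1
        p = nxt
    return p


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    scores = lazy_rwr(graph, seed="a")
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

Nodes other than the seed can then be ranked by score to produce recommendations; in the toy path graph, the node adjacent to the seed scores higher than the node two hops away.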
Entity-centric document filtering: boosting feature mapping through meta-features 119-128
  Mianwei Zhou; Kevin Chen-Chuan Chang
This paper studies the entity-centric document filtering task -- given an entity represented by its identification page (e.g., a Wikipedia page), how to correctly identify its relevant documents. In particular, we are interested in learning an entity-centric document filter based on a small number of training entities, and the filter can predict document relevance for a large set of unseen entities at query time. Towards characterizing the relevance of a document, the problem boils down to learning keyword importance for the query entities. Since the same keyword will have very different importance for different entities, we abstract the entity-centric document filtering problem as a transfer learning problem, and the challenge becomes how to appropriately transfer the keyword importance learned from training entities to query entities. Based on the insight that keywords sharing some similar "properties" should have similar importance for their respective entities, we propose a novel concept of meta-feature to map keywords from different entities. To realize the idea of meta-feature-based feature mapping, we develop and contrast two different models, LinearMapping and BoostMapping. Experiments on three different datasets confirm the effectiveness of our proposed models, which show significant improvement compared with four state-of-the-art baseline methods.
Structured positional entity language model for enterprise entity retrieval 129-138
  Chunliang Lu; Lidong Bing; Wai Lam
We investigate the problem of general entity retrieval for enterprise websites. Our framework transforms the webpage content into a structured content representation, which captures hierarchical information blocks and semi-structured data records information. To facilitate entity retrieval given a user query, we develop a structured positional entity language model suitable for ranking entities extracted from the webpage content incorporating the structured content representation. Different from existing language models for retrieval, our proposed model considers both the proximity and the structured webpage content in a unified manner. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our proposed framework.
Learning relatedness measures for entity linking 139-148
  Diego Ceccarelli; Claudio Lucchese; Salvatore Orlando; Raffaele Perego; Salvatore Trani
Entity Linking is the task of detecting, in text documents, relevant mentions to entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowledge base. The most important of such features is entity relatedness. Indeed, we argue that these algorithms benefit from maximizing the relatedness among the relevant entities selected for annotation, since this minimizes errors in disambiguating entity-linking.
   The definition of an effective relatedness function is thus a crucial point in any entity-linking algorithm. In this paper we address the problem of learning high quality entity relatedness functions. First, we formalize the problem of learning entity relatedness as a learning-to-rank problem. We propose a methodology to create reference datasets on the basis of manually annotated data. Finally, we show that our machine-learned entity relatedness function performs better than other relatedness functions previously proposed, and, more importantly, improves the overall performance of different state-of-the-art entity-linking algorithms.
Gem-based entity-knowledge maintenance 149-158
  Bilyana Taneva; Gerhard Weikum
Knowledge bases about entities have become a vital asset for Web search, recommendations, and analytics. Examples are Freebase being the core of the Google Knowledge Graph and the use of Wikipedia for distant supervision in numerous IR and NLP tasks. However, maintaining the knowledge about not so prominent entities in the long tail is often a bottleneck as human contributors face the tedious task of continuously identifying and reading relevant sources. To overcome this limitation and accelerate the maintenance of knowledge bases, we propose an approach that automatically extracts, from the Web, key contents for given input entities.
   Our method, called GEM, generates salient contents about a given entity, using minimal assumptions about the underlying sources, while meeting the constraint that the user is willing to read only a certain amount of information. Salient content pieces have variable length and are computed using a budget-constrained optimization problem which decides upon which sub-pieces of an input text should be selected for the final result. GEM can be applied to a variety of knowledge-gathering settings including news streams and speech input from videos. Our experimental studies show the viability of the approach, and demonstrate improvements over various baselines, in terms of precision and recall.

KM track: social networks (1)

Predicting user activity level in social networks 159-168
  Yin Zhu; Erheng Zhong; Sinno Jialin Pan; Xiao Wang; Minzhe Zhou; Qiang Yang
The study of users' social behaviors has gained much research attention since the advent of various social media such as Facebook, Renren and Twitter. A major kind of application is to predict a user's future activities based on his/her historical social behaviors. In this paper, we focus on a fundamental task: to predict a user's future activity levels in a social network, e.g. weekly activeness, active or inactive. This problem is closely related to Social Customer Relationship Management (Social CRM). Compared to traditional CRM, three properties of social networks (user diversity, social influence, and dynamic nature) raise new challenges and opportunities for Social CRM. Firstly, the user diversity property implies that a global predictive model may not be precise for all users. On the other hand, historical data of individual users are too sparse to build precisely personalized models. Secondly, the social influence property suggests that relationships between users can be embedded to further boost prediction results on individual users. Finally, the dynamic nature of social networks means that users' behaviors may keep changing over time. To address these challenges, we develop a personalized and social regularized time-decay model for user activity level prediction. Experiments on the social media Renren validate the effectiveness of our proposed model compared with some baselines including traditional supervised learning methods and node classification methods in social networks.
On popularity prediction of videos shared in online social networks 169-178
  Haitao Li; Xiaoqiang Ma; Feng Wang; Jiangchuan Liu; Ke Xu
Popularity prediction, with both technological and economic importance, has been extensively studied for conventional video sharing sites (VSSes), where the videos are mainly found via searching, browsing, or related links. Recent statistics however suggest that online social network (OSN) users regularly share video contents from VSSes, which has contributed to a significant portion of the accesses; yet the popularity prediction in this new context remains largely unexplored. In this paper, we present an initial study on the popularity prediction of videos propagated in OSNs along friendship links.
   We conduct a large-scale measurement and analysis of viewing patterns of videos shared in one of the largest OSNs in China, and examine the performance of typical views-based prediction models. We find that they are generally ineffective, and sometimes fail entirely, especially when predicting the early peaks and later bursts of accesses, which are common during video propagations in OSNs. To overcome these limits, we track the propagation process of videos shared in a Facebook-like OSN in China, and analyze the user viewing and sharing behaviors. We accordingly develop a novel propagation-based video popularity prediction solution, namely SoVP. Instead of relying solely on the early views for prediction, SoVP considers both the intrinsic attractiveness of a video and the influence from the underlying propagation structure. The effectiveness of SoVP, particularly for predicting the peaks and bursts, has been validated through our trace-driven experiments.
Inferring anchor links across multiple heterogeneous social networks 179-188
  Xiangnan Kong; Jiawei Zhang; Philip S. Yu
Online social networks can often be represented as heterogeneous information networks containing abundant information about who, where, when and what. Nowadays, people are usually involved in multiple social networks simultaneously. The multiple accounts of the same user in different networks are mostly isolated from each other without any connection between them. Discovering the correspondence of these accounts across multiple social networks is a crucial prerequisite for many interesting inter-network applications, such as link recommendation and community analysis using information from multiple networks. In this paper, we study the problem of anchor link prediction across multiple heterogeneous social networks, i.e., discovering the correspondence among different accounts of the same user. Unlike most prior work on link prediction and network alignment, we assume that the anchor links are one-to-one relationships (i.e., no two edges share a common endpoint) between the accounts in two social networks, and a small number of anchor links are known beforehand. We propose to extract heterogeneous features from multiple heterogeneous networks for anchor link prediction, including users' social, spatial, temporal and text information. Then we formulate the inference problem for anchor links as a stable matching problem between the two sets of user accounts in two different networks. An effective solution, MNA (Multi-Network Anchoring), is derived to infer anchor links w.r.t. the one-to-one constraint. Extensive experiments on two real-world heterogeneous social networks show that our MNA model consistently outperforms other commonly-used baselines on anchor link prediction.
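The stable matching formulation can be illustrated with a short Python sketch. It assumes that some classifier has already produced a score for every pair of accounts (the `scores` matrix below is made-up data) and then runs a textbook deferred-acceptance procedure in which both sides prefer higher scores. The function name and data layout are hypothetical, and the sketch is not the MNA method itself, whose details the abstract does not give.

```python
def stable_match(scores):
    """One-to-one matching of accounts in network A to accounts in network B.

    scores[i][j]: assumed confidence that account i in A and account j in B
    belong to the same user. Both sides prefer higher scores.
    Returns a sorted list of (i, j) pairs.
    """
    n_a, n_b = len(scores), len(scores[0])
    # Each A-account ranks the B-accounts from best to worst score.
    prefs = [sorted(range(n_b), key=lambda j: -scores[i][j]) for i in range(n_a)]
    next_choice = [0] * n_a  # index of the next B-account each i proposes to
    partner_of_b = {}
    free = list(range(n_a))
    while free:
        i = free.pop()
        if next_choice[i] >= n_b:
            continue  # i has proposed to everyone and stays unmatched
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = partner_of_b.get(j)
        if current is None:
            partner_of_b[j] = i
        elif scores[i][j] > scores[current][j]:
            partner_of_b[j] = i  # j trades up; the old partner is free again
            free.append(current)
        else:
            free.append(i)  # rejected; i will try its next choice
    return sorted((i, j) for j, i in partner_of_b.items())


if __name__ == "__main__":
    scores = [
        [0.90, 0.80],
        [0.85, 0.10],
    ]
    print(stable_match(scores))
```

In the toy matrix both A-accounts score highest against the first B-account, so picking the best score per row would link two accounts to the same target. The matching step resolves the conflict and enforces the one-to-one constraint.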
Community-based user recommendation in uni-directional social networks 189-198
  Gang Zhao; Mong Li Lee; Wynne Hsu; Wei Chen; Haoji Hu
Advances in Web 2.0 technology have led to the rising popularity of many social network services. For example, there are over 500 million active users in Twitter. Given the huge number of users, user recommendation has gained importance where the goal is to find a set of users whom a target user is likely to follow. Content-based approaches that rely on tweet content for user recommendation have low precision as tweet contents are typically short and noisy, while collaborative filtering approaches that utilize follower-followee relationships lead to higher precision but data sparsity remains a challenge. In this work, we propose a community-based approach to user recommendation in Twitter-style social networks. Forming communities enables us to reduce data sparsity as the focus is on discovering the latent characteristics of communities instead of individuals. We employ an LDA-based method on the follower-followee relationships to discover communities before applying the state-of-the-art matrix factorization method on each of the communities. This approach proves effective in improving the conversion rate (by as much as 20%) as demonstrated by the results of extensive experiments on two real-world data sets, Twitter and Weibo. In addition, the community-based approach is scalable as the individual community can be analyzed separately.
Personalized influence maximization on social networks 199-208
  Jing Guo; Peng Zhang; Chuan Zhou; Yanan Cao; Li Guo
In this paper, we study a new problem on social network influence maximization. The problem is defined as, given a target user $w$, finding the top-k most influential nodes for the user. Unlike existing influence maximization work, which aims to find a small subset of nodes to maximize the spread of influence over the entire network (i.e., global optima), our problem aims to find a small subset of nodes which can maximize the influence spread to a given target user (i.e., local optima). The solution is critical for personalized services on social networks, where a full understanding of each specific user is essential. Although some global influence maximization models can be narrowed down as the solution, these methods are often biased toward the target node itself. To this end, in this paper we present a local influence maximization solution. We first provide a random function, with a low-variance guarantee, to randomly simulate the objective function of local influence maximization. Then, we present efficient algorithms with approximation guarantees. For online social network applications, we also present a scalable approximate algorithm by exploring the local cascade structure of the target user. We test the proposed algorithms on several real-world social networks. Experimental results validate the performance of the proposed algorithms.

KM track: mining topics

Discovering coherent topics using general knowledge 209-218
  Zhiyuan Chen; Arjun Mukherjee; Bing Liu; Meichun Hsu; Malu Castellanos; Riddhiman Ghosh
Topic models have been widely used to discover latent topics in text documents. However, they may produce topics that are not interpretable for an application. Researchers have proposed to incorporate prior domain knowledge into topic models to help produce coherent topics. The knowledge used in existing models is typically domain dependent and assumed to be correct. However, one key weakness of this knowledge-based approach is that it requires the user to know the domain very well and to be able to provide knowledge suitable for the domain, which is not always the case because in most real-life applications, the user wants to find what they do not know. In this paper, we propose a framework to leverage the general knowledge in topic models. Such knowledge is domain independent. Specifically, we use one form of general knowledge, i.e., lexical semantic relations of words such as synonyms, antonyms and adjective attributes, to help produce more coherent topics. However, there is a major obstacle, i.e., a word can have multiple meanings/senses and each meaning often has a different set of synonyms and antonyms. Not every meaning is suitable or correct for a domain. Wrong knowledge can result in poor quality topics. To deal with wrong knowledge, we propose a new model, called GK-LDA, which is able to effectively exploit the knowledge of lexical relations in dictionaries. To the best of our knowledge, GK-LDA is the first such model that can incorporate the domain independent knowledge. Our experiments using online product reviews show that GK-LDA performs significantly better than existing state-of-the-art models.
Spatio-temporal and events based analysis of topic popularity in Twitter 219-228
  Sebastien Ardon; Amitabha Bagchi; Anirban Mahanti; Amit Ruhela; Aaditeshwar Seth; Rudra Mohan Tripathy; Sipat Triukose
We present the first comprehensive characterization of the diffusion of ideas on Twitter, studying more than 5.96 million topics that include both popular and less popular topics. On a data set containing approximately 10 million users and a comprehensive scraping of 196 million tweets, we perform a rigorous temporal and spatial analysis, investigating the time-evolving properties of the subgraphs formed by the users discussing each topic. We focus on two different notions of the spatial: the network topology formed by follower-following links on Twitter, and the geospatial location of the users. We investigate the effect of initiators on the popularity of topics and find that users with a high number of followers have a strong impact on topic popularity. We deduce that topics become popular when disjoint clusters of users discussing them begin to merge and form one giant component that grows to cover a significant fraction of the network. Our geospatial analysis shows that highly popular topics are those that cross regional boundaries aggressively.
Domain-dependent/independent topic switching model for online reviews with numerical ratings 229-238
  Yasutoshi Ida; Takuma Nakamura; Takashi Matsumoto
We propose a domain-dependent/independent topic switching model based on Bayesian probabilistic modeling for modeling online product reviews that are accompanied by numerical ratings provided by users. In this model, each word is allocated to a domain-dependent topic or a domain-independent topic, and the distribution of topics in an online review is connected to an observed numerical rating via a linear regression model. Domain-dependent topics utilize domain information observed with a corpus, and domain-independent topics utilize the framework of Bayesian nonparametrics, which can estimate the number of topics in posterior distributions. The posterior distribution is estimated via collapsed Gibbs sampling. Using real data, our proposed model had smaller mean square error and smaller average mean error with a small model size and achieved convergence in fewer iterations for a regression task involving online review ratings, outperforming a baseline model that did not consider domains. Moreover, the proposed model can also tell us whether the words are positive or negative in the form of continuous values. This feature allows us to extract domain-dependent and -independent sentiment words.
A partially supervised cross-collection topic model for cross-domain text classification 239-248
  Yang Bao; Nigel Collier; Anindya Datta
Cross-domain text classification aims to automatically train a precise text classifier for a target domain by using labelled text data from a related source domain. To this end, one of the most promising ideas is to induce a new feature representation so that the distributional difference between domains can be reduced and a more accurate classifier can be learned in this new feature space. However, most existing methods do not explore the duality of the marginal distribution of examples and the conditional distribution of class labels given labeled training examples in the source domain. Besides, few previous works attempt to explicitly distinguish the domain-independent and domain-specific latent features and align the domain-specific features to further improve the cross-domain learning. In this paper, we propose a model called Partially Supervised Cross-Collection LDA topic model (PSCCLDA) for cross-domain learning with the purpose of addressing these two issues in a unified way. Experimental results on nine datasets show that our model outperforms two standard classifiers and four state-of-the-art methods, which demonstrates the effectiveness of our proposed model.
Content coverage maximization on word networks for hierarchical topic summarization 249-258
  Chi Wang; Xiao Yu; Yanen Li; Chengxiang Zhai; Jiawei Han
This paper studies text summarization by extracting hierarchical topics from a given collection of documents. We propose a new approach to text modeling via network analysis. We convert documents into a word influence network, and find the words summarizing the major topics with an efficient influence maximization algorithm. In addition, the influence capability of the topic words on other words in the network reveals the relations among the topic words. Then we cluster the words and build hierarchies for the topics. Experiments on large collections of Web documents show that a simple method based on the influence analysis is effective, compared with existing generative topic modeling and random walk based ranking.

KM track: pattern mining and applications

Mining frequent neighborhood patterns in a large labeled graph 259-268
  Jialong Han; Ji-Rong Wen
Over the years, frequent subgraphs have been an important kind of targeted pattern in pattern mining research, where most approaches deal with databases holding a number of graph transactions, e.g., the chemical structures of compounds. These methods rely heavily on the downward-closure property (DCP) of the support measure to ensure an efficient pruning of the candidate patterns. When switching to the emerging scenario of single-graph databases such as Google's Knowledge Graph and Facebook's social graph, the traditional support measure turns out to be trivial (either 0 or 1). However, to the best of our knowledge, all attempts to redefine a single-graph support have resulted in measures that either lose DCP, or are no longer semantically intuitive. This paper targets pattern mining in the single-graph setting. We propose mining a new class of patterns called frequent neighborhood patterns, which is free from the "DCP-intuitiveness" dilemma of mining frequent subgraphs in a single graph. A neighborhood is a specific topological pattern in which a vertex is embedded, and the pattern is frequent if it is shared by a large portion (above a given threshold) of vertices. We show that the new patterns not only maintain DCP, but also have equally significant interpretations as subgraph patterns. Experiments on real-life datasets support the feasibility of our algorithms on relatively large graphs, as well as the capability of mining interesting knowledge that is not discovered by prior methods.
A two-phase algorithm for mining sequential patterns with differential privacy 269-278
  Luca Bonomi; Li Xiong
Frequent sequential pattern mining is a central task in many fields such as biology and finance. However, release of these patterns is raising increasing concerns about individual privacy. In this paper, we study the sequential pattern mining problem under the differential privacy framework, which provides formal and provable guarantees of privacy. Due to the nature of the differential privacy mechanism, which perturbs the frequency results with noise, and the high dimensionality of the pattern space, this mining problem is particularly challenging. In this work, we propose a novel two-phase algorithm for mining both prefixes and substring patterns. In the first phase, our approach takes advantage of the statistical properties of the data to construct a model-based prefix tree which is used to mine prefixes and a candidate set of substring patterns. The frequency of the substring patterns is further refined in the successive phase where we employ a novel transformation of the original data to reduce the perturbation noise. Extensive experimental results using real datasets showed that our approach is effective for mining both substring and prefix patterns in comparison to the state-of-the-art solutions.
Mining diabetes complication and treatment patterns for clinical decision support 279-288
  Lu Liu; Jie Tang; Yu Cheng; Ankit Agrawal; Wei-keng Liao; Alok Choudhary
The fast development of hospital information systems (HIS) produces a large volume of electronic medical records, which provides a comprehensive source for exploratory analysis and statistics to support clinical decision-making. In this paper, we investigate how to utilize the heterogeneous medical records to aid the clinical treatment of diabetes mellitus. Diabetes mellitus, or simply diabetes, is a group of metabolic diseases, which is often accompanied by many complications. We propose a Symptom-Diagnosis-Treatment model to mine the diabetes complication patterns and to unveil the latent association mechanism between treatments and symptoms from a large volume of electronic medical records. Furthermore, we study the demographic statistics of the patient population w.r.t. complication patterns in real data and observe several interesting phenomena. The discovered complication and treatment patterns can help physicians better understand their specialty and learn from previous experiences. Our experiments on a collection of one-year diabetes clinical records from a famous geriatric hospital demonstrate the effectiveness of our approaches.
Mining-based compression approach of propositional formulae 289-298
  Said Jabbour; Lakhdar Sais; Yakoub Salhi; Takeaki Uno
In this paper, we propose a first application of data mining techniques to propositional satisfiability. Our proposed mining based compression approach aims to discover and to exploit hidden structural knowledge for reducing the size of propositional formulae in conjunctive normal form (CNF). It combines both frequent itemset mining techniques and Tseitin's encoding for a compact representation of CNF formulae. The experimental evaluation of our approach shows interesting reductions of the sizes of many application instances taken from the last SAT competitions.
Correlating medical-dependent query features with image retrieval models using association rules 299-308
  Hajer Ayadi; Mouna Torjmen; Mariam Daoud; Maher Ben Jemaa; Jimmy Xiangji Huang
The increasing quantities of available medical resources have motivated the development of effective search tools and medical decision support systems. Medical image search tools help physicians in searching medical image datasets for diagnosing a disease or monitoring the stage of a disease given a patient's previous image screenings. Image retrieval models are classified into three categories: content-based (visual), textual and combined models. In most previous work, a unique image retrieval model is applied for any user formulated query independently of what retrieval model best suits the information need behind the query. The main challenge in medical image retrieval is to cope with the semantic gap between user information needs and retrieval models. In this paper, we propose a novel approach for finding correlations between medical query features and retrieval models based on association rule mining. We define new medical-dependent query features such as image modality and presence of specific medical image terminology and make use of existing generic query features such as query specificity, ambiguity and cohesiveness. The proposed query features are then exploited in association rule mining for discovering rules which correlate query features to visual, textual or combined image retrieval models. Based on the discovered rules, we propose to use an associative classifier that finds the best suitable rule with a maximum feature coverage for a new query. Experiments are performed on ImageCLEF queries from 2008 to 2012 where we evaluate the impact of our proposed query features on the classification performance. Results show that combining our proposed specific and generic query features is effective for classifying queries. A comparative study between our classifier, CBA, Naïve Bayes, Bayes Net and decision trees showed that our best coverage associative classifier outperforms existing classifiers where it achieves an improvement of 30%.

DB track: data streams and probabilistic queries

Local correlation detection with linearity enhancement in streaming data 309-318
  Qing Xie; Shuo Shang; Bo Yuan; Chaoyi Pang; Xiangliang Zhang
This paper addresses the challenges in detecting the potential correlation between numerical data streams, which facilitates the research of data stream mining and pattern discovery. We focus on local correlation with delay, which may occur in bursts at different times in different streams, and last for a limited period. The uncertainty about the correlation occurrence and the time delay makes it difficult to monitor the correlation online. Furthermore, the conventional correlation measure lacks the ability to reflect visual linearity, which is more desirable in reality. This paper proposes effective methods to continuously detect the correlation between data streams. Our approach is based on the Discrete Fourier Transform to make rapid cross-correlation calculation with time delay allowed. In addition, we introduce a shape-based similarity measure into the framework, which refines the results by representative trend patterns to enhance the significance of linearity. The similarity of proposed linear representations can quickly estimate the correlation, and the window sliding strategy at the segment level improves the efficiency for online detection. The empirical study demonstrates the accuracy of our detection approach, as well as more than 30% improvement of efficiency.
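As a rough illustration of computing cross-correlation with a time delay through the Discrete Fourier Transform, the sketch below evaluates the unnormalized cross-correlation of two equal-length windows at every lag in one pass and picks the best one. It is not the paper's method: the function names, the plain O(n^2) DFT (a real system would use an FFT), and the absence of mean-centering or normalization are all simplifying assumptions.

```python
import cmath

def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out

def best_delay(x, y, max_delay):
    """Return the lag d in [0, max_delay] maximizing sum_t x[t] * y[t + d].

    Assumes len(x) == len(y) and max_delay < len(x).
    """
    n = len(x)
    # Zero-pad to twice the length so circular wrap-around cannot contribute.
    xp = list(x) + [0.0] * n
    yp = list(y) + [0.0] * n
    fx, fy = dft(xp), dft(yp)
    # Correlation theorem: IDFT(conj(X) * Y)[d] = sum_t x[t] * y[t + d].
    cross = dft([a.conjugate() * b for a, b in zip(fx, fy)], inverse=True)
    return max(range(max_delay + 1), key=lambda d: cross[d].real)
```

For example, if `y` is a copy of a short burst in `x` shifted three steps later, `best_delay(x, y, 5)` recovers the shift of 3.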
Efficient processing of streaming graphs for evolution-aware clustering 319-328
  Mindi Yuan; Kun-Lung Wu; Gabriela Jacques-Silva; Yi Lu
The clustering of vertices often evolves with time in a streaming graph, where graph update events are given as a stream of edge (vertex) insertions and deletions. Although a sliding window in stream processing naturally captures some cluster evolution, it alone may not be adequate, especially if the window size is large and the clustering within the windowed stream is unstable. Prior graph clustering approaches are mostly insensitive to clustering evolution. In this paper, we present an efficient approach to processing streaming graphs for evolution-aware clustering (EAC) of vertices. We incrementally manage individual connected components as clusters subject to a constraint on the maximal cluster size. For each cluster, we keep the relative recency of edges in a sorted order and favor more recent edges in clustering. We evaluate the effectiveness of EAC and compare it with a previous state-of-the-art evolution-insensitive clustering (EIC) approach. The results show that EAC is both effective and efficient in capturing evolution in a streaming graph. Moreover, we implement EAC as a streaming graph operator on IBM's InfoSphere Streams, a large-scale distributed middleware for stream processing, and show snapshots of the user cluster evolution in a streaming Twitter mention graph.
Searching similar segments over textual event sequences 329-338
  Liang Tang; Tao Li; Shu-Ching Chen; Shunzhi Zhu
Sequential data is prevalent in many scientific and commercial applications such as bioinformatics, system security and networking. Similarity search has been widely studied for symbolic and time series data in which each data object is a symbol or numeric value. Textual event sequences are sequences of events, where each object is a message describing an event. For example, system logs are typical textual event sequences and each event is a textual message recording internal system operations, statuses, configuration modifications or execution errors. Similar segments of an event sequence reveal similar system behaviors in the past, which are helpful for system administrators to diagnose system problems. Existing search indexing for textual data only focuses on unordered data. Substring matching methods are able to efficiently find matched segments over a sequence; however, their sequences are single values rather than texts. In this paper, we propose a method, suffix matrix, for efficiently searching similar segments over textual event sequences. It provides an integration of two disparate techniques: locality-sensitive hashing and suffix arrays. This method also supports the k-dissimilar segment search. A k-dissimilar segment is a segment that has at most k dissimilar events to the query sequence. By using the random sequence mask proposed in this paper, this method can have a high probability of reaching all k-dissimilar segments without increasing the search cost much. We conduct experiments on real system log data and the experimental results show that our proposed method outperforms alternative methods using existing techniques.
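The definition of a k-dissimilar segment can be made concrete with a brute-force baseline. The sketch below is not the suffix matrix method: it simply slides a window over the event sequence and counts dissimilar positions. The word-level Jaccard similarity, the 0.5 threshold, and the function names are assumptions chosen only to illustrate the definition.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two event messages."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def k_dissimilar_segments(events, query, k, threshold=0.5):
    """Return start offsets of segments with at most k dissimilar events.

    Two events count as dissimilar when their Jaccard similarity falls
    below `threshold` (a hypothetical choice for this sketch).
    """
    m = len(query)
    hits = []
    for start in range(len(events) - m + 1):
        window = events[start:start + m]
        mismatches = sum(1 for e, q in zip(window, query)
                         if jaccard(e, q) < threshold)
        if mismatches <= k:
            hits.append(start)
    return hits
```

This scan costs time proportional to the sequence length times the query length for every query, which is the kind of cost an index built on locality-sensitive hashing and suffix arrays is meant to avoid.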
RWS-Diff: flexible and efficient change detection in hierarchical data 339-348
  Jan P. Finis; Martin Raiber; Nikolaus Augsten; Robert Brunel; Alfons Kemper; Franz Färber
The problem of generating a cost-minimal edit script between two trees has many important applications. However, finding such a cost-minimal script is computationally hard, thus the only methods that scale are approximate ones. Various approximate solutions have been proposed recently. However, most of them still show quadratic or worse runtime complexity in the tree size and thus do not scale well either. The only solutions with log-linear runtime complexity use simple matching algorithms that only find corresponding subtrees as long as these subtrees are equal. Consequently, such solutions are not robust at all, since small changes in the leaves which occur frequently can make all subtrees that contain the changed leaves unequal and thus prevent the matching of large portions of the trees. This problem could be avoided by searching for similar instead of equal subtrees but current similarity approaches are too costly and thus also show quadratic complexity. Hence, currently no robust log-linear method exists.
   We propose the random walks similarity (RWS) measure which can be used to find similar subtrees rapidly. We use this measure to build the RWS-Diff algorithm that is able to compute an approximately cost-minimal edit script in log-linear time while having the robustness of a similarity-based approach. Our evaluation reveals that random walk similarity indeed increases edit script quality and robustness drastically while still maintaining a runtime comparable to simple matching approaches.
Causality and responsibility: probabilistic queries revisited in uncertain databases 349-358
  Xiang Lian; Lei Chen
Recently, due to ubiquitous data uncertainty in many real-life applications, it has become increasingly important to study efficient and effective processing of various probabilistic queries over uncertain data, which usually retrieve uncertain objects that satisfy query predicates with high probabilities. However, one annoying, yet challenging, problem is that some probabilistic queries are very sensitive to low-quality objects in uncertain databases, and the returned query answers might miss some important results (due to low data quality). To identify both accurate query answers and those potentially low-quality objects, in this paper, we investigate the causes of query answers/non-answers from a novel angle of causality and responsibility (CR), and propose a new interpretation of probabilistic queries. In particular, we focus on the problem of the CR-based probabilistic nearest neighbor (CR-PNN) query, and design a general framework for answering CR-based queries (including CR-PNN), which can return both query answers with high confidence and low-quality objects that may potentially affect query results (for data cleaning purposes). To efficiently process CR-PNN queries, we propose effective pruning strategies to quickly filter out false alarms, and design efficient algorithms to obtain CR-PNN answers. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed approaches.

IR track: search engines

Locality sensitive hashing for scalable structural classification and clustering of web documents 359-368
  Christian Hachenberg; Thomas Gottron
Web content management systems as well as web front ends to databases usually use mechanisms based on homogeneous templates for generating and populating HTML documents containing structured, semi-structured or plain text data. Wrapper based information extraction techniques leverage such templates as an essential cornerstone of their functionality but rely heavily on the availability of proper training documents based on the specific template. Thus, structural classification and structural clustering of web documents is an important contributing factor to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality sensitive hash values in the form of short sequences of characters which effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template based documents, lead to no or only minor variations in the corresponding fingerprint. Based on the fingerprints we introduce a scalable index structure and algorithm for large collections of web documents, which can retrieve structurally similar documents efficiently. The effectiveness of our approach is empirically validated in a classification task on a data set of 13,237 documents based on 50 templates from different domains. The general efficiency and scalability is evaluated in a clustering task on a data set retrieved from the Open Directory Project comprising more than 3.6 million web documents. For both tasks, our template fingerprint approach provides results of high quality and demonstrates a linear runtime of O(n) w.r.t. the number of documents.
An index for efficient semantic full-text search 369-378
  Hannah Bast; Björn Buchhold
In this paper we present a novel index data structure tailored towards semantic full-text search. Semantic full-text search, as we call it, deeply integrates keyword-based full-text search with structured search in ontologies. Queries are SPARQL-like, with additional relations for specifying word-entity co-occurrences. In order to build such queries the user needs to be guided. We believe that incremental query construction with context-sensitive suggestions in every step serves that purpose well. Our index has to answer queries and provide such suggestions in real time. We achieve this through a novel kind of posting lists and query processing, avoiding very long (intermediate) result lists and expensive (non-local) operations on these lists. In an evaluation of 8000 queries on the full English Wikipedia (40 GB XML dump) and the YAGO ontology (26.6 million facts), we achieve average query and suggestion times of around 150ms.
Load-sensitive selective pruning for distributed search 379-388
  Daniele Broccolo; Craig Macdonald; Salvatore Orlando; Iadh Ounis; Raffaele Perego; Fabrizio Silvestri; Nicola Tonellotto
A search engine infrastructure must be able to provide the same quality of service to all queries received during a day. During normal operating conditions, the demand for resources is considerably lower than under peak conditions, yet an oversized infrastructure would result in an unnecessary waste of computing power. A possible solution adopted in this situation might consist of defining a maximum threshold processing time for each query, and dropping queries for which this threshold elapses, leading to disappointed users. In this paper, we propose and evaluate a different approach, where, given a set of different query processing strategies with differing efficiency, each query is considered by a framework that sets a maximum query processing time and selects which processing strategy is the best for that query, such that the processing time for all queries is kept below the threshold. The processing time estimates used by the scheduler are learned from past queries. We experimentally validate our approach on 10,000 queries from a standard TREC dataset with over 50 million documents, and we compare it with several baselines. These experiments encompass testing the system under different query loads and different maximum tolerated query response times. Our results show that, at the cost of a marginal loss in terms of response quality, our search system is able to answer 90% of queries within half a second during times of high query volume.
Rank-energy selective query forwarding for distributed search systems 389-398
  Amin Teymorian; Ophir Frieder; Marcus A. Maloof
Scaling high-quality, cost-efficient query evaluation is critical to search system performance. Although partial indexes reduce query processing times, result quality may be jeopardized due to exclusion of relevant non-local documents. Selectively forwarding queries between geographically distributed search sites may help. The basic idea of query forwarding is that after a local site receives a query, it determines non-local sites to forward the query to and returns an aggregation of the local and non-local results. Nevertheless, electricity costs remain substantial sources of operating expenses. We present a hybrid rank-energy query forwarding model termed "RESQ." The novel contribution is to simultaneously consider both ranking quality and spatially-temporally varying energy prices when making forwarding decisions. Experiments with a large-scale query log, publicly-available electricity price data, and real search site locations demonstrate that query forwarding under RESQ achieves the result scalability of partial indexes with the cost savings of energy-aware approaches (e.g., an 87% ranking guarantee with a 46% savings in energy costs).
Augmenting web search surrogates with images 399-408
  Robert Capra; Jaime Arguello; Falk Scholer
While images are commonly used in search result presentation for vertical domains such as shopping and news, web search results surrogates remain primarily text-based. In this paper, we present results of two large-scale user studies to examine the effects of augmenting text-based surrogates with images extracted from the underlying webpage. We evaluate effectiveness and efficiency at both the individual surrogate level and at the results page level. Additionally, we investigate the influence of two factors: the goodness of the image in terms of representing the underlying page content, and the diversity of the results on a results page. Our results show that at the individual surrogate level, good images provide only a small benefit in judgment accuracy versus text-only surrogates, with a slight increase in judgment time. At the results page level, surrogates with good images had similar effectiveness and efficiency compared to the text-only condition. However, in situations where the results page items had diverse senses, surrogates with images had higher click precision versus text-only ones. Results of these studies show tradeoffs in the use of images in web search surrogates, and highlight particular situations where they can provide benefits.

IR track: networks

Building a large-scale corpus for evaluating event detection on Twitter 409-418
  Andrew J. McMinn; Yashar Moshfeghi; Joemon M. Jose
Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of the tweets and the difficulty of creating relevance judgements at such a large scale. The difficulty of creating relevance judgements for the task of event detection is further compounded by ambiguity in the definition of event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter, and using this definition, we generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we make use of existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets, covering more than 500 events, which can be used for the evaluation of event detection approaches.
On sparsity and drift for effective real-time filtering in microblogs 419-428
  M-Dyaa Albakour; Craig Macdonald; Iadh Ounis
In this paper, we approach the problem of real-time filtering in the Twitter Microblogging platform. We adapt an effective traditional news filtering technique, which uses a text classifier inspired by Rocchio's relevance feedback algorithm, to build and dynamically update a profile of the user's interests in real-time. In our adaptation, we tackle two challenges that are particularly prevalent in Twitter: sparsity and drift. In particular, sparsity stems from the brevity of tweets, while drift occurs as events related to the topic develop or the interests of the user change. First, to tackle the acute sparsity problem, we apply query expansion to derive terms or related tweets for a richer initialisation of the user interests within the profile. Second, to deal with drift, we modify the user profile to balance between the importance of the short-term interests, i.e. emerging subtopics, and the long-term interests in the overall topic. Moreover, we investigate an event detection method from Twitter and newswire streams to predict times at which drift may happen. Through experiments using the TREC Microblog track 2012, we show that our approach is effective for a number of common filtering metrics such as the user's utility, and that it compares favourably with state-of-the-art news filtering baselines. Our results also uncover the impact of different factors on handling topic drifting.
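Illustrative sketch (not taken from the paper): the abstract describes a Rocchio-inspired profile that is updated as tweets arrive. The snippet below shows a generic Rocchio-style update on a term-weight profile. The function names and the `alpha`, `beta`, `gamma` values are assumptions chosen for illustration; setting `alpha` below 1 is one simple, hypothetical way to let older (long-term) weights decay in favour of recent feedback.

```python
from collections import Counter

def rocchio_update(profile, doc_terms, relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new term-weight profile after one feedback step.

    profile   : dict mapping term -> weight (the current user profile)
    doc_terms : list of terms in the incoming document (e.g. a tweet)
    relevant  : True to move the profile towards the document, False to move away
    """
    doc = Counter(doc_terms)
    step = beta if relevant else -gamma
    # Scale the old profile, then add or subtract the document's term counts.
    updated = {term: alpha * weight for term, weight in profile.items()}
    for term, tf in doc.items():
        updated[term] = updated.get(term, 0.0) + step * tf
    # Keep only terms whose weight is still positive.
    return {term: weight for term, weight in updated.items() if weight > 0}

def score(profile, doc_terms):
    """Score a document as the sum of profile weights of its distinct terms."""
    return sum(profile.get(term, 0.0) for term in set(doc_terms))

profile = {"earthquake": 1.0}
profile = rocchio_update(profile, ["earthquake", "tsunami"], relevant=True)
profile = rocchio_update(profile, ["tsunami", "movie"], relevant=False)
```

A filtering system built this way would typically emit a tweet when `score` exceeds a threshold; the threshold and its tuning are outside this sketch.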
Probabilistic solutions of influence propagation on social networks 429-438
  Miao Zhang; Chunni Dai; Chris Ding; Enhong Chen
Given fixed budgets, companies attempt to obtain maximum coverage on a social network by targeting influential individuals. This viral marketing is often modeled by the independent cascade model. However, identifying the most influential people by computing influence spread is NP-hard, and various approximate algorithms have been developed. In this paper, we emphasize the probabilistic nature of influence propagation. We propose to use exact probabilistic solutions and prove an inclusion-exclusion principle for computing influence spread. Our probabilistic solutions can significantly speed up the computation of influence spread. We also give a probabilistic-additive incremental search strategy to solve the influence maximization problem, i.e., to find a subset of individuals that has the largest influence spread in the end. Experiments on real data sets demonstrate the effectiveness and efficiency of our methods.
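Illustrative sketch (not the paper's method): the independent cascade model mentioned above is commonly evaluated by Monte Carlo simulation, which is the kind of estimate that exact probabilistic solutions aim to replace. The snippet assumes a graph stored as a dict from node to a list of `(neighbor, activation_probability)` pairs; that representation and all names are hypothetical.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets one chance to activate each neighbor.
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Average cascade size over `runs` simulations (a Monte Carlo estimate)."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs

graph = {"a": [("b", 0.5), ("c", 0.2)], "b": [("c", 0.5)]}
spread = estimate_spread(graph, ["a"])
```

The estimate is noisy for small `runs`, which is why simulation-based approaches tend to need many repetitions per candidate seed set.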
Improving pseudo-relevance feedback via tweet selection 439-448
  Taiki Miyanishi; Kazuhiro Seki; Kuniaki Uehara
Query expansion methods using pseudo-relevance feedback have been shown effective for microblog search because they can solve vocabulary mismatch problems often seen in searching short documents such as Twitter messages (tweets), which are limited to 140 characters. Pseudo-relevance feedback assumes that the top ranked documents in the initial search results are relevant and that they contain topic-related words appropriate for relevance feedback. However, those assumptions do not always hold in reality because the initial search results often contain many irrelevant documents. In such a case, only a few of the suggested expansion words may be useful with many others being useless or even harmful. To overcome the limitation of pseudo-relevance feedback for microblog search, we propose a novel query expansion method based on two-stage relevance feedback that models search interests by manual tweet selection and integration of lexical and temporal evidence into its relevance model. Our experiments using a corpus of microblog data (the Tweets2011 corpus) demonstrate that the proposed two-stage relevance feedback approaches considerably improve search result relevance over almost all topics.
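Illustrative sketch (not the paper's two-stage method): plain pseudo-relevance feedback, as described in the abstract, treats the top-ranked documents as relevant and borrows frequent terms from them. The snippet assumes documents are already tokenized and ranked; `expand_query`, `k`, and `n_terms` are hypothetical names.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, k=3, n_terms=2):
    """Append the n_terms most common non-query terms from the top-k documents."""
    counts = Counter()
    for doc in ranked_docs[:k]:
        # Count each term once per document (document frequency in the top k).
        counts.update(t for t in set(doc) if t not in query_terms)
    # Sort by descending count, breaking ties alphabetically for determinism.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return list(query_terms) + [term for term, _ in best]

docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone"],
    ["apple", "pie"],
    ["unrelated"],
]
expanded = expand_query(["apple"], docs)
```

If the top-ranked documents are off-topic (here, the "pie" tweet), their terms can leak into the expansion, which is the failure mode the abstract points to.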
Supporting exploratory people search: a study of factor transparency and user control 449-458
  Shuguang Han; Daqing He; Jiepu Jiang; Zhen Yue
People search has been an active research topic in recent years. Related work includes expert finding, collaborator recommendation, link prediction and social matching. However, the diverse objectives and exploratory nature of those tasks make it difficult to develop a flexible method for people search that works for every task. In this project, we developed PeopleExplorer, an interactive people search system to support exploratory search tasks when looking for people. In the system, users could specify their task objectives by selecting and adjusting key criteria. Three criteria were considered: the content relevance, the candidate authoritativeness and the social similarity between the user and the candidates. This project represents a first attempt to add transparency to exploratory people search, and to give users full control over the search process. The system was evaluated through an experiment with 24 participants undertaking four different tasks. The results show that with comparable time and effort, users of our system performed significantly better in their people search tasks than those using the baseline system. Users of our system also exhibited many unique behaviors in query reformulation and candidate selection. We found that users' general perceptions of the three criteria varied across tasks, which confirms our assumptions regarding modeling task difference and user variance in people search systems.

KM track: social networks (2)

Location prediction in social media based on tie strength 459-468
  Jeffrey McGee; James Caverlee; Zhiyuan Cheng
We propose a novel network-based approach for location estimation in social media that integrates evidence of the social tie strength between users for improved location estimation. Concretely, we propose a location estimator -- FriendlyLocation -- that leverages the relationship between the strength of the tie between a pair of users, and the distance between the pair. Based on an examination of over 100 million geo-encoded tweets and 73 million Twitter user profiles, we identify several factors such as the number of followers and how the users interact that can strongly reveal the distance between a pair of users. We use these factors to train a decision tree to distinguish between pairs of users who are likely to live nearby and pairs of users who are likely to live in different areas. We use the results of this decision tree as the input to a maximum likelihood estimator to predict a user's location. We find that this proposed method significantly improves the results of location estimation relative to a state-of-the-art technique. Our system reduces the average error distance for 80% of Twitter users from 40 miles to 21 miles using only information from the user's friends and friends-of-friends, which has great significance for augmenting traditional social media and enriching location-based services with more refined and accurate location estimates.
To stay or not to stay: modeling engagement dynamics in social graphs 469-478
  Fragkiskos D. Malliaros; Michalis Vazirgiannis
Given a large social graph, how can we model the engagement properties of nodes? Can we quantify engagement both at node level as well as at graph level? Typically, engagement refers to the degree that an individual participates (or is encouraged to participate) in a community and is closely related to the important property of nodes' departure dynamics, i.e., the tendency of individuals to leave the community. In this paper, we build upon recent work in the field of game theory, where the behavior of individuals (nodes) is modeled by a technology adoption game. That is, the decision of a node to remain engaged in the graph is affected by the decision of its neighbors, and the "best practice" for each individual is captured by its core number -- as arises from the k-core decomposition. After modeling and defining the engagement dynamics at node and graph level, we examine whether they depend on structural and topological features of the graph. We perform experiments on a multitude of real graphs, observing interesting connections with other graph characteristics, as well as a clear deviation from the corresponding behavior of random graphs. Furthermore, similar to the well known results about the robustness of real graphs under random and targeted node removals, we discuss the implications of our findings on a special case of robustness -- regarding random and targeted node departures based on their engagement level.
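Illustrative sketch (not taken from the paper): the abstract relies on each node's core number from the k-core decomposition. One simple way to compute core numbers is to repeatedly peel off a node of minimum remaining degree. The version below favours clarity over speed (it rescans for the minimum each round) and assumes an undirected graph stored as a dict of neighbor sets; all names are hypothetical.

```python
def core_numbers(adj):
    """Return a dict node -> core number for an undirected graph.

    adj maps every node to the set of its neighbors (symmetric).
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        u = min(remaining, key=degree.__getitem__)
        # The core level never decreases as peeling proceeds.
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core

# A triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cores = core_numbers(adj)
```

Here the triangle nodes end up with core number 2 and the pendant node with core number 1.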
UNIK: unsupervised social network spam detection 479-488
  Enhua Tan; Lei Guo; Songqing Chen; Xiaodong Zhang; Yihong Zhao
Social network spam increases explosively with the rapid development and wide usage of various social networks on the Internet. To detect spam in large social network sites in a timely manner, it is desirable to discover unsupervised schemes that can save the training cost of supervised schemes. In this work, we first show several limitations of existing unsupervised detection schemes. The main reason behind the limitations is that existing schemes heavily rely on spamming patterns that are constantly changing to avoid detection. Motivated by our observations, we first propose a sybil defense based spam detection scheme SD2 that remarkably outperforms existing schemes by taking the social network relationship into consideration. In order to make it highly robust in facing an increased level of spam attacks, we further design an unsupervised spam detection scheme, called UNIK. Instead of detecting spammers directly, UNIK works by deliberately removing non-spammers from the network, leveraging both the social graph and the user-link graph. The underpinning of UNIK is that while spammers constantly change their patterns to evade detection, non-spammers do not have to do so and thus have a relatively non-volatile pattern. UNIK has comparable performance to SD2 when it is applied to a large social network site, and outperforms SD2 significantly when the level of spam attacks increases. Based on detection results of UNIK, we further analyze several identified spam campaigns in this social network site. The result shows that different spammer clusters demonstrate distinct characteristics, implying the volatility of spamming patterns and the ability of UNIK to automatically extract spam signatures.
Modeling dynamics of meta-populations with a probabilistic approach: global diffusion in social media 489-498
  Minkyoung Kim; David Newth; Peter Christen
Increasingly, diverse online social networks are locally and globally interconnected by sharing information in the Web ecosystem. Accordingly, emergent macro-level phenomena have been observed, such as global spread of news across different types of social media. Such real-world diffusion is hard to define with a single social platform alone since dynamic influences between heterogeneous social networks are not negligible. Also, the underlying structural property of networks is important, as it drives the diffusion process in a stochastic way. In this paper, we propose a macro-level diffusion model with a probabilistic approach by combining both heterogeneity and structural connectivity of social networks. As real-world phenomena, we take cases from news diffusion across News, social networking sites (SNS), and Blog media using the ICWSM'11 Spinn3r dataset which contains over 386 million Web documents covering a one-month period in early 2011. We find that influence between different media types is varied by context of information. News media are the most influential in the Arts and Economy categories, while SNS and Blog media are in the Politics and Culture categories, respectively. Also, controversial topics such as political protests and multiculturalism failure tend to spread concurrently across social media, while entertainment topics such as film releases and celebrities are likely driven by internal interactions within single social platforms. We expect that the proposed model applies to a wider class of diffusion phenomena in diverse fields including the social sciences, marketing, and neuroscience, and that it provides a way of interpreting dynamics of meta-populations in terms of strength and directionality of influences among them.
Diffusion of innovations revisited: from social network to innovation network 499-508
  Xin Rong; Qiaozhu Mei
The spreading of innovations among individuals and organizations in a social network has been extensively studied. Although the recent studies among the social computing and data mining communities have produced various insightful conclusions about the diffusion process of innovations by focusing on the properties and evolution of social network structures, less attention has been paid to the interrelationships among the multiple innovations being diffused, such as the competitive and collaborative relationships between innovations. In this paper, we take a formal quantitative approach to address how different pieces of innovations socialize with each other and how the interrelationships among innovations affect users' adoption behavior, which provides a novel perspective of understanding the diffusion of innovations. Networks of innovations are constructed by mining large scale text collections in an unsupervised fashion. We are particularly interested in the following questions: what are the meaningful metrics on the network of innovations? What effects do these metrics exert on the diffusion of innovations? Do these effects vary among users with different adoption preferences or communication styles? While existing studies primarily address social influence, we provide a detailed discussion of how innovations interrelate and influence the diffusion process.

KM track: mining big data

StaticGreedy: solving the scalability-accuracy dilemma in influence maximization 509-518
  Suqi Cheng; Huawei Shen; Junming Huang; Guoqing Zhang; Xueqi Cheng
Influence maximization, defined as a problem of finding a set of seed nodes to trigger a maximized spread of influence, is crucial to viral marketing on social networks. For practical viral marketing on large scale social networks, influence maximization algorithms must have both guaranteed accuracy and high scalability. However, existing algorithms suffer a scalability-accuracy dilemma: conventional greedy algorithms guarantee the accuracy with expensive computation, while the scalable heuristic algorithms suffer from unstable accuracy.
   In this paper, we focus on solving this scalability-accuracy dilemma. We point out that the essential reason for the dilemma is the surprising fact that the submodularity, a key requirement of the objective function for a greedy algorithm to approximate the optimum, is not guaranteed in all conventional greedy algorithms in the literature of influence maximization. Therefore a greedy algorithm has to afford a huge number of Monte Carlo simulations to reduce the pain caused by unguaranteed submodularity. Motivated by this critical finding, we propose a static greedy algorithm, named StaticGreedy, to strictly guarantee the submodularity of influence spread function during the seed selection process. The proposed algorithm reduces the computational expense by two orders of magnitude without loss of accuracy. Moreover, we propose a dynamical update strategy which can speed up the StaticGreedy algorithm by 2-7 times on large scale social networks.
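Illustrative sketch (a simplified reading of the idea, not the authors' implementation): one way to keep the objective submodular during greedy selection is to sample a fixed set of graph snapshots once and evaluate every candidate seed on those same snapshots. The code below assumes an independent-cascade style graph given as a dict from node to `(neighbor, probability)` pairs; the function names, parameters, and the brute-force reachability step are all hypothetical simplifications.

```python
import random

def sample_snapshots(graph, runs, seed=0):
    """Sample `runs` static snapshots, keeping each edge with its probability."""
    rng = random.Random(seed)
    return [
        {u: [v for v, p in nbrs if rng.random() < p] for u, nbrs in graph.items()}
        for _ in range(runs)
    ]

def reachable(snapshot, start):
    """Return the set of nodes reachable from `start` in one snapshot."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def greedy_seeds(graph, nodes, k, runs=100, seed=0):
    """Pick up to k seeds greedily by marginal coverage over fixed snapshots."""
    snapshots = sample_snapshots(graph, runs, seed)
    reach = [{u: reachable(s, u) for u in nodes} for s in snapshots]
    covered = [set() for _ in snapshots]
    seeds = []
    for _ in range(min(k, len(nodes))):
        candidates = [u for u in nodes if u not in seeds]
        # Marginal gain: newly covered nodes, summed over all snapshots.
        best = max(
            candidates,
            key=lambda u: sum(len(r[u] - c) for r, c in zip(reach, covered)),
        )
        seeds.append(best)
        for r, c in zip(reach, covered):
            c |= r[best]
    return seeds

graph = {"a": [("b", 1.0)], "b": [("c", 1.0)], "d": [("e", 1.0)]}
chosen = greedy_seeds(graph, ["a", "b", "c", "d", "e"], k=2, runs=5)
```

Because the snapshots never change between iterations, the coverage function being maximized is a sum of set-coverage functions and is therefore submodular by construction.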
Online multitasking and user engagement 519-528
  Janette Lehmann; Mounia Lalmas; Georges Dupret; Ricardo Baeza-Yates
Users often access and re-access more than one site during an online session, effectively engaging in multitasking. In this paper, we study the effect of online multitasking on two widely used engagement metrics designed to capture users' browsing behavior with a site. Our study is based on browsing data of 2.5M users across 760 sites encompassing diverse types of services such as social media, news and mail. To account for multitasking we need to redefine how user sessions are represented and we need to adapt the metrics under study. We introduce a new representation of user sessions: tree-streams -- as opposed to the commonly used click-streams -- present a more accurate picture of the browsing behavior of a user that includes how users switch between sites (e.g., hyperlinking, teleporting, backpaging). We then discuss a number of insights on multitasking patterns, and show how these help to better understand how users engage with sites. Finally, we define metrics that characterize multitasking during online sessions and show how they provide additional insights to standard engagement metrics.
PATRIC: a parallel algorithm for counting triangles in massive networks 529-538
  Shaikh Arifuzzaman; Maleq Khan; Madhav Marathe
Massive networks arising in numerous application areas pose significant challenges for network analysts as these networks grow to billions of nodes and are prohibitively large to fit in the main memory. Finding the number of triangles in a network is an important problem in the analysis of complex networks. Several interesting graph mining applications depend on the number of triangles in the graph. In this paper, we present an efficient MPI-based distributed memory parallel algorithm, called PATRIC, for counting triangles in massive networks. PATRIC scales well to networks with billions of nodes and can compute the exact number of triangles in a network with one billion nodes and 10 billion edges in 16 minutes. Balancing computational loads among processors for a graph problem like counting triangles is a challenging issue. We present and analyze several schemes for balancing load among processors for the triangle counting problem. These schemes achieve very good load balancing. We also show how our parallel algorithm can adapt an existing edge sparsification technique to approximate the number of triangles with very high accuracy. This modification allows us to count triangles in even larger networks.
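Illustrative sketch (a serial baseline, not PATRIC itself): exact triangle counting is often done by ordering nodes by degree and intersecting, for each edge, the sets of higher-ranked neighbors, so that every triangle is counted exactly once. The snippet assumes an undirected edge list; the function name and data layout are hypothetical.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Rank nodes by degree so that work concentrates on low-degree nodes.
    order = sorted(adj, key=lambda u: len(adj[u]))
    rank = {u: i for i, u in enumerate(order)}
    # Keep only neighbors that come later in the ordering.
    higher = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in adj}
    total = 0
    for u in adj:
        for v in higher[u]:
            # Common higher-ranked neighbors close a triangle u-v-w.
            total += len(higher[u] & higher[v])
    return total

k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
triangles = count_triangles(k4)
```

A distributed version would additionally have to partition `higher` across processors and balance the intersection work, which is the part the abstract identifies as challenging.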
An efficient MapReduce algorithm for counting triangles in a very large graph 539-548
  Ha-Myung Park; Chin-Wan Chung
The triangle counting problem is one of the fundamental problems in various domains. It can be used to compute the clustering coefficient, transitivity, triangular connectivity, trusses, etc. The problem has been extensively studied in internal memory, but the algorithms are not scalable to enormous graphs. In recent years, MapReduce has emerged as a de facto standard framework for processing large data through parallel computing. A MapReduce algorithm was proposed for the problem based on graph partitioning. However, the algorithm redundantly generates a large amount of intermediate data, which causes network overload and prolongs the processing time. In this paper, we propose a new algorithm based on graph partitioning with a novel idea of triangle classification to count the number of triangles in a graph. The algorithm substantially reduces the duplication by classifying triangles into three types and processing each triangle differently according to its type. In the experiments, we compare the proposed algorithm with recent existing algorithms using both synthetic datasets and real-world datasets that are composed of millions of nodes and billions of edges. The proposed algorithm outperforms other algorithms in most cases. In particular, for a Twitter dataset, the proposed algorithm is more than twice as fast as existing MapReduce algorithms. Moreover, the performance gap increases as the graph becomes larger and denser.
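Illustrative sketch (not taken from the paper): the abstract lists transitivity among the quantities derived from triangle counts. Using the standard definition, transitivity is three times the number of triangles divided by the number of connected triples (wedges). The snippet assumes an undirected graph stored as a dict of neighbor sets; the names are hypothetical.

```python
from itertools import combinations

def transitivity(adj):
    """Return 3 * triangles / wedges for an undirected graph (0.0 if no wedges)."""
    wedges = 0
    closed = 0  # wedges whose two endpoints are also adjacent
    for center, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            wedges += 1
            if w in adj[v]:
                closed += 1
    # Every triangle is seen once per corner, so `closed` equals 3 * triangles.
    return closed / wedges if wedges else 0.0

# A triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
value = transitivity(adj)
```

For this small graph there are five wedges, three of which are closed, giving a transitivity of 0.6.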
Parallel motif extraction from very long sequences 549-558
  Majed Sahli; Essam Mansour; Panos Kalnis
Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that focuses on collections of many short sequences, modern applications require mining of motifs in one very long sequence (i.e., in the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate; or combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs.
   This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer.

KM track: ontologies

The logical diversity of explanations in OWL ontologies 559-568
  Samantha Bail; Bijan Parsia; Ulrike Sattler
Given the high expressivity of the Web Ontology Language OWL 2, there is a potential for great diversity in the logical content of OWL ontologies. The fact that many naturally occurring entailments of such ontologies have multiple justifications indicates that ontologies often overdetermine their consequences, suggesting a diversity in supporting reasons. On closer inspection, however, we often find that justifications -- even for multiple entailments -- appear to be structurally similar, suggesting that their multiplicity might be due to diverse material, not formal grounds for an entailment.
   In this paper, we introduce and explore several equivalence relations over justifications for entailments of OWL ontologies which partition a set of justifications into structurally similar subsets. These equivalence relations range from strict isomorphism to looser notions of similarity, covering justifications which contain different class expressions, or even different numbers of axioms. We present the results of a survey of 78 ontologies from the biomedical domain which shows that OWL ontologies used in practice often contain large numbers of structurally similar justifications. We find that a large justification corpus can be reduced by 97% of its original size to a small core of frequently occurring justification templates.
Ontology authoring with FORZA 569-578
  C. Maria Keet; Muhammad Tahir Khan; Chiara Ghidini
Generic, reusable ontology elements, such as a foundational ontology's categories and part-whole relations, are essential for good and interoperable knowledge representation. Ontology developers, who include domain experts and novices, face the challenge of figuring out which category or relationship to choose for their ontology authoring task. To reduce this bottleneck, guidance is needed for handling these ontology-laden entities. We solve this with a generic approach and realize it with the Foundational Ontology and Reasoner-enhanced axiomatiZAtion (FORZA) method, containing DOLCE, a decision diagram for DOLCE categories, part-whole relations, and an automated reasoner that is used during the authoring process to propose feasible axioms. This fusion has been integrated in the MoKi ontology development tool to validate its implementability.
Aligning Freebase with the YAGO ontology 579-588
  Elena Demidova; Iryna Oelze; Wolfgang Nejdl
Linked Open Data (LOD) has emerged as the de-facto standard for publishing data on the Web. The cross-domain large scale Freebase and YAGO datasets represent central hubs and reference points for the LOD cloud. Freebase is an open-world dataset, which contains about 22 million entities and more than 350 million facts in more than 100 domains. The scale of Freebase makes it difficult for the users to get an overview of the data and efficiently retrieve the desired information. Integration of Freebase with the YAGO ontology that contains more than 360,000 concepts enables us to provide more semantic information for Freebase and to facilitate novel applications, such as efficient query construction, over large scale data. In this paper we analyze the structure of YAGO in more depth and show how to match YAGO and Freebase categories. The new YAGO+F structure that results from our matching tightly connects both datasets and provides an important next step to systematically interconnect LOD subcollections. We make our YAGO+F structure available online in the hope that it can provide a good starting point for future applications, which can build upon a wide variety of Freebase data clearly arranged in the semantic categories of YAGO.
PIDGIN: ontology alignment using web text as interlingua 589-598
  Derry Wijaya; Partha Pratim Talukdar; Tom Mitchell
The problem of aligning ontologies and database schemas across different knowledge bases and databases is fundamental to knowledge management problems, including the problem of integrating the disparate knowledge sources that form the semantic web's Linked Data [5].
   We present a novel approach to this ontology alignment problem that employs a very large natural language text corpus as an interlingua to relate different knowledge bases (KBs). The result is a scalable and robust method (PIDGIN) that aligns relations and categories across different KBs by analyzing both (1) shared relation instances across these KBs, and (2) the verb phrases in the text instantiations of these relation instances. Experiments with PIDGIN demonstrate its superior performance when aligning ontologies across large existing KBs including NELL, Yago and Freebase. Furthermore, we show that in addition to aligning ontologies, PIDGIN can automatically learn from text the verb phrases that identify relations, and can also type the arguments of relations of different KBs.
Mapping adaptation actions for the automatic reconciliation of dynamic ontologies 599-608
  Julio Cesar Dos Reis; Duy Dinh; Cédric Pruski; Marcos Da Silveira; Chantal Reynaud-Delaître
The highly dynamic nature of domain ontologies has a direct impact on semantic mappings established between concepts from different ontologies. Mappings must therefore be maintained according to ongoing ontology changes. Since many software applications exploit mappings for managing information and knowledge, it is important to define appropriate adaptation strategies to apply to existing mappings in order to preserve their validity over time. In this article, we propose a set of mapping adaptation actions and present how they are used to keep mappings up to date based on ontology change operations of different natures. We conduct an experimental evaluation using life sciences ontologies and mappings. We measure the evolution of mappings based on the proposed approach to mapping adaptation. The results confirm that mappings must be individually adapted according to the different types of ontology change.

KM track: mobile and event mining

On mining mobile apps usage behavior for predicting apps usage in smartphones 609-618
  Zhung-Xun Liao; Yi-Chin Pan; Wen-Chih Peng; Po-Ruey Lei
Predicting Apps usage has become an important task due to the proliferation and complexity of Apps. However, previous research works utilized a considerable number of different sensors as training data to infer Apps usage. To save energy in the task of predicting Apps usage, only temporal information is considered in this paper. We propose a Temporal-based Apps Predictor (abbreviated as TAP) to dynamically predict the Apps which are most likely to be used. First, we extract three Apps usage features, namely the global usage feature, the temporal usage feature, and the periodical usage feature, from the Apps usage trace. Then, based on those explored features, we dynamically derive an Apps usage probability model to estimate the current usage probability of each App in each feature. Finally, we investigate the usage probability in each feature and select the k Apps with the highest usage probability from the probability model. In this paper, we propose two selection algorithms, MaxProb and MinEntropy. To evaluate the performance of TAP, we use two real mobile Apps usage traces and assess the accuracy and efficiency. The experimental results show that the proposed TAP with the MinEntropy selection algorithm can achieve a shorter response time for Apps prediction. Moreover, the accuracy reaches 80% when k is 5, and when k is 7, the accuracy reaches almost 100% on both real datasets.
Ranking fraud detection for mobile apps: a holistic view 619-628
  Hengshu Zhu; Hui Xiong; Yong Ge; Enhong Chen
Ranking fraud in the mobile App market refers to fraudulent or deceptive activities which have a purpose of bumping up the Apps in the popularity list. Indeed, it becomes more and more frequent for App developers to use shady means, such as inflating their Apps' sales or posting phony App ratings, to commit ranking fraud. While the importance of preventing ranking fraud has been widely recognized, there is limited understanding and research in this area. To this end, in this paper, we provide a holistic view of ranking fraud and propose a ranking fraud detection system for mobile Apps. Specifically, we investigate two types of evidence, ranking-based evidence and rating-based evidence, by modeling Apps' ranking and rating behaviors through statistical hypothesis tests. In addition, we propose an optimization based aggregation method to integrate all the evidence for fraud detection. Finally, we evaluate the proposed system with real-world App data collected from Apple's App Store over a long time period. In the experiments, we validate the effectiveness of the proposed system, and show the scalability of the detection algorithm as well as some regularity of ranking fraud activities.
AnchorMF: towards effective event context identification 629-638
  Hansu Gu; Mike Gartrell; Liang Zhang; Qin Lv; Dirk Grunwald
Online social networks (OSNs) such as Twitter provide a good platform for event discussions. Recent research [26][25] has shown that event discussions in OSNs are diverse and innovative and encourage public engagement in events. Although much research has been conducted in OSNs to track and detect events, there has been limited research on detecting or understanding the event context. Event context helps to better predict users' participation in events, identify relations among events, and recommend friends who share similar event context.
   In this work, we have developed AnchorMF, a matrix factorization based technique that aims to identify event context by leveraging a prevalent feature in OSNs, the anchor information. Our AnchorMF work makes three key contributions: (1) a formal definition of the event context identification problem; (2) anchor selection and incorporation into the matrix factorization process for effective event context identification; and (3) demonstration of applying event context for user-event participation prediction, relevant events retrieval, and friendship recommendation. Evaluation based on 1.1 million Twitter users over a one-month data collection period shows that AnchorMF achieves a 20.0% improvement in terms of user-event participation prediction.
How the live web feels about events 639-648
  George Valkanas; Dimitrios Gunopulos
Microblogging platforms, such as Twitter, Tumblr etc., have been established as key components in the contemporary Web ecosystem. Users constantly post snippets of information regarding their actions, interests or perception of their surroundings, which is why they have been attributed the term Live Web. Nevertheless, research on such platforms has been quite limited when it comes to identifying events, but is rapidly gaining ground. Event identification is a key step to news reporting, proactive or reactive crisis management at multiple scales, efficient resource allocation, etc. In this paper, we focus on the problem of automatically identifying events as they occur, in such a user-driven, fast paced and voluminous setting. We propose a novel and natural way to address the issue using notions from emotional theories, combined with spatiotemporal information and employ online event detection mechanisms to solve it at large scale in a distributed fashion. We present a modular framework that incorporates all of our key ideas and experimentally validate its superiority, in terms of both efficiency and effectiveness, over the state-of-the-art using real life data from the Twitter stream. We also present empirical evidence on the importance of spatiotemporal information in event detection for this setting.
Boolean satisfiability for sequence mining 649-658
  Said Jabbour; Lakhdar Sais; Yakoub Salhi
In this paper, we propose a SAT-based encoding for the problem of discovering frequent, closed and maximal patterns in a sequence of items and a sequence of itemsets. Our encoding can be seen as an improvement of the approach proposed in [8] for the sequences of items. In this case, we show experimentally on real world data that our encoding is significantly better. Then we introduce a new extension of the problem to enumerate patterns in a sequence of itemsets. Thanks to the flexibility and to the declarative aspects of our SAT-based approach, an encoding for the sequences of itemsets is obtained by a very slight modification of that for the sequences of items.

IR track: evaluation

Users versus models: what observation tells us about effectiveness metrics 659-668
  Alistair Moffat; Paul Thomas; Falk Scholer
Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. In the second approach, the effectiveness metric is chosen in the belief that user task performance, if it were to be measured by the first approach, should be linked to the score provided by the metric.
   This work explores that link, by analyzing the assumptions and implications of a number of effectiveness metrics, and exploring how these relate to observable user behaviors. Data recorded as part of a user study included user self-assessment of search task difficulty; gaze position; and click activity. Our results show that user behavior is influenced by a blend of many factors, including the extent to which relevant documents are encountered, the stage of the search process, and task difficulty. These insights can be used to guide development of batch effectiveness metrics.
Evaluating aggregated search using interleaving 669-678
  Aleksandr Chuklin; Anne Schuth; Katja Hofmann; Pavel Serdyukov; Maarten de Rijke
A result page of a modern web search engine is often much more complicated than a simple list of "ten blue links." In particular, a search engine may combine results from different sources (e.g., Web, News, and Images), and display these as grouped results to provide a better user experience. Such a system is called an aggregated or federated search system.
   Because search engines evolve over time, their results need to be constantly evaluated. However, one of the most efficient and widely used evaluation methods, interleaving, cannot be directly applied to aggregated search systems, as it ignores the need to group results originating from the same source (vertical results).
   We propose an interleaving algorithm that allows comparisons of search engine result pages containing grouped vertical documents. We compare our algorithm to existing interleaving algorithms and other evaluation methods (such as A/B-testing), both on real-life click log data and in simulation experiments. We find that our algorithm allows us to perform unbiased and accurate interleaved comparisons that are comparable to conventional evaluation techniques. We also show that our interleaving algorithm produces a ranking that does not substantially alter the user experience, while being sensitive to changes in both the vertical result block and the non-vertical document rankings. All this makes our proposed interleaving algorithm an essential tool for comparing IR systems with complex aggregated pages.
Using historical click data to increase interleaving sensitivity 679-688
  Eugene Kharitonov; Craig Macdonald; Pavel Serdyukov; Iadh Ounis
Interleaving is an online evaluation method to compare two alternative ranking functions based on the users' implicit feedback. In an interleaving experiment, the results from two ranking functions are merged in a single result list and presented to the users. The users' click feedback on the merged result list is analysed to derive preferences over the ranking functions. An important property of interleaving methods is their sensitivity, i.e. their ability to reliably derive the comparison outcome with a relatively small amount of user behaviour data. This allows testing of changes in the search engine ranking functions frequently and, as a result, rapid iterations in developing search quality improvements can be achieved. In this paper we propose a novel approach to further improve interleaving sensitivity by using pre-experimental user behaviour data. In particular, the click history is used to train a click model, which is then used to predict which interleaved result pages are likely to contribute to the experiment outcome. The probabilities of presenting these interleaved result pages to the users are then optimised, such that the sensitivity of interleaving is maximised. In order to evaluate the proposed approach, we re-use data from six actual interleaving experiments, previously performed by a commercial search engine. Our results demonstrate that the proposed approach outperforms a state-of-the-art baseline, achieving up to a median of 48% reduction in the number of impressions for the same level of confidence.
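   The merge-and-credit mechanics described in this abstract can be illustrated with team-draft interleaving, a widely used interleaving method. This is a minimal sketch for orientation only: it is not the sensitivity-optimised approach the paper proposes, and the function names `team_draft_interleave` and `preference` are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (merged list, team label per document)."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    all_docs = set(ranking_a) | set(ranking_b)
    merged, team = [], {}
    picks = {"A": 0, "B": 0}
    while len(merged) < len(all_docs):
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        # Take that team's highest-ranked document not yet in the merged list.
        pick = next((d for d in rankings[side] if d not in team), None)
        if pick is None:
            # This ranking is exhausted, so the other team picks instead.
            side = "B" if side == "A" else "A"
            pick = next(d for d in rankings[side] if d not in team)
        merged.append(pick)
        team[pick] = side
        picks[side] += 1
    return merged, team


def preference(team, clicked_docs):
    """Credit each click to the team that contributed the clicked document."""
    a = sum(1 for d in clicked_docs if team.get(d) == "A")
    b = sum(1 for d in clicked_docs if team.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"
```

   Aggregating `preference` outcomes over many impressions gives the comparison between the two ranking functions; sensitivity is about how few impressions are needed before that aggregate becomes reliable.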
On the reliability and intuitiveness of aggregated search metrics 689-698
  Ke Zhou; Mounia Lalmas; Tetsuya Sakai; Ronan Cummins; Joemon M. Jose
Aggregating search results from a variety of diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) the topical relevance assessments of results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e. various aspects of a user's perception of AS result pages). Our study shows that the AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.
User intent and assessor disagreement in web search evaluation 699-708
  Gabriella Kazai; Emine Yilmaz; Nick Craswell; S. M. M. Tahaghoghi
Preference based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known as to why preference judging reduces assessor disagreement and whether better agreement among assessors also means better agreement with user satisfaction, as signaled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single absolute, pairwise absolute and pairwise preference based judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers, but inter-assessor agreement does not mean agreement with users. Switching to a pairwise judging mode improves crowdsourcing quality close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, where the nature of the relationship changes across judging modes. Overall, our findings suggest that the awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.

IR track

The water filling model and the cube test: multi-dimensional evaluation for professional search 709-714
  Jiyun Luo; Christopher Wing; Hui Yang; Marti Hearst
Professional search activities such as patent and legal search are often time sensitive and consist of rich information needs with multiple aspects or subtopics. This paper proposes a 3D water filling model to describe this search process, and derives a new evaluation metric, the Cube Test, to encompass the complex nature of professional search. The new metric is compared against state-of-the-art patent search evaluation metrics as well as Web search evaluation metrics over two distinct patent datasets. The experimental results show that the Cube Test metric effectively captures the characteristics and requirements of professional search.
Disinformation techniques for entity resolution 715-720
  Steven Euijong Whang; Hector Garcia-Molina
We study the problem of disinformation. We assume that an "agent" has some sensitive information that the "adversary" is trying to obtain. For example, a camera company (the agent) may secretly be developing its new camera model, and a user (the adversary) may want to know in advance the detailed specs of the model. The agent's goal is to disseminate false information to "dilute" what is known by the adversary. We model the adversary as an Entity Resolution (ER) process that pieces together available information. We formalize the problem of finding the disinformation with the highest benefit given a limited budget for creating the disinformation and propose efficient algorithms for solving the problem. We then evaluate our disinformation planning algorithms on real and synthetic data and compare the robustness of existing ER algorithms. In general, our disinformation techniques can be used as a framework for testing ER robustness.
Location recommendation for out-of-town users in location-based social networks 721-726
  Gregory Ference; Mao Ye; Wang-Chien Lee
Most previous research on location recommendation services in location-based social networks (LBSNs) makes recommendations without considering where the targeted user is currently located. Such services may recommend a place near her hometown even if the user is traveling out of town. In this paper, we study the issues in making location recommendations for out-of-town users by taking into account user preference, social influence and geographical proximity. Accordingly, we propose a collaborative recommendation framework, called User Preference, Proximity and Social-Based Collaborative Filtering (UPS-CF), to make location recommendations for mobile users in LBSNs. We validate our ideas by comprehensive experiments using real datasets collected from Foursquare and Gowalla. By comparing against baseline algorithms and the conventional collaborative filtering approach (and its variants), we show that UPS-CF exhibits the best performance. Additionally, we find that preference derived from similar users is important for in-town users while social influence becomes more important for out-of-town users.
Short text classification by detecting information path 727-732
  Shitao Zhang; Xiaoming Jin; Dou Shen; Bin Cao; Xuetao Ding; Xiaochen Zhang
Short text is becoming ubiquitous in many modern information systems. Due to the shortness and sparseness of short texts, there are fewer informative word co-occurrences among them, which naturally poses great difficulty for classification tasks on such data. To overcome this difficulty, this paper proposes a new way of effectively classifying short texts. Our method is based on a key observation that there usually exist ordered subsets in short texts, termed the "information path" in this work, and that classifying each subset based on the classification results of some previous subsets can yield higher overall accuracy than classifying the entire data set directly. We propose a method to detect the information path and employ it in short text classification. Unlike the state-of-the-art methods, our method does not require any external knowledge or corpus, which usually needs careful fine-tuning; this makes our method easier to apply and more robust on different data sets. Experiments on two real world data sets show the effectiveness of the proposed method and its superiority over the existing methods.
Personalized point-of-interest recommendation by mining users' preference transition 733-738
  Xin Liu; Yong Liu; Karl Aberer; Chunyan Miao
Location-based social networks (LBSNs) offer researchers rich data to study people's online activities and mobility patterns. One important application of such studies is to provide personalized point-of-interest (POI) recommendations to enhance user experience in LBSNs. Previous solutions directly predict users' preference on locations but fail to provide insights about users' preference transitions among locations. In this work, we propose a novel category-aware POI recommendation model, which exploits the transition patterns of users' preference over location categories to improve location recommendation accuracy. Our approach consists of two stages: (1) preference transition (over location categories) prediction, and (2) category-aware POI recommendation. Matrix factorization is employed to predict a user's preference transitions over categories and then her preference on locations in the corresponding categories. Real data based experiments demonstrate that our approach outperforms the state-of-the-art POI recommendation models by at least 39.75% in terms of recall.
Proximity²-aware ranking for textual, temporal, and geographic queries 739-744
  Jannik Strötgen; Michael Gertz
Temporal and geographic information needs are frequent and important but not well served by standard IR systems. Recent approaches address such needs by extracting and normalizing temporal and geographic expressions from documents. They calculate specific scores for the temporal and/or geographic parts of a query. However, all approaches assume independence between the different query parts. In this paper, we present a new model to rank documents according to combined textual, temporal, and geographic queries. The independence assumption between the query parts is eliminated by calculating proximity scores. Thus, documents are regarded to be more relevant if terms and expressions satisfying the different query parts occur close to each other in a document. As our evaluations based on the NTCIR-GeoTime data show, our proposed model outperforms baseline models that do not use proximity information.
Timely crawling of high-quality ephemeral new content 745-750
  Damien Lefortier; Liudmila Ostroumova; Egor Samosvat; Pavel Serdyukov
In this paper, we study the problem of timely finding and crawling of ephemeral new pages, i.e., pages for which user traffic grows very quickly right after they appear, but lasts only for several days (e.g., news, blog and forum posts). Traditional crawling policies do not give any particular priority to such pages and may thus fail to crawl them quickly enough, or even crawl already obsolete content. We thus propose a new metric, designed for this task, which takes into account the decrease of user interest in ephemeral pages over time.
   We show that most ephemeral new pages can be found at a relatively small set of content sources and suggest a method for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide resources between these two activities in an efficient way. We find the adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler an insight about the current user demands. The effectiveness of our approach is finally demonstrated experimentally on real-world data.
LearNext: learning to predict tourists movements 751-756
  Ranieri Baraglia; Cristina Ioana Muntean; Franco Maria Nardini; Fabrizio Silvestri
In this paper, we tackle the problem of predicting the "next" geographical position of a tourist given her history (i.e., the prediction is done according to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. The learning is done on the basis of an object space represented by a 68-dimensional feature vector, specifically designed for tourism related data. Furthermore, we propose a thorough comparison of several methods that are considered state-of-the-art in touristic recommender and trail prediction systems as well as a strong popularity baseline. Experiments show that the methods we propose outperform important competitors and baselines, thus providing strong evidence of the performance of our solutions.
Where shall we go today?: planning touristic tours with tripbuilder 757-762
  Igo Brilhante; Jose Antonio Macedo; Franco Maria Nardini; Raffaele Perego; Chiara Renso
In this paper we propose TripBuilder, a new framework for personalized touristic tour planning. We mine from Flickr the information about the actual itineraries followed by a multitude of different tourists, and we match these itineraries on the touristic Points of Interest available from Wikipedia. The task of planning personalized touristic tours is then modeled as an instance of the Generalized Maximum Coverage problem. Wisdom-of-the-crowds information allows us to derive touristic plans that maximize a measure of interest for the tourist given her preferences and visiting time-budget. Experimental results on three different touristic cities show that our approach is effective and outperforms strong baselines.

DB track: data streams and ranking

Efficient filtering and ranking schemes for finding inclusion dependencies on the web 763-768
  Atsuyuki Morishima; Erika Yumiya; Masami Takahashi; Shigeo Sugimoto; Hiroyuki Kitagawa
Data integrity constraints are fundamental in various applications, such as data management, integration, cleaning, and schema extraction. In this paper, we address the problem of finding inclusion dependencies on the Web. The problem is important because (1) applications of inclusion dependencies, such as data quality management, are beneficial in the Web context, and (2) such dependencies are not explicitly given in general. In our approach, we enumerate pairs of HTML/XML elements that possibly represent inclusion dependencies and then rank the results for verification. First, we propose a bit-based signature scheme to efficiently select candidates (element pairs) in the enumeration process. The signature scheme is unique in that it supports Jaccard containment to deal with the incomplete nature of data on the Web, and preserves the semiorder inclusion relationship among sets of words. Second, we propose a ranking scheme to support a user in checking whether each enumerated pair actually suggests inclusion dependencies. The ranking scheme sorts the enumerated pairs so that we can examine a small number of pairs for simultaneously verifying many pairs.
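   To make the notion of Jaccard containment concrete, the following minimal sketch computes it by brute force over word sets and lists candidate element pairs above a threshold. It does not reproduce the paper's bit-based signature scheme; the function names, the example threshold of 0.8, and the convention for empty sets are assumptions made for illustration.

```python
def jaccard_containment(a, b):
    """Fraction of the words in `a` that also occur in `b`: |a & b| / |a|."""
    a, b = set(a), set(b)
    if not a:
        return 1.0  # assumed convention: an empty set is contained in any set
    return len(a & b) / len(a)


def candidate_pairs(elements, threshold=0.8):
    """Return (x, y) name pairs where x's word set is approximately included in y's.

    `elements` maps a hypothetical element name to its set of words.
    """
    pairs = []
    for x, words_x in elements.items():
        for y, words_y in elements.items():
            if x != y and jaccard_containment(words_x, words_y) >= threshold:
                pairs.append((x, y))
    return pairs
```

   A threshold below 1.0 tolerates a few missing values, which is how containment accommodates incomplete Web data; a signature scheme would serve to prune most pairs before this exact check is run.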
A generic front-stage for semi-stream processing 769-774
  M. Asif Naeem; Gerald Weber; Gillian Dobbie; Christof Lutteroth
Recently, a number of semi-stream join algorithms have been published. The typical system setup for these consists of one fast stream input that has to be joined with a disk-based relation R. These semi-stream join approaches typically perform the join with a limited main memory partition assigned to them, which is generally not large enough to hold the whole relation R. We propose a caching approach that can be used as a front-stage for different semi-stream join algorithms, resulting in significant performance gains for common applications. We analyze our approach in the context of a seminal semi-stream join, MESHJOIN (Mesh Join), and provide a cost model for the resulting semi-stream join algorithm, which we call CMESHJOIN (Cached Mesh Join). The algorithm takes advantage of skewed distributions; this article presents results for Zipfian distributions of the type that appears in many applications.
Scalable diversification of multiple search results 775-780
  Hina A. Khan; Marina Drosou; Mohamed A. Sharaf
The explosion of big data emphasizes the need for scalable data diversification, especially for applications based on web, scientific, and business databases. However, achieving effective diversification in a multi-user environment is a rather challenging task due to the inherent high processing costs of current data diversification techniques. In this paper, we address the concurrent diversification of multiple search results using various approximation techniques that provide orders of magnitude reductions in processing cost, while maintaining comparable quality of diversification as compared to sequential methods. Our extensive experimental evaluation shows the scalability exhibited by our proposed methods under various workload settings.
Parallel triangle counting in massive streaming graphs 781-786
  Kanat Tangwongsan; A. Pavan; Srikanta Tirthapura
The number of triangles in a graph is a fundamental metric widely used in social network analysis, link classification and recommendation, and more. In these applications, modern graphs of interest tend to be both large and dynamic. This paper presents the design and implementation of a fast parallel algorithm for estimating the number of triangles in a massive undirected graph whose edges arrive as a stream. Our algorithm is designed for shared-memory multicore machines and can make efficient use of parallelism and the memory hierarchy. We provide theoretical guarantees on performance and accuracy, and our experiments on real-world datasets show accurate results and substantial speedups compared to an optimized sequential implementation.
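   For reference, the quantity being estimated can be computed exactly on a small edge stream with the sequential sketch below. This is a simple baseline that stores the whole graph, not the paper's parallel estimation algorithm; the function name is hypothetical.

```python
from collections import defaultdict


def count_triangles_stream(edges):
    """Exact triangle count for an undirected graph whose edges arrive one by one."""
    adj = defaultdict(set)
    triangles = 0
    for u, v in edges:
        # Skip self-loops and edges that have already been seen.
        if u == v or v in adj[u]:
            continue
        # Each common neighbour of u and v closes one new triangle.
        triangles += len(adj[u] & adj[v])
        adj[u].add(v)
        adj[v].add(u)
    return triangles
```

   Keeping every adjacency set in memory is what makes this baseline impractical for massive graphs, which is the motivation for estimating the count instead.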
Cache refreshing for online social news feeds 787-792
  Xiao Bai; Flavio P. Junqueira; Adam Silberstein
Several social networking applications enable users to view the events generated by other users, typically friends in the social network, in the form of "news feeds". Friends and events are typically maintained per user and cached in memory to enable efficient generation of news feeds. Caching user friends and events, however, raises concerns about the freshness of news feeds as users may not observe the most recent events when cache content becomes stale. Mechanisms to keep cache content fresh are thus critical for user satisfaction while computing news feeds efficiently through caching.
   We propose a novel cache scheme called SOCR (Social Online Cache Refreshing) for identifying and refreshing cache entries. SOCR refreshes the cache in an online manner and does not require the backend data store to push updates to the cache. SOCR uses a utility-based strategy to accurately identify cache entries that need to be refreshed. The basic idea is to estimate at the time of each request to generate news feed whether refreshing would lead to different results for a news feed. To make such estimation, we model the rates of changes to social networks and events, and assess the performance of SOCR by analyzing datasets from Facebook and Yahoo! News Activity. Our experimental evaluation shows that the utility-based strategy ensures fresh news feeds (43% fewer stales) and efficient news feed responses (51% fewer false positives) compared to the TTL-based strategy. SOCR also reduces data transmission between the backend data store and the cache by 27% compared to a hybrid push-pull cache refreshing scheme.
A new operator for efficient stream-relation join processing in data streaming engines 793-798
  Roozbeh Derakhshan; Abdul Sattar; Bela Stantic
In the last decade, Stream Processing Engines (SPEs) have emerged as a new processing paradigm that can process huge amounts of data while retaining low latency and high throughput. Yet, it is often necessary to join streaming data with traditional databases to provide more contextual information for the end-users and applications. The major problem that we confront is to join the fast arriving stream tuples with the static relation tuples that are on a slow database. This is what we call the Stream-Relation Join (SRJ) problem. Currently, SPEs use a naive tuple-by-tuple approach for SRJ processing where the SPE accesses the database for every incoming tuple. Some SPEs use a cache to avoid accessing the database for every incoming tuple, while others do not because of the stochastic nature of streaming data. In this paper, we propose a new SRJ operator to facilitate SRJ processing regardless of the cache performance using two techniques: batching and out-of-order processing. The proposed operator provides an effective generic solution to the SRJ problem and the cost of incorporating our operator into different SPEs is minimal. Our experiments use a variety of synthetic and real data sets, demonstrating that our operator outperforms the state-of-the-art tuple-by-tuple approach in terms of maximizing the throughput under ordering and memory constraints.
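   The batching idea, as opposed to one database access per tuple, can be sketched as follows. This is an illustration only and not the operator proposed in the paper: it omits out-of-order processing entirely, and `batched_join`, `lookup_many`, and the default batch size are hypothetical names and values.

```python
def _join_batch(batch, lookup_many):
    """Join one batch of (key, payload) tuples using a single relation lookup."""
    keys = {key for key, _ in batch}
    rows = lookup_many(keys)  # one round trip for all distinct keys in the batch
    for key, payload in batch:
        if key in rows:
            yield (key, payload, rows[key])


def batched_join(stream, lookup_many, batch_size=3):
    """Yield (key, payload, relation_row) triples in arrival order.

    `lookup_many` is an assumed callable that takes a set of keys and returns
    a dict holding the relation rows found for those keys.
    """
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield from _join_batch(batch, lookup_many)
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield from _join_batch(batch, lookup_many)
```

   Larger batches amortise the database round trip over more tuples but delay the first result of each batch, which is the throughput versus latency trade-off such an operator has to manage.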
SCISSOR: scalable and efficient reachability query processing in time-evolving hierarchies 799-804
  Phani Rohit Mullangi; Lakshmish Ramaswamy
A time-evolving hierarchy (TEH) consists of multiple snapshots of the hierarchy (collection of one or more trees) as it evolves over time. It is often important to test reachability between a given pair of vertices in an arbitrary (possibly past) snapshot of the hierarchy. While interval-based indexing has been a popular strategy for reachability testing in static hierarchies, a straightforward extension of this strategy to TEHs is impractical because of the exorbitant indexing overheads. In this paper, we propose SCISSOR (selective snapshot indexing with progressive solution refinement), which, to the best of our knowledge, is the first time and space efficient framework for answering reachability queries in TEHs. The main idea here is to maintain indexes only for a selective interspersed subset of TEH snapshots. A query on a non-indexed snapshot will be answered by utilizing the index of a temporally-nearby indexed snapshot and analyzing the structural changes that have occurred between the two snapshots. We also present an experimental study demonstrating the scalability and efficiency of the SCISSOR framework in terms of both indexing costs and query latencies.
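   The interval-based indexing mentioned for static hierarchies can be sketched for a single snapshot as below: each vertex receives a (start, end) interval from a depth-first traversal, and an ancestor's interval encloses those of its descendants. This shows only the static building block, not SCISSOR itself; the function names and the input layout (a dict of child lists plus a list of roots) are assumptions.

```python
def build_intervals(children, roots):
    """Assign each vertex a (start, end) interval via iterative depth-first traversal."""
    start, intervals = {}, {}
    clock = 0
    for root in roots:
        stack = [(root, False)]
        while stack:
            node, exiting = stack.pop()
            if exiting:
                # All descendants are numbered, so close this vertex's interval.
                intervals[node] = (start[node], clock)
            else:
                start[node] = clock
                stack.append((node, True))
                for child in reversed(children.get(node, [])):
                    stack.append((child, False))
            clock += 1
    return intervals


def reaches(intervals, ancestor, descendant):
    """True if `descendant` lies in the subtree rooted at `ancestor` (or is the same vertex)."""
    a_start, a_end = intervals[ancestor]
    d_start, d_end = intervals[descendant]
    return a_start <= d_start and d_end <= a_end
```

   Each query then costs two comparisons, but the index must be rebuilt whenever the tree changes, which is why indexing every snapshot of a TEH becomes expensive.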

KM track: graphs and networks

Towards metric fusion on multi-view data: a cross-view based graph random walk approach 805-810
  Yang Wang; Xuemin Lin; Qing Zhang
Many real-world objects described by multiple attributes or features can be decomposed as multiple "views" (e.g., an image can be described by a color view or a shape view), which often provide complementary information to each other. Learning a metric (similarity measure) for multi-view data is important because of its wide practical applications. However, leveraging multi-view information to produce a good metric is a great challenge, and existing techniques are concerned with pairwise similarities, leading to an undesirable fusion metric and high computational complexity. In this paper, we propose a novel Metric Fusion technique via cross-view graph Random Walk, named MFRW, over multi-view based similarity graphs (with one similarity graph constructed under each view). Instead of using pairwise similarities, we seek a high-order metric yielded by graph random walks over the constructed similarity graphs. Observing that "outlier views" may exist in the fusion process, we incorporate coefficient matrices representing the correlation strength between any two views into MFRW; the resulting method is named WMFRW. The principle of WMFRW is implemented by exploring the "common latent structure" between views. The empirical studies conducted on real-world databases demonstrate that our approach outperforms the state-of-the-art competitors in terms of effectiveness and efficiency.
Discovering latent blockmodels in sparse and noisy graphs using non-negative matrix factorisation 811-816
  Jeffrey Chan; Wei Liu; Andrey Kan; Christopher Leckie; James Bailey; Kotagiri Ramamohanarao
Blockmodelling is an important technique in social network analysis for discovering the latent structure in graphs. A blockmodel partitions the set of vertices in a graph into groups, where there are either many edges or few edges between any two groups. For example, in the reply graph of a question and answer forum, blockmodelling can identify the group of experts by their many replies to questioners, and the group of questioners by their lack of replies among themselves but many replies from experts.
   Non-negative matrix factorisation has been successfully applied to many problems, including blockmodelling. However, these existing approaches can fail to discover the true latent structure when the graphs have strong background noise or are sparse, which is typical of most real graphs. In this paper, we propose a new non-negative matrix factorisation approach that can discover blockmodels in sparse and noisy graphs. We use synthetic and real datasets to show that our approaches have much higher accuracy and comparable running times.
Understanding the roles of sub-graph features for graph classification: an empirical study perspective 817-822
  Ting Guo; Xingquan Zhu
Graph classification concerns the learning of discriminative models, from structured training data, to classify previously unseen graph samples into specific categories, where the main challenge is to explore structural information in the training data to build classifiers. One of the most common graph classification approaches is to use sub-graph features to convert graphs into instance-feature representations, so generic learning algorithms can be applied to derive learning models. Finding good sub-graph features is regarded as an important task for this type of learning approach, yet there is no comprehensive understanding of (1) how effectively sub-graph features can be used for graph classification; (2) how many sub-graph features are sufficient for good classification results; (3) whether the length of the sub-graph features plays a major role in classification; and (4) whether some random sub-graphs can be used for graph representation and classification.
   Motivated by the above concerns, we carry out empirical studies on four real-world graph classification tasks, by using three types of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by using information gain, and random sub-graphs, and by using two types of learning algorithms including Support Vector Machines and Nearest Neighbour. Our experiments show that (1) the discriminative power of sub-graphs varies by their sizes; (2) random sub-graphs have a reasonably good performance; (3) the number of sub-graphs is important to ensure good performance; and (4) increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs. Our studies provide practical guidance for designing effective sub-graph based graph classification methods.
PAGE: a partition aware graph computation engine BIBAFull-Text 823-828
  Yingxia Shao; Junjie Yao; Bin Cui; Lin Ma
Graph partitioning is one of the key components in parallel graph computation, and the partition quality significantly affects the overall computing performance. In existing graph computing systems, "good" partition schemes are preferred because they have a smaller edge cut ratio and hence reduce the communication cost among working nodes. However, in an empirical study on Giraph[1], we found that performance on a well-partitioned graph can be as much as two times worse than with simple partitions. The cause is that the local message processing cost in graph computing systems may surpass the communication cost in several cases.
   In this paper, we analyse the cost of parallel graph computing systems as well as the relationship between the cost and underlying graph partitioning. Based on these observations, we propose a novel Partition Aware Graph computation Engine named PAGE. PAGE is equipped with two newly designed modules, i.e., the communication module with a dual concurrent message processor, and a partition aware one to monitor the system's status. The monitored information can be utilized to dynamically adjust the concurrency of the dual concurrent message processor with a novel Dynamic Concurrency Control Model (DCCM). The DCCM applies several heuristic rules to determine the optimal concurrency for the message processor.
   We have implemented a prototype of PAGE and conducted extensive studies on a moderate-size cluster. The experimental results clearly demonstrate PAGE's robustness under different graph partition qualities and show its advantages over existing systems with up to 59% improvement.
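   The edge cut ratio that the abstract uses as its notion of partition quality can be sketched as follows. This is a minimal illustration under the usual definition (the fraction of edges whose endpoints fall in different partitions); the function name and the toy four-cycle graph are assumptions, and nothing here reflects PAGE's internals.

```python
def edge_cut_ratio(edges, partition_of):
    """Fraction of edges whose endpoints are assigned to different partitions."""
    if not edges:
        return 0.0
    cut = sum(1 for u, v in edges if partition_of[u] != partition_of[v])
    return cut / len(edges)

# Hypothetical 4-cycle split across two workers in two different ways.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
contiguous = {0: "a", 1: "a", 2: "b", 3: "b"}   # neighbours kept together
alternating = {0: "a", 1: "b", 2: "a", 3: "b"}  # every edge crosses workers

ratio_contiguous = edge_cut_ratio(edges, contiguous)
ratio_alternating = edge_cut_ratio(edges, alternating)
```

   A lower ratio means fewer messages cross workers, but as the abstract notes it also shifts more work to local message processing, which is the cost PAGE is designed to balance.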
Active exploration: simultaneous sampling and labeling for large graphs BIBAFull-Text 829-834
  Meng Fang; Jie Yin; Xingquan Zhu
Modern information networks, such as social networks, are often characterized by large sizes and dynamically changing structures. To analyze these networks, existing solutions commonly rely on graph sampling techniques to reduce network sizes, and then carry out succeeding mining processes, such as labeling network nodes to build classification models. Such a sampling-then-labeling paradigm assumes that the whole network is available for sampling and the sampled network is useful for all subsequent tasks (such as network classification). Yet real-world networks are rarely immediately available unless the sampling process progressively crawls every single node and its connections. Meanwhile, without knowing the underlying analytic objective, the sampled network can hardly produce quality results. In this paper, we propose an Active Exploration framework for large graphs where the goal is to carry out network sampling and node labeling at the same time. To achieve this goal, we consider a network as a Markov chain and compute its stationary distribution by using supervised random walks. The stationary distribution of the sampled network helps identify important nodes to be explored in the next step, and the labeling process labels the most informative node, which in turn strengthens the sampling process. The mutually and simultaneously enhanced sampling and labeling processes ensure that the final network contains a maximum number of nodes directly related to the underlying mining tasks.
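   Treating a network as a Markov chain and computing its stationary distribution can be sketched with plain power iteration. This is only an illustration: the paper uses supervised random walks, which are not reproduced here, and the transition matrix `P` below is a hypothetical three-node example.

```python
def stationary_distribution(transition, iterations=200):
    """Approximate the stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(transition)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of the chain: pi_next[j] = sum_i pi[i] * P[i][j]
        pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical 3-node network; each row sums to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = stationary_distribution(P)
# The node with the largest stationary probability is a candidate to explore next.
most_important = max(range(len(pi)), key=lambda i: pi[i])
```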
Local clustering in provenance graphs BIBAFull-Text 835-840
  Peter Macko; Daniel Margo; Margo Seltzer
Systems that capture and store data provenance, the record of how an object has arrived at its current state, accumulate historical metadata over time, forming a large graph. Local clustering in these graphs, in which we start with a seed vertex and grow a cluster around it, is of paramount importance because it supports critical provenance applications such as identifying semantically meaningful tasks in an object's history. However, generic graph clustering algorithms are not effective at these tasks. We identify three key properties of provenance graphs and exploit them to justify two new centrality metrics we developed for use in performing local clustering on provenance graphs.
Content-centric flow mining for influence analysis in social streams BIBAFull-Text 841-846
  Karthik Subbian; Charu Aggarwal; Jaideep Srivastava
The problem of discovering information flow trends and influencers in social networks has become increasingly relevant both because of the increasing amount of content available from online networks in the form of social streams, and because of its relevance as a tool for content trends analysis. An important part of this analysis is to determine the key patterns of flow and corresponding influencers in the underlying network. Almost all the work on influence analysis has focused on fixed models of the network structure, and edge-based transmission between nodes. In this paper, we propose a fully content-centered model of flow analysis in social network streams, in which the analysis is based on actual content transmissions in the network, rather than a static model of transmission on the edges. First, we introduce the problem of information flow mining in social streams, and then propose a novel algorithm InFlowMine to discover the information flow patterns in the network. We then leverage this approach to determine the key influencers in the network. Our approach is flexible, since it can also determine topic-specific influencers. We experimentally show the effectiveness and efficiency of our model.
Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks BIBAFull-Text 847-852
  Luke K. McDowell; David W. Aha
Many classification tasks involve linked nodes, such as people connected by friendship links. For such networks, accuracy might be increased by including, for each node, the (a) labels or (b) attributes of neighboring nodes as model features. Recent work has focused on option (a), because early work showed it was more accurate and because option (b) fit poorly with discriminative classifiers. We show, however, that when the network is sparsely labeled, "relational classification" based on neighbor attributes often has higher accuracy than "collective classification" based on neighbor labels. Moreover, we introduce an efficient method that enables discriminative classifiers to be used with neighbor attributes, yielding further accuracy gains. We show that these effects are consistent across a range of datasets, learning choices, and inference algorithms, and that using both neighbor attributes and labels often produces the best accuracy.

KM track: clusters, topics and similarity

Fast parameterless density-based clustering via random projections BIBAFull-Text 861-866
  Johannes Schneider; Michail Vlachos
Clustering offers significant insights in data analysis. Density based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly-shaped clusters. We present two fast density-based clustering algorithms based on random projections. Both algorithms demonstrate one to two orders of magnitude speedup compared to equivalent state-of-the-art density based techniques, even for modest-size datasets. We give a comprehensive analysis of both our algorithms and show runtime of O(dN log² N), for a d-dimensional dataset. Our first algorithm can be viewed as a fast variant of the OPTICS density-based algorithm, but using a softer definition of density combined with sampling. The second algorithm is parameter-less, and identifies areas separating clusters.
Mining entity attribute synonyms via compact clustering BIBAFull-Text 867-872
  Yanen Li; Bo-June Paul Hsu; ChengXiang Zhai; Kuansan Wang
Entity attribute values, such as "lord of the rings" for movie.title or "infant" for shoe.gender, are atomic components of entity expressions. Discovering alternative surface forms of attribute values is important for improving entity recognition and retrieval. In this work, we propose a novel compact clustering framework to jointly identify synonyms for a set of attribute values. The framework can integrate signals from multiple information sources into a similarity function between attribute values, and the weights of these signals are optimized in an unsupervised manner. Extensive experiments across multiple domains demonstrate the effectiveness of our clustering framework for mining entity attribute synonyms.
Modeling interaction features for debate side clustering BIBAFull-Text 873-878
  Minghui Qiu; Liu Yang; Jing Jiang
Online discussion forums are popular social media platforms for users to express their opinions and discuss controversial issues with each other. To automatically identify the sides/stances of posts or users from textual content in forums is an important task to help mine online opinions. To tackle the task, it is important to exploit user posts that implicitly contain support and dispute (interaction) information. The challenge we face is how to mine such interaction information from the content of posts and how to use them to help identify stances. This paper proposes a two-stage solution based on latent variable models: an interaction feature identification stage to mine interaction features from structured debate posts with known sides and reply intentions; and a clustering stage to incorporate interaction features and model the interplay between interactions and sides for debate side clustering. Empirical evaluation shows that the learned interaction features provide good insights into user interactions and that with these features our debate side model shows significant improvement over other baseline methods.
Dynamic multi-faceted topic discovery in Twitter BIBAFull-Text 879-884
  Jan Vosecky; Di Jiang; Kenneth Wai-Ting Leung; Wilfred Ng
Microblogging platforms, such as Twitter, already play an important role in cultural, social and political events around the world. Discovering high-level topics from social streams is therefore important for many downstream applications. However, traditional text mining methods that rely on the bag-of-words model are insufficient to uncover the rich semantics and temporal aspects of topics in Twitter. In particular, topics in Twitter are inherently dynamic and often focus on specific entities, such as people or organizations. In this paper, we therefore propose a method for mining multifaceted topics from Twitter streams. The proposed Multi-Faceted Topic Model (MfTM) jointly models latent semantics among terms and entities and captures the temporal characteristics of each topic. We develop an efficient online inference method for MfTM, which enables our model to be applied to large-scale and streaming data. Our experimental evaluation shows the effectiveness and efficiency of our model compared with state-of-the-art baselines. We further demonstrate the effectiveness of our framework in the context of tweet clustering.
Mining causal topics in text data: iterative topic modeling with time series feedback BIBAFull-Text 885-890
  Hyun Duk Kim; Malu Castellanos; Meichun Hsu; ChengXiang Zhai; Thomas Rietz; Daniel Diermeier
Many applications require analyzing textual topics in conjunction with external time series variables such as stock prices. We develop a novel general text mining framework for discovering such causal topics from text. Our framework naturally combines any given probabilistic topic model with time-series causal analysis to discover topics that are both coherent semantically and correlated with time series data. We iteratively refine topics, increasing the correlation of discovered topics with the time series. Time series data provides feedback at each iteration by imposing prior distributions on parameters. Experimental results show that the proposed framework is effective.
Navigating the topical structure of academic search results via the Wikipedia category network BIBAFull-Text 891-896
  Daniil Mirylenka; Andrea Passerini
Searching for scientific publications on the Web is a tedious task, especially when exploring an unfamiliar domain. Typical scholarly search engines produce lengthy unstructured result lists that are difficult to comprehend, interpret and browse. We propose a novel method of organizing the search results into concise and informative topic hierarchies. The method consists of two steps: extracting interrelated topics from the result set, and summarizing the topic graph. In the first step we map the search results to articles and categories of Wikipedia, constructing a graph of relevant topics with hierarchical relations. In the second step we sequentially build nested summaries of the produced topic graph using a structured output prediction approach. Trained on a small number of examples, our method learns to construct informative summaries for unseen topic graphs, and outperforms unsupervised state-of-the-art Wikipedia-based clustering.
A multimodal framework for unsupervised feature fusion BIBAFull-Text 897-902
  Xiaoyi Li; Jing Gao; Hui Li; Le Yang; Rohini K. Srihari
With the overwhelming amounts of visual contents on the Internet nowadays, it is very important to generate meaningful and succinct descriptions of multimedia contents including images and videos. Although human taggings and annotations can partially label some of the images or videos, it is impossible to exhaustively describe all the multimedia data due to its huge scale. Therefore, the key to this important task is to develop an effective algorithm that can automatically generate a description of an image or a frame. In this paper, we propose a multimodal feature fusion framework which can model any given image-description pair using semantically meaningful features. This framework is trained as a combination of multi-modal deep networks having two integral components: An ensemble of image descriptors and a recursive bigram encoder with fixed length output feature vector. These two components are then integrated into a joint model characterizing the correlations between images and texts. The proposed framework can not only model the unique characteristics of images or texts, but also take into account their correlations at the semantic level. Experiments on real image-text data sets show that the proposed framework is effective and efficient in indexing and retrieving semantically similar pairs, which will be very useful to help people locate interesting images or videos in large-scale databases.
Probabilistic semantic similarity measurements for noisy short texts using Wikipedia entities BIBAFull-Text 903-908
  Masumi Shirakawa; Kotaro Nakayama; Takahiro Hara; Shojiro Nishio
This paper describes a novel probabilistic method of measuring semantic similarity for real-world noisy short texts like microblog posts. Our method adds related Wikipedia entities to a short text as its semantic representation and uses the vector of entities for computing semantic similarity. Adding related entities to texts is generally a compound problem that involves the extraction of key terms, finding related entities for each key term, and the aggregation of related entities. Explicit Semantic Analysis (ESA), a popular Wikipedia-based method, solves these problems by summing the weighted vectors of related entities. However, this heuristic weighting highly depends on the rule of majority decision and is not suited to short texts that contain few key terms but many noisy terms. The proposed probabilistic method synthesizes these procedures by extending naive Bayes and achieves robust estimates of related Wikipedia entities for short texts. Experimental results on short text clustering using Twitter data indicated that our method outperformed ESA for short texts containing noisy terms.
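   The ESA-style baseline described above, which sums the weighted entity vectors of a text's terms and compares texts by their entity vectors, can be sketched as follows. The term-to-entity index and its weights are hypothetical, and this shows the summation heuristic rather than the probabilistic method the paper proposes.

```python
import math
from collections import defaultdict

def entity_vector(terms, term_to_entities):
    """Sum the weighted entity vectors of all terms (the ESA-style aggregation)."""
    vec = defaultdict(float)
    for term in terms:
        for entity, weight in term_to_entities.get(term, {}).items():
            vec[entity] += weight
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical index from terms to related Wikipedia entities with made-up weights.
term_to_entities = {
    "jaguar": {"Jaguar Cars": 0.6, "Jaguar (animal)": 0.4},
    "engine": {"Jaguar Cars": 0.5, "Engine": 0.5},
    "jungle": {"Jaguar (animal)": 0.7, "Rainforest": 0.3},
}
car_text = entity_vector(["jaguar", "engine"], term_to_entities)
animal_text = entity_vector(["jaguar", "jungle"], term_to_entities)
similarity = cosine(car_text, animal_text)
```

   Because every term contributes to the sum, a short text with one key term and several noisy terms can be dominated by the noise, which is the weakness the abstract attributes to this heuristic.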

DB track: graphs and social networks

Linear-time enumeration of maximal K-edge-connected subgraphs in large networks by random contraction BIBAFull-Text 909-918
  Takuya Akiba; Yoichi Iwata; Yuichi Yoshida
Capturing sets of closely related vertices from large networks is an essential task in many applications such as social network analysis, bioinformatics, and web link research. Decomposing a graph into k-core components is a standard and efficient method for this task, but obtained clusters might not be well-connected. The idea of using maximal k-edge-connected subgraphs was recently proposed to address this issue. Although we can obtain better clusters with this idea, the state-of-the-art method is not efficient enough to process large networks with millions of vertices.
   In this paper, we propose a new method to decompose a graph into maximal k-edge-connected components, based on random contraction of edges. Our method is simple to implement but improves performance drastically. We experimentally show that our method can successfully decompose large networks and that it is thousands of times faster than the previous method. Also, we theoretically explain why our method is efficient in practice. To see the importance of maximal k-edge-connected subgraphs, we also conduct experiments using real-world networks to show that many k-core components have small edge-connectivity and they can be decomposed into a lot of maximal k-edge-connected subgraphs.
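   The k-core decomposition that the abstract uses as its point of comparison can be sketched with the standard peeling procedure. This is not the random-contraction method of the paper; the function name and the toy graph (two triangles joined by one edge, plus a pendant vertex) are assumptions chosen to show a k-core that is not k-edge-connected.

```python
def k_core(adjacency, k):
    """Return the vertices of the k-core by repeatedly removing vertices of degree < k."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    removed = set()
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for u in adjacency[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return set(adjacency) - removed

# Hypothetical undirected graph: triangles {0,1,2} and {3,4,5}, bridge 2-3, pendant 6.
adjacency = {
    0: {1, 2, 6}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
    6: {0},
}
core2 = k_core(adjacency, 2)
core3 = k_core(adjacency, 3)
```

   Here the 2-core contains both triangles, yet removing the single edge between vertices 2 and 3 disconnects it, so the maximal 2-edge-connected subgraphs are the two triangles taken separately.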
External memory K-bisimulation reduction of big graphs BIBAFull-Text 919-928
  Yongming Luo; George H. L. Fletcher; Jan Hidders; Yuqing Wu; Paul De Bra
In this paper, we present, to our knowledge, the first known I/O efficient solutions for computing the k-bisimulation partition of a massive directed graph, and performing maintenance of such a partition upon updates to the underlying graph. Ubiquitous in the theory and application of graph data, bisimulation is a robust notion of node equivalence which intuitively groups together nodes in a graph which share fundamental structural features. k-bisimulation is the standard variant of bisimulation where the topological features of nodes are only considered within a local neighborhood of radius k > 0.
   The I/O cost of our partition construction algorithm is bounded by O(k · sort(|Et|) + k · scan(|Nt|) + sort(|Nt|)), while our maintenance algorithms are bounded by O(k · sort(|Et|) + k · scan(|Nt|)). The space complexity bounds are O(|Nt|+|Et|) and O(k · |Nt|+k · |Et|), resp. Here, |Et| and |Nt| are the number of disk pages occupied by the input graph's edge set and node set, resp., and sort(n) and scan(n) are the cost of sorting and scanning, resp., a file occupying n pages in external memory. Empirical analysis on a variety of massive real-world and synthetic graph datasets shows that our algorithms perform efficiently in practice, scaling gracefully as graphs grow in size.
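   To make the notion of a k-bisimulation partition concrete, here is a small in-memory sketch that refines node blocks for k rounds using signatures built from each node's previous block and the blocks of its successors. It is not the I/O-efficient external-memory algorithm of the paper; the function names, the edge-triple format and the toy graph are assumptions for illustration.

```python
def _relabel(signature):
    """Map each distinct signature to a small integer block id."""
    ids = {}
    return {node: ids.setdefault(sig, len(ids)) for node, sig in signature.items()}

def k_bisimulation(nodes, edges, label, k):
    """Return {node: block id}; edges are (source, edge_label, target) triples."""
    out = {n: [] for n in nodes}
    for source, edge_label, target in edges:
        out[source].append((edge_label, target))
    # Round 0: nodes are equivalent iff they carry the same node label.
    block = _relabel({n: label[n] for n in nodes})
    for _ in range(k):
        # Signature: previous block of the node plus the set of
        # (edge label, previous block of the target) over outgoing edges.
        signature = {
            n: (block[n], frozenset((l, block[t]) for l, t in out[n]))
            for n in nodes
        }
        block = _relabel(signature)
    return block

# Hypothetical graph: a and b look alike within radius 1 but differ at radius 2.
nodes = ["a", "b", "c", "d", "x"]
label = {"a": "P", "b": "P", "c": "Q", "d": "Q", "x": "R"}
edges = [("a", "e", "c"), ("b", "e", "d"), ("c", "e", "x")]
blocks_k1 = k_bisimulation(nodes, edges, label, 1)
blocks_k2 = k_bisimulation(nodes, edges, label, 2)
```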
Querying graphs with preferences BIBAFull-Text 929-938
  Valeria Fionda; Giuseppe Pirro'
This paper presents GuLP, a graph query language that enables preferences to be expressed declaratively. Preferences make it possible to order the answers to a query and can be stated in terms of node/edge attributes and complex paths. We present the formal syntax and semantics of GuLP and a polynomial time algorithm for evaluating GuLP expressions. We describe an implementation of GuLP in the GuLP-it system, which is available for download. We evaluate the GuLP-it system on real-world and synthetic data.
Network-aware search in social tagging applications: instance optimality versus efficiency BIBAFull-Text 939-948
  Silviu Maniu; Bogdan Cautis
We consider in this paper top-k query answering in social applications, with a focus on social tagging. This problem requires a significant departure from socially agnostic techniques. In a network-aware context, one can (and should) exploit the social links, which can indicate how users relate to the seeker and how much weight their tagging actions should have in the result build-up. We propose algorithms that have the potential to scale to current applications. While the problem has already been considered in previous literature, this was done either under strong simplifying assumptions or under choices that cannot scale to even moderate-size real-world applications. We first revisit a key aspect of the problem, which is accessing the closest or most relevant users for a given seeker. We describe how this can be done on the fly (without any pre-computations) for several possible choices -- arguably the most natural ones -- of proximity computation in a user network. Based on this, our top-k algorithm is sound and complete, addressing the applicability issues of the existing ones. Moreover, it performs significantly better in general and is instance optimal in the case when the search relies exclusively on the social weight of tagging actions.
   To further address the efficiency needs of online applications, for which the exact search, albeit optimal, may still be expensive, we then consider approximate algorithms. Specifically, these rely on concise statistics about the social network or on approximate shortest-paths computations. Extensive experiments on real-world data from Twitter show that our techniques can drastically improve response time, without sacrificing precision.
A comparison of two physical data designs for interactive social networking actions BIBAFull-Text 949-958
  Sumita Barahmand; Shahram Ghandeharizadeh; Jason Yap
This paper compares the performance of an SQL solution that implements a relational data model with a document store named MongoDB. We report on the performance of a single node configuration of each data store and assume the database is small enough to fit in main memory. We analyze utilization of the CPU cores and the network bandwidth to compare the two data stores. Our key findings are as follows. First, for those social networking actions that read and write a small amount of data, the join operator of the SQL solution is not slower than the JSON representation of MongoDB. Second, with a mix of actions, the SQL solution provides either the same performance as MongoDB or outperforms it by 20%. Third, a middle-tier cache enhances the performance of both data stores as query result look up is significantly faster than query processing with either system.

IR track: data classification

Community question topic categorization via hierarchical kernelized classification BIBAFull-Text 959-968
  Wen Chan; Weidong Yang; Jinhui Tang; Jintao Du; Xiangdong Zhou; Wei Wang
We present a hierarchical kernelized classification model for the automatic classification of general questions into their corresponding topic categories in community Question Answering (cQA) services. This could save much manual classification effort and facilitate browsing as well as better retrieval of questions from the cQA archives. To deal with the challenge posed by the short text of questions, we explore and optimally combine various cQA features by introducing a multiple kernel learning strategy into the hierarchical classification framework. We propose a hybrid regularization approach that combines an orthogonal constraint and L1 sparseness in our framework to promote discriminative power on similar topics as well as to sparsify the model parameters. The experimental results on a real world dataset from Yahoo! Answers demonstrate the effectiveness of our proposed model as compared to the state-of-the-art methods and strong baselines.
Building structures from classifiers for passage reranking BIBAFull-Text 969-978
  Aliaksei Severyn; Massimo Nicosia; Alessandro Moschitti
This paper shows that learning to rank models can be applied to automatically learn complex patterns, such as relational semantic structures occurring in questions and their answer passages. This is achieved by providing the learning algorithm with a tree representation derived from the syntactic trees of questions and passages connected by relational tags, where the latter are again provided by the means of automatic classifiers, i.e., question and focus classifiers and Named Entity Recognizers. This way effective structural relational patterns are implicitly encoded in the representation and can be automatically utilized by powerful machine learning models such as kernel methods.
   We conduct an extensive experimental evaluation of our models on well-known benchmarks from the question answer (QA) track of TREC challenges. The comparison with state-of-the-art systems and BM25 shows a relative improvement in MAP of more than 14% and 45%, respectively. Further comparison on the task restricted to the answer sentence reranking shows an improvement in MAP of more than 8% over the state of the art.
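   For readers unfamiliar with the metric, the sketch below shows how MAP and a relative improvement in MAP are typically computed. It assumes every relevant passage appears somewhere in the ranked list, and the rankings and scores are invented for illustration; they are not results from the paper.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list given as booleans (True = relevant).

    Assumes all relevant items appear in the list."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def relative_improvement(baseline, system):
    """Relative gain of `system` over `baseline`, e.g. 0.14 for a 14% improvement."""
    return (system - baseline) / baseline

# Hypothetical rankings for two queries.
rankings = [[True, False, True], [False, True]]
map_score = mean_average_precision(rankings)
gain = relative_improvement(0.50, 0.57)
```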
Uncovering collusive spammers in Chinese review websites BIBAFull-Text 979-988
  Chang Xu; Jie Zhang; Kuiyu Chang; Chong Long
With the rapid development of China's e-commerce in recent years and the underlying evolution of adversarial spamming tactics, more sophisticated spamming activities may be carried out on Chinese review websites. Empirical analysis of recently crawled product reviews from a popular Chinese e-commerce website reveals the failure of many state-of-the-art spam indicators to detect collusive spammers. Two novel methods are then proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects the k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations. Experimental results show that both our methods outperform the indicator-only classifiers in various settings.
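   The idea of selecting the k most similar reviewers and letting them vote can be sketched as below. The Jaccard similarity over shared groups is only a stand-in assumption for the paper's group-level relational information, and all reviewer names, groups and labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_vote(target, labelled, groups, k):
    """Classify `target` as spammer (True) by majority vote of its k most similar labelled reviewers."""
    ranked = sorted(
        labelled,
        key=lambda reviewer: jaccard(groups[target], groups[reviewer]),
        reverse=True,
    )
    nearest = ranked[:k]
    spam_votes = sum(1 for reviewer in nearest if labelled[reviewer])
    return spam_votes * 2 > len(nearest)

# Hypothetical data: which groups each reviewer belongs to, and known labels.
groups = {
    "r1": {"g1", "g2"}, "r2": {"g1", "g2", "g3"}, "r3": {"g2"},
    "r4": {"g9"}, "r5": {"g8", "g9"},
    "new_a": {"g1", "g2"}, "new_b": {"g9"},
}
labelled = {"r1": True, "r2": True, "r3": False, "r4": False, "r5": False}
is_spammer_a = knn_vote("new_a", labelled, groups, k=3)
is_spammer_b = knn_vote("new_b", labelled, groups, k=3)
```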
Towards minimizing the annotation cost of certified text classification BIBAFull-Text 989-998
  Mossaab Bagdouri; William Webber; David D. Lewis; Douglas W. Oard
The common practice of testing a sequence of text classifiers learned on a growing training set, and stopping when a target value of estimated effectiveness is first met, introduces a sequential testing bias. In settings where the effectiveness of a text classifier must be certified (perhaps to a court of law), this bias may be unacceptable. The choice of when to stop training is made even more complex when, as is common, the annotation of training and test data must be paid for from a common budget: each new labeled training example is a lost test example. Drawing on ideas from statistical power analysis, we present a framework for joint minimization of training and test annotation that maintains the statistical validity of effectiveness estimates, and yields a natural definition of an optimal allocation of annotations to training and test data. We identify the development of allocation policies that can approximate this optimum as a central question for research. We then develop simulation-based power analysis methods for van Rijsbergen's F-measure, and incorporate them in four baseline allocation policies which we study empirically. In support of our studies, we develop a new analytic approximation of confidence intervals for the F-measure that is of independent interest.
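   van Rijsbergen's F-measure, the effectiveness measure this abstract targets, combines precision and recall as F_beta = (1 + beta^2) · P · R / (beta^2 · P + R). The sketch below computes it from confusion counts; the counts in the example are hypothetical and the code does not reproduce the paper's power analysis or confidence intervals.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """van Rijsbergen's F-measure from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical test sample: 8 true positives, 2 false positives, 8 false negatives.
f1 = f_measure(tp=8, fp=2, fn=8)            # precision 0.8, recall 0.5
f2 = f_measure(tp=8, fp=2, fn=8, beta=2.0)  # beta > 1 weights recall more heavily
```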
A heterogenous automatic feedback semi-supervised method for image reranking BIBAFull-Text 999-1008
  Xin-Chao Xu; Xin-Shun Xu; Yafang Wang; Xiaolin Wang
Image reranking, which aims at enhancing the quality of keyword-based image search with the help of image features, has recently attracted attention in the image search community. A major challenge in this task is that an image's visual features do not always reflect its semantic meaning well. Thus, reranking methods that depend only on visual features cannot guarantee good results. In addition, it is well known that the visual features of an image have strong/weak correlations with its surrounding text. Thus, it is expected that a model considering both visual features and surrounding text can perform better than those considering only visual features. Motivated by this, in this paper, we propose HAFSRerank, a Heterogenous Automatic Feedback Semi-supervised Reranking method which makes use of both visual and textual features simultaneously during reranking. Specifically, in HAFSRerank, a multigraph is first constructed in which each node representing an image includes visual and textual features, and the parallel edges between them are weighted by intra-modal similarity and inter-modal similarity. A heterogenous complete graph is further derived from the multigraph. Then, an automatic feedback graph-based semi-supervised learning method is proposed to propagate the reranking scores on the complete graph, which can make use of the inter-modal similarity to update the weights of the heterogenous graph automatically. Finally, the result of the semi-supervised learning is used to rerank the images. The experimental results show that HAFSRerank is superior to or highly competitive with some state-of-the-art graph-based reranking methods. Moreover, the proposed reranking algorithm can be well interpreted by Bayesian theory, and does not require complex search models for special queries or any additional input from users.

KM track: networks

Accurate and scalable nearest neighbors in large networks based on effective importance BIBAFull-Text 1009-1018
  Petko Bogdanov; Ambuj Singh
Nearest neighbor proximity search in large graphs is an important analysis primitive with a variety of applications in graph data from different domains. We propose a novel proximity measure for weighted graphs called Effective Importance which incorporates multiple paths between nodes and captures the inherent structural clusters within a network. We develop effective bounds on the EI value using a modified small subnetwork around a query node, enabling scalable exact nearest neighbor (NN) search at query time. Our NN search does not require heavy offline analysis or holistic knowledge of the graph, making our method suitable for very large dynamically changing networks or composite network overlays.
   We employ our NN search algorithm on social, information and biological networks and demonstrate the effectiveness and scalability of the approach. For million-node networks, our method retrieves the exact top 20 neighbors using less than 0.2% of the network edges in a fraction of a second on a conventional desktop machine. We also evaluate the effectiveness of our proximity measure and NN search for three applications, namely (i) finding good local clusters, (ii) network sparsification and (iii) prediction of node attributes in information networks. The EI measure and NN search method outperform recent counterparts from the literature in all applications.
Spatial-temporal query homogeneity for KNN object search on road networks BIBAFull-Text 1019-1028
  Ying-Ju Chen; Kun-Ta Chuang; Ming-Syan Chen
In this paper, we explore a new research paradigm, called query homogeneity, to process KNN queries on road networks for online LBS applications. While previous works in the literature concentrate on improving query processing time, we examine the response time of a user query, which additionally includes the waiting time in the queue. Response time corresponds more precisely to the user experience in an online service, and an unacceptable response time is likely to turn away disgruntled users. Surprisingly, we show in this paper that the response time is dominated by the waiting time, an issue left unexplored thus far. Since previous works all perform queries one by one, which leads to unexpectedly long waiting times, we propose a novel query framework, called SHI, that aims to diminish the waiting time with a new group-by-group solution. SHI relies on the natural phenomenon of query homogeneity, which refers to the tendency of queries to be issued with spatial and temporal correlation. Motivated by this natural behavior, operations of query processing and queue processing are incorporated in the SHI framework. During the network expansion for a query, a group of homogeneous queries in the waiting queue, which have results identical to the query being processed, is picked up and flushed out together when the query processing is accomplished, achieving group-by-group query processing and reducing the waiting time significantly.
Discovering influential authors in heterogeneous academic networks by a co-ranking method BIBAFull-Text 1029-1036
  Qinxue Meng; Paul J. Kennedy
Research in ranking networked entities is widely applicable to many problems such as optimizing search engines, building recommendation systems and discovering influential nodes in social networks. However, many famous ranking approaches like PageRank are limited to solving this problem in homogeneous networks and are not applicable to heterogeneous networks. Faced with this problem, we propose a co-ranking method to evaluate scientific publications and authors. This novel approach is a flexible framework based on a set of customized rules taking into account both topological features of networks and the included citations. The approach ranks authors and publications iteratively and uses the results of each round to reinforce the ranks of authors and publications. Unlike traditional approaches to assessing publication, which require a great number of citations, our method lowers this requirement. This co-ranking approach has been validated using data collected from DBLP and CiteSeer, and the results suggest that it is effective and efficient in ranking authors and publications based on limited numbers of citations in heterogeneous networks and that it has fast convergence.
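The iterative mutual reinforcement described in this abstract can be illustrated with a minimal sketch. It is not the authors' rule set: the function name `co_rank`, the `damping` parameter, and the two update rules (authors inherit score from their papers; papers combine citation score with average author score) are assumptions chosen only to show the general shape of a co-ranking loop.

```python
def co_rank(authorship, citations, rounds=50, damping=0.85):
    """Hypothetical co-ranking loop over a paper/author network.

    authorship: dict mapping paper -> list of its authors
    citations:  dict mapping citing paper -> list of cited papers
    Returns (author_score, paper_score), each normalized to sum to 1.
    """
    papers = list(authorship)
    authors = sorted({a for names in authorship.values() for a in names})
    paper_score = {p: 1.0 / len(papers) for p in papers}
    author_score = {a: 1.0 / len(authors) for a in authors}

    for _ in range(rounds):
        # Authors: each paper splits its current score evenly among its authors.
        new_author = {a: 0.0 for a in authors}
        for p, names in authorship.items():
            for a in names:
                new_author[a] += paper_score[p] / len(names)

        # Papers: score flowing in along citations (assumed rule).
        new_paper = {p: 0.0 for p in papers}
        for p, cited in citations.items():
            for q in cited:
                if q in new_paper:
                    new_paper[q] += paper_score[p] / len(cited)

        # Papers: blend citation score with the average score of the authors.
        for p, names in authorship.items():
            author_part = sum(new_author[a] for a in names) / len(names)
            new_paper[p] = damping * new_paper[p] + (1 - damping) * author_part

        # Normalize both score vectors so they remain comparable across rounds.
        total_p = sum(new_paper.values())
        total_a = sum(new_author.values())
        paper_score = {p: s / total_p for p, s in new_paper.items()}
        author_score = {a: s / total_a for a, s in new_author.items()}

    return author_score, paper_score


authorship = {"p1": ["alice"], "p2": ["bob"], "p3": ["bob", "carol"]}
citations = {"p2": ["p1"], "p3": ["p1"]}
author_score, paper_score = co_rank(authorship, citations)
```

Because the author term keeps every paper's score positive, a paper with few or no citations still receives a rank, which loosely mirrors the abstract's point about working with limited citation counts.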
Entity disambiguation in anonymized graphs using graph kernels BIBAFull-Text 1037-1046
  Linus Hermansson; Tommi Kerola; Fredrik Johansson; Vinay Jethava; Devdatt Dubhashi
This paper presents a novel method for entity disambiguation in anonymized graphs using local neighborhood structure. Most existing approaches leverage node information, which might not be available in several contexts due to privacy concerns, or information about the sources of the data. We consider this problem in the supervised setting where we are provided only with a base graph and a set of nodes labelled as ambiguous or unambiguous. We characterize the similarity between two nodes based on their local neighborhood structure using graph kernels, and solve the resulting classification task using SVMs. We give empirical evidence on two real-world datasets, comparing our approach to a state-of-the-art method and highlighting the advantages of our approach. We show that, using less information, our method is significantly better in terms of either speed or accuracy or both. We also present extensions of two existing graph kernels, namely, the direct product kernel and the shortest-path kernel, with significant improvements in accuracy. For the direct product kernel, our extension also provides significant computational benefits. Moreover, we design and implement the algorithms of our method to work in a distributed fashion using the GraphLab framework, ensuring high scalability.
Estimating the relative utility of networks for predicting user activities BIBAFull-Text 1047-1056
  Nina Mishra; Daniel M. Romero; Panayiotis Tsaparas
Link structure in online networks carries varying semantics. For example, Facebook links carry social semantics while LinkedIn links carry professional semantics. It has been shown that online networks are useful for predicting users' future activities. In this paper, we introduce a new related problem: given a collection of networks, how can we determine the relative importance of each network for predicting user activities? We propose a framework that allows us to quantify the relative predictive value of each network in a setting where multiple networks are available. We give an ε-net algorithm to solve the problem and prove that it finds a solution that is arbitrarily close to the optimal solution. Experimentally, we focus our study on the prediction of ad clicks, where it is already known that a single social network improves prediction. The networks we study are implicit affiliations networks, which are based on users' browsing history rather than declared relationships between the users. We create two networks based on covisitation to pages in the Facebook domain and Wikipedia domain. The learned relative weighting of these networks demonstrates covisitation networks are indeed useful for prediction, but that no single network is predictive of all kinds of ads. Rather, each category of ads calls for a significantly different weighting of these networks.

KM track: mining reviews and Wiki

Exploring weakly supervised latent sentiment explanations for aspect-level review analysis BIBAFull-Text 1057-1066
  Lei Fang; Minlie Huang; Xiaoyan Zhu
In sentiment analysis, aspect-level review analysis has been an important task because it can catalogue, aggregate, or summarize various opinions according to a product's properties. In this paper, we explore a new concept for aspect-level review analysis, latent sentiment explanations, which are defined as a set of informative aspect-specific sentences whose polarities are consistent with that of the review. In other words, sentiment explanations best represent a review in terms of both aspect and polarity. We formulate the problem as a structure learning problem, and sentiment explanations are modeled with latent variables. Training samples are automatically identified through a set of pre-defined aspect signature terms (i.e., without manual annotation on samples), an approach we term weakly supervised.
   Our major contributions are twofold: first, we formalize the use of aspect signature terms as weak supervision in a structural learning framework, which remarkably promotes aspect-level analysis; second, the performance of aspect analysis and that of document-level sentiment classification are mutually enhanced through joint modeling. The proposed method is evaluated on restaurant and hotel reviews, and experimental results demonstrate promising performance in both document-level and aspect-level sentiment analysis.
Using micro-reviews to select an efficient set of reviews BIBAFull-Text 1067-1076
  Thanh-Son Nguyen; Hady W. Lauw; Panayiotis Tsaparas
Online reviews are an invaluable resource for web users trying to make decisions regarding products or services. However, the abundance of review content, as well as the unstructured, lengthy, and verbose nature of reviews make it hard for users to locate the appropriate reviews, and distill the useful information. With the recent growth of social networking and micro-blogging services, we observe the emergence of a new type of online review content, consisting of bite-sized, 140 character-long reviews often posted reactively on the spot via mobile devices. These micro-reviews are short, concise, and focused, nicely complementing the lengthy, elaborate, and verbose nature of full-text reviews.
   We propose a novel methodology that brings together these two diverse types of review content, to obtain something that is more than the sum of its parts. We use micro-reviews as a crowdsourced way to extract the salient aspects of the reviewed item, and propose a new formulation of the review selection problem that aims to find a small set of reviews that efficiently cover the micro-reviews. Our approach consists of a two-step process: matching review sentences to micro-reviews and then selecting reviews such that we cover as many micro-reviews as possible, with few sentences. We perform a detailed evaluation of all the steps of our methodology using data collected from Foursquare and Yelp.
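The second step of the process described above (selecting reviews that cover as many micro-reviews as possible) resembles a maximum-coverage problem. The following is a minimal sketch using the standard greedy heuristic; it is not necessarily the authors' formulation, it ignores the sentence-count efficiency term, and the names `select_reviews` and `coverage` are hypothetical. It assumes the matching step has already produced, for each review, the set of micro-reviews it covers.

```python
def select_reviews(coverage, k):
    """Greedily pick up to k reviews that cover the most micro-reviews.

    coverage: dict mapping review id -> set of micro-review ids it matches
              (assumed output of a separate sentence-matching step).
    Returns (selected review ids in pick order, set of covered micro-reviews).
    """
    selected, covered = [], set()
    remaining = dict(coverage)
    while remaining and len(selected) < k:
        # Choose the review that adds the most not-yet-covered micro-reviews.
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # No remaining review adds anything new.
        selected.append(best)
        covered |= gain
        del remaining[best]
    return selected, covered


coverage = {
    "r1": {"t1", "t2", "t3"},
    "r2": {"t3", "t4"},
    "r3": {"t1", "t2"},
}
selected, covered = select_reviews(coverage, k=2)
```

In this toy input, `r3` is never chosen because everything it covers is already covered by `r1`, which is the redundancy that a coverage-based selection is meant to avoid.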
Automatic construction of domain and aspect specific sentiment lexicons for customer review mining BIBAFull-Text 1077-1086
  Juergen Bross; Heiko Ehrig
Automatically analyzing the opinions expressed in customer reviews is of high relevance in many application scenarios, e.g., market research, trend analysis, or reputation management. A great share of current sentiment analysis approaches make use of special purpose lexicons that provide information about the polarity (e.g., positive or negative) of individual words and phrases. One major challenge is that the actual sentiment polarity of a specific expression is often context dependent (e.g., "long+ battery life" vs. "long- flash recycle time"). However, the vast majority of existing approaches focus on creating general purpose lexicons. Especially in the context of mining customer review data, the use of such lexicons is suboptimal as they fail to adequately reflect the domain specific lexical usage. We propose a novel method that automatically adapts and extends existing lexicons to a specific product domain. We follow a corpus-based approach and exploit the fact that many customer reviews exhibit some form of semi-structure. The method is fully automatic and thus scales well across different product domains. Our experiments show that the extracted lexicons are highly accurate and significantly improve the performance in a sentiment classification scenario.
Wikification via link co-occurrence BIBAFull-Text 1087-1096
  Zhiyuan Cai; Kaiqi Zhao; Kenny Q. Zhu; Haixun Wang
Wikification, the process of linking terms in a plain text document to the Wikipedia articles that represent the correct meanings of the terms, can be thought of as a generalized Word Sense Disambiguation problem. It disambiguates multi-word expressions (MWEs) in addition to single words. Existing Wikification techniques either model the context of a given term as well as the Wikipedia article as bags of words, or compute global constraints among Wikipedia concepts by the link graph or link distributions. The first method doesn't achieve good results because MWEs can have very different meanings than their constituent words, which are themselves ambiguous. The second method doesn't produce high accuracy because the link structure or link distribution is often biased or incomplete, since Wikipedia pages are often sparsely linked. In this paper, we present a simple but powerful framework of sense disambiguation using co-occurrences of Wikipedia links in the Wikipedia corpus. We propose an iterative method to enrich the sparsely-linked articles by adding more links and then use the resulting link co-occurrence matrix to disambiguate an input document by a sliding window algorithm. Our prototype system achieves 89.97% precision and 76.43% recall on average for three benchmark data sets and compares favorably against four state-of-the-art wikification techniques.
Manipulation among the arbiters of collective intelligence: how wikipedia administrators mold public opinion BIBAFull-Text 1097-1106
  Sanmay Das; Allen Lavoie; Malik Magdon-Ismail
Our reliance on networked, collectively built information is a vulnerability when the quality or reliability of this information is poor. Wikipedia, one such collectively built information source, is often our first stop for information on all kinds of topics; its quality has stood up to many tests, and it prides itself on having a "Neutral Point of View". Enforcement of neutrality is in the hands of comparatively few, powerful administrators. We find a surprisingly large number of editors who change their behavior and begin focusing more on a particular controversial topic once they are promoted to administrator status. The conscious and unconscious biases of these few, but powerful, administrators may be shaping the information on many of the most sensitive topics on Wikipedia; some may even be explicitly infiltrating the ranks of administrators in order to promote their own points of view. Neither prior history nor vote counts during an administrator's election can identify those editors most likely to change their behavior in this suspicious manner. We find that an alternative measure, which gives more weight to influential voters, can successfully reject these suspicious candidates. This has important implications for how we harness collective intelligence: even if wisdom exists in a collective opinion (like a vote), that signal can be lost unless we carefully distinguish the true expert voter from the noisy or manipulative voter.

IR track: applications I

Robust question answering over the web of linked data BIBAFull-Text 1107-1116
  Mohamed Yahya; Klaus Berberich; Shady Elbassuoni; Gerhard Weikum
Knowledge bases and the Web of Linked Data have become important assets for search, recommendation, and analytics. Natural-language questions are a user-friendly mode of tapping this wealth of knowledge and data. However, question answering technology does not work robustly in this setting as questions have to be translated into structured queries and users have to be careful in phrasing their questions. This paper advocates a new approach that allows questions to be partially translated into relaxed queries, covering the essential but not necessarily all aspects of the user's input. To compensate for the omissions, we exploit textual sources associated with entities and relational facts. Our system translates user questions into an extended form of structured SPARQL queries, with text predicates attached to triple patterns. Our solution is based on a novel optimization model, cast into an integer linear program, for joint decomposition and disambiguation of the user question. We demonstrate the quality of our methods through experiments with the QALD benchmark.
Expertise retrieval in bibliographic network: a topic dominance learning approach BIBAFull-Text 1117-1126
  Seyyed Hadi Hashemi; Mahmood Neshati; Hamid Beigy
Expert finding in bibliographic networks has received increased interest in recent years. This task is concerned with finding relevant researchers for a given topic. Motivated by the observation that rarely do all coauthors contribute to a paper equally, in this paper we propose a discriminative method to identify the leading authors contributing to a scientific publication. Specifically, we cast the problem of expert finding in a bibliographic network as finding leading experts in a research group, which is easier to solve. Based on our observations, we identify three feature groups that can discriminate relevant and irrelevant experts. Experimental results on a real dataset and on an automatically generated one gathered from Microsoft Academic Search show that the proposed model significantly improves the performance of expert finding in terms of all common Information Retrieval evaluation metrics.
Instant foodie: predicting expert ratings from grassroots BIBAFull-Text 1127-1136
  Chenhao Tan; Ed H. Chi; David Huffaker; Gueorgi Kossinets; Alexander J. Smola
Consumer review sites and recommender systems typically rely on a large volume of user-contributed ratings, which makes rating acquisition an essential component in the design of such systems. User ratings are then summarized to provide an aggregate score representing a popular evaluation of an item. An inherent problem in such summarization is potential bias due to raters' self-selection and heterogeneity in terms of experience, tastes and rating scale interpretation. There are two major approaches to collecting ratings, which have different advantages and disadvantages. One is to allow a large number of volunteers to choose and rate items directly (a method employed by e.g. Yelp and Google Places). Alternatively, a panel of raters may be maintained and invited to rate a predefined set of items at regular intervals (such as in Zagat Survey). The latter approach arguably results in more consistent reviews and reduced selection bias, but at the expense of much smaller coverage (fewer rated items).
   In this paper, we examine the two different approaches to collecting user ratings of restaurants and explore the question of whether it is possible to reconcile them. Specifically, we study the problem of inferring the more calibrated Zagat Survey ratings (which we dub 'expert ratings') from the user-generated ratings ('grassroots') in Google Places. To that effect, we employ latent factor models and provide a probabilistic treatment of the ordinal rankings. We can predict Zagat Survey ratings accurately from ad hoc user-generated ratings by joint optimization on two datasets. We analyze the resulting model, and find that users become more discerning as they submit more ratings. We also describe an approach towards cross-city recommendations, answering questions such as 'What is the equivalent of the Per Se restaurant in Chicago'?
On segmentation of eCommerce queries BIBAFull-Text 1137-1146
  Nish Parikh; Prasad Sriram; Mohammad Al Hasan
In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log, which we call buyers' data, and frequency data from product titles, which we call sellers' data. We exploit the taxonomical structure of the marketplace to build domain specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation. We also perform a large scale evaluation using an unsupervised IR metric which we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.
Scientific articles recommendation BIBAFull-Text 1147-1156
  Yingming Li; Ming Yang; Zhongfei (Mark) Zhang
We study the problem of recommending scientific articles to users in an online community and present a novel matrix factorization model, the topic regression Matrix Factorization (tr-MF), to solve the problem. The main idea of tr-MF lies in extending matrix factorization with probabilistic topic modeling. Instead of regularizing item factors through probabilistic topic modeling as in the framework of the CTR model, tr-MF introduces a regression model to regularize user factors through probabilistic topic modeling, under the basic hypothesis that users share similar preferences if they rate similar sets of items. Consequently, tr-MF provides interpretable latent factors for users and items, and makes accurate predictions for community users. Specifically, it is effective in making predictions for users with only a few ratings or even no ratings, and supports tasks that are specific to a certain field, neither of which is addressed in the existing literature. Further, we demonstrate the efficacy of tr-MF on a large subset of the data from CiteULike, a bibliography sharing service. The proposed model outperforms the state-of-the-art matrix factorization models by a significant margin.

Poster session: DB+IR track

MRPacker: an SQL to mapreduce optimizer BIBAFull-Text 1157-1160
  Xuelian Lin; Yue Ye; Shuai Ma
There have recently been quite a few works on optimizing MapReduce execution plans, which either optimize the join operators or apply a set of translation rules to reduce the number of MapReduce jobs in an execution plan. However, none of these works has taken into account how MapReduce jobs are generated and combined. To further improve the efficiency of MapReduce execution plans, we incorporate into our optimization approach the way MapReduce jobs are generated and combined. In this paper, we propose MRPacker, a novel SQL-to-MapReduce optimizer that (a) uses a set of transformation rules to reduce the number of MapReduce jobs, and (b) merges MapReduce jobs in a more reasonable way. We experimentally demonstrate the effectiveness and efficiency of MRPacker using the TPC-H benchmark.
A hybrid approach for privacy-preserving processing of knn queries in mobile database systems BIBAFull-Text 1161-1164
  Shixin Tian; Ying Cai; Qinghua Zheng
In mobile object database systems, both query issuers and queried objects are subject to location privacy intrusion. One solution to this problem is to have users reduce their location resolution when making location updates. Such location cloaking allows mobile objects to achieve a desired level of protection, but may not produce accurate query results. Alternatively, one can apply cryptographic techniques such as secure multiparty computation to compute the spatial relationship among mobile objects without requiring them to disclose their location at all. This strategy produces high quality query results, but is in general computation-intensive, especially when a large number of mobile objects are involved. In this paper, we present a hybrid approach that mitigates the above dilemma. Our idea is to compute approximate query results based on cloaked location information and then refine query results by applying homomorphic encryption. We demonstrate that this approach can be used for efficient and privacy-preserving processing of KNN queries and evaluate its performance through simulation.
Flexible and extensible generation and corruption of personal data BIBAFull-Text 1165-1168
  Peter Christen; Dinusha Vatsalan
With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.
An efficient and robust privacy protection technique for massive streaming choice-based information BIBAFull-Text 1169-1172
  Ji Zhang; Xuemei Liu; Yonglong Luo
Protecting users' privacy when transmitting a large amount of data over the Internet is becoming increasingly important. In this paper, we focus on streaming choice-based information and propose a novel anonymization technique that provides strong privacy protection to safeguard against privacy disclosure and information tampering. Our technique utilizes an innovative two-phase encoding-and-decoding approach which is very easy to implement, highly efficient in terms of speed and communication, and robust against possible tampering from adversaries. The experimental evaluation demonstrates the promising performance of our technique.
RCached-tree: an index structure for efficiently answering popular queries BIBAFull-Text 1173-1176
  Manash Pal; Arnab Bhattacharya; Debjyoti Paul
In many applications of similarity searching in databases, a set of similar queries appear more frequently. Since it is rare that a query point with its associated parameters (range or number of nearest neighbors) will repeat exactly, intelligent caching mechanisms are required to efficiently answer such queries. In addition, the performance of non-repeating and non-cached queries should not suffer too much either. In this paper, we propose RCached-tree, belonging to the family of R-trees, that aims to solve this problem. In every internal node of the tree up to a certain level, a portion of the space is reserved for storing popular queries and their solutions. For a new query that is encompassed by a cached query, this enables bypassing the traversal of lower levels of the subtree corresponding to the node as the answers can be obtained directly from the result set of the cached query. The structure adapts itself to varying query patterns; new popular queries replace the old cached ones that are not popular any more. Queries that are not popular as well as insertions, deletions and updates are handled in the same manner as in a general R-tree. Experiments show that the RCached-tree can outperform R-tree and other such structures by a significant margin when the proportion of popular queries is 20% or more by reserving 30-40% of the internal nodes as cache.
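The idea of answering a new query from a cached query that encompasses it can be sketched for Euclidean range queries. This is a minimal illustration, not the RCached-tree structure itself: the function names and the tuple layout of a cached entry are assumptions, and the tree traversal and cache replacement policy are omitted. A range query (center, radius) lies inside a cached range query when the distance between the centers plus the new radius does not exceed the cached radius; in that case the answer is a filter over the cached result set.

```python
import math


def encloses(cached_center, cached_radius, center, radius):
    """True if the ball (center, radius) lies entirely inside the cached ball."""
    return math.dist(cached_center, center) + radius <= cached_radius


def answer_from_cache(cached, center, radius):
    """Answer a range query from one cached entry, or return None on a miss.

    cached: (center, radius, results) where results is the list of points
            returned by the cached range query (hypothetical layout).
    """
    c_center, c_radius, c_results = cached
    if not encloses(c_center, c_radius, center, radius):
        return None  # Fall back to the normal index traversal.
    # Every answer to the new query is already in the cached result set.
    return [p for p in c_results if math.dist(p, center) <= radius]


cached = ((0.0, 0.0), 5.0, [(1.0, 1.0), (3.0, 0.0), (0.0, 4.0)])
hit = answer_from_cache(cached, (1.0, 0.0), 2.0)
miss = answer_from_cache(cached, (4.0, 0.0), 2.0)
```

On a hit, no subtree below the caching node needs to be visited, which is the saving the abstract describes.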
Label constrained shortest path estimation BIBAFull-Text 1177-1180
  Ankita Likhyani; Srikanta Bedathur
Shortest path querying is a fundamental graph problem which is computationally quite challenging when operating over massive scale graphs. Recent results have addressed the problem of computing either exact or good approximate shortest path distances efficiently. Some of these techniques also return the path corresponding to the estimated shortest path distance fast.
   However, none of these techniques work very well when we have additional constraints on the labels associated with the edges that constitute the path. In this paper, we develop the SkIt index structure, which supports a wide range of label constraints on paths, and returns an accurate estimation of the shortest path that satisfies the constraints. We conduct experiments over graphs such as social networks and knowledge graphs that contain millions of nodes/edges, and show that the SkIt index is fast, accurate in the estimated distance, and has a high recall for paths that satisfy the constraints.
Feature-based models for improving the quality of noisy training data for relation extraction BIBAFull-Text 1181-1184
  Benjamin Roth; Dietrich Klakow
Supervised relation extraction from text relies on annotated data. Distant supervision is a scheme to obtain noisy training data by using a knowledge base of relational tuples as the ground truth and finding entity pair matches in a text corpus. We propose and evaluate two feature-based models for increasing the quality of distant supervision extraction patterns.
   The first model is an extension of a hierarchical topic model that induces background, relation specific and argument-pair specific feature distributions. The second model is a perceptron, trained to match an objective function that enforces two constraints: 1) an at-least-one semantics, i.e. at least one training example per relational tuple is assumed to be correct; 2) high scores for a dedicated NIL label that accounts for the noise in the training data. For both algorithms, neither explicit negative data nor the ratio of negatives has to be provided. Both algorithms give improvements over a maximum likelihood baseline as well as over a previous topic model without features, evaluated on TAC KBP data.
Weighted hashing for fast large scale similarity search BIBAFull-Text 1185-1188
  Qifan Wang; Dan Zhang; Luo Si
Similarity search, or finding approximate nearest neighbors, is an important technique for many applications. Much recent research demonstrates that hashing methods can achieve promising results for large scale similarity search due to their computational and memory efficiency.
   However, most existing hashing methods treat all hashing bits equally and the distance between data examples is calculated as the Hamming distance between their hashing codes, while different hashing bits may carry different amounts of information. This paper proposes a novel method, named Weighted Hashing (WeiHash), to assign different weights to different hashing bits. The hashing codes and their corresponding weights are jointly learned in a unified framework by simultaneously preserving the similarity between data examples and balancing the variance of each hashing bit. An iterative coordinate descent optimization algorithm is designed to derive desired hashing codes and weights. Extensive experiments on two large scale datasets demonstrate the superior performance of the proposed method over several state-of-the-art hashing methods.
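The contrast between plain and weighted Hamming distance can be shown in a few lines. This sketch covers only the distance computation; how WeiHash learns the codes and weights is not reproduced here, and the function name and example weights are made up for illustration.

```python
def weighted_hamming(code_a, code_b, weights):
    """Sum the weights of the bit positions where two hash codes differ.

    With all weights equal to 1 this reduces to the ordinary Hamming distance.
    """
    if not (len(code_a) == len(code_b) == len(weights)):
        raise ValueError("codes and weights must have the same length")
    return sum(w for a, b, w in zip(code_a, code_b, weights) if a != b)


code_a = [1, 0, 1, 1]
code_b = [1, 1, 0, 1]
plain = weighted_hamming(code_a, code_b, [1, 1, 1, 1])
weighted = weighted_hamming(code_a, code_b, [0.5, 2.0, 1.0, 0.25])
```

Two pairs of codes that differ in the same number of bits can therefore end up at different distances, depending on which bits differ.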
Term associations in query expansion: a structural linguistic perspective BIBAFull-Text 1189-1192
  Michael Symonds; Guido Zuccon; Bevan Koopman; Peter Bruza; Laurianne Sitbon
Many successful query expansion techniques ignore information about the term dependencies that exist within natural language. However, researchers have recently demonstrated that consistent and significant improvements in retrieval effectiveness can be achieved by explicitly modelling term dependencies within the query expansion process. This has created an increased interest in dependency-based models.
   State-of-the-art dependency-based approaches primarily model term associations known within structural linguistics as syntagmatic associations, which are formed when terms co-occur more often than by chance. However, structural linguistics proposes that the meaning of a word is also dependent on its paradigmatic associations, which are formed between words that can substitute for each other without affecting the acceptability of a sentence. Given the reliance on word meanings when a user formulates their query, our approach takes the novel step of modelling both syntagmatic and paradigmatic associations within the query expansion process based on the (pseudo) relevant documents returned in web search. The results demonstrate that this approach can provide significant improvements in web retrieval effectiveness when compared to a strong benchmark retrieval system.
Predicting event-relatedness of popular queries BIBAFull-Text 1193-1196
  Seyyedeh Newsha Ghoreishi; Aixin Sun
Many but not all popular queries are related to ongoing or recent events. In this paper, we identify 20 features including both contextual and temporal features from a small set of search results of a query and predict its event-relatedness. Search results from news and blog search engines are evaluated. Our analysis shows that the number of named entities in search results and their appearances in Wikipedia are among the most discriminative features for query event-relatedness prediction. Our study also shows that contextual features are more effective than temporal features. Evaluated with four classifiers (i.e., Support Vector Machine, Naive Bayes, Multinomial Logistic Regression, and Bayesian Logistic Regression) on two datasets, our experiments show that query event-relatedness can be predicted with high accuracy using the proposed features.
Modeling latent topic interactions using quantum interference for information retrieval BIBAFull-Text 1197-1200
  Alessandro Sordoni; Jing He; Jian-Yun Nie
Recently, increasing attention has been given to a possible reinterpretation of information retrieval issues in the more general probabilistic framework offered by Quantum Theory.
   In this paper, we investigate the use of the well-known wave-like phenomenon of Quantum Interference for topic models such as Latent Dirichlet Allocation (LDA). We use interference effects in order to model interactions between latent topics. Our aim is to elaborate a way to build more precise document models starting from original LDA estimations. Experiments in ad-hoc retrieval show statistically significant improvements on several TREC collections.
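For orientation, the generic two-amplitude interference identity from quantum probability is shown below. It is the textbook formula, not the paper's specific topic model; the symbols $\psi_1$, $\psi_2$ (complex amplitudes attached to two latent topics) and $\theta$ (their relative phase) are assumptions introduced here for illustration.

```latex
\[
  \lvert \psi_1 + \psi_2 \rvert^2
  = \lvert \psi_1 \rvert^2 + \lvert \psi_2 \rvert^2
  + 2\,\lvert \psi_1 \rvert\,\lvert \psi_2 \rvert \cos\theta
\]
```

The first two terms correspond to a classical sum of the two contributions, while the cosine term is the interference effect: it can raise or lower the total depending on $\theta$, and it vanishes when $\cos\theta = 0$.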
Generalizing diversity detection in blog feed retrieval BIBAFull-Text 1201-1204
  Mostafa Keikha; Fabio Crestani; Bruce Croft
The goal of a blog retrieval system is to retrieve and rank blogs, as collections of documents, in response to a given query. Previous studies have shown that diversity among the top retrieved posts from a blog is a positive feature for indicating relevance of the blog to the query. However, existing methods capture the diversity of a blog using post-level properties, which limits their application to a specific category of retrieval methods. In this paper, we propose a blog-level diversity measure that makes no assumption about the underlying blog-ranking technique. The proposed measure enables us to integrate diversity in any existing blog retrieval method. Our experimental results show that the proposed method, while being more general, produces comparable results to the post-level diversity detection methods.
Dynamic query intent mining from a search log stream BIBAFull-Text 1205-1208
  Yanan Qian; Tetsuya Sakai; Junting Ye; Qinghua Zheng; Cong Li
It has long been recognized that search queries are often broad and ambiguous. Even when submitting the same query, different users may have different search intents. Moreover, the intents are dynamically evolving. Some intents are constantly popular with users; others are more bursty. We propose a method for mining dynamic query intents from search query logs. By regarding the query logs as a data stream, we identify constant intents while quickly capturing new bursty intents. To evaluate the accuracy and efficiency of our method, we conducted experiments using 50 topics from the NTCIR INTENT-9 data and five additional popular topics, all supplemented with six-month query logs from a commercial search engine. Our results show that our method can accurately capture new intents with short response time.
Latency-aware strategy for static list caching in flash-based web search engines BIBAFull-Text 1209-1212
  Jiancong Tong; Gang Wang; Xiaoguang Liu
Caching is a widely used technique to boost the performance of search engines. Based on the observation that the speed gap between random access and sequential access is much less apparent on a flash-based solid state drive than on a magnetic hard disk drive, we introduce a new static list caching algorithm which takes the block-level access latency into consideration. The experimental results show that the proposed policy can reduce the average disk access latency per query by up to 14% over the state-of-the-art algorithms in the SSD-based infrastructure. The results also reveal that our new strategy outperforms other existing algorithms even on HDD-based architecture.
Bootstrapping active name disambiguation with crowdsourcing BIBAFull-Text 1213-1216
  Yu Cheng; Zhengzhang Chen; Jiang Wang; Ankit Agrawal; Alok Choudhary
Name disambiguation is a challenging and important problem in many domains, such as digital libraries, social media management and people search systems. Traditional methods, based on direct assignment using supervised machine learning techniques, seem to be the most effective, but their performance is highly dependent on the amount of training data, while large data annotation can be expensive and time-consuming, requiring hours of manual inspection by a domain expert. To efficiently acquire labeled data, we propose a bootstrapping algorithm for the name disambiguation task based on active learning and crowdsourced labeling. We show that the proposed method can leverage the advantages of exploration and exploitation by combining two strategies, thereby improving the overall quality of the training data at minimal expense. The experimental results on two datasets, DBLP and ArnetMiner, demonstrate the superiority of our framework over existing methods.
Modeling clicks beyond the first result page BIBAFull-Text 1217-1220
  Aleksandr Chuklin; Pavel Serdyukov; Maarten de Rijke
Most modern web search engines yield a list of documents of a fixed length (usually 10) in response to a user query. The next ten search results are usually available in one click. These documents either replace the current result page or are appended to the end. Hence, in order to examine more documents than the first 10 the user needs to explicitly express her intention. Although clickthrough numbers are lower for documents on the second and later result pages, they still represent a noticeable amount of traffic.
   We propose a modification of the Dynamic Bayesian Network (DBN) click model by explicitly including into the model the probability of transition between result pages. We show that our new click model can significantly better capture user behavior on the second and later result pages while giving the same performance on the first result page.
Maintaining discriminatory power in quantized indexes BIBAFull-Text 1221-1224
  Matt Crane; Andrew Trotman; Richard O'Keefe
The time cost of searching with an inverted index is directly proportional to the number of postings processed and the cost of processing each posting. Dynamic pruning reduces the number of postings examined. Pre-calculation then quantization of term / document weights reduces the cost of evaluating each posting. The effect of quantization on precision, latency, and index size is examined herein. We show empirically that there is an ideal size (in bits) for storing the quantized scores. Increasing this adversely affects index size and search latency; decreasing it adversely affects precision. We observe a relationship between the collection size and ideal quantization size, and provide a way to determine the number of bits to use from the collection size.
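The trade-off described in this abstract can be illustrated with a minimal sketch of uniform quantization, assuming term/document weights are plain floats. The function name `quantize` and the linear mapping are illustrative assumptions, not the authors' implementation.

```python
def quantize(weights, bits):
    """Uniformly map float weights onto integer codes in [0, 2**bits - 1].

    Hypothetical helper for illustration only; the paper's own
    quantization scheme is not specified in the abstract.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1          # largest code that fits in `bits` bits
    if hi == lo:                      # all weights equal: one code suffices
        return [0 for _ in weights]
    scale = levels / (hi - lo)
    return [round((w - lo) * scale) for w in weights]


weights = [0.0, 1.0, 2.0, 3.0]
two_bit = quantize(weights, 2)   # four levels: every weight keeps its own code
one_bit = quantize(weights, 1)   # two levels: distinct weights collapse together
```

With fewer bits, distinct weights share a code, which is the loss of discriminatory power the title refers to. More bits keep the weights apart at the cost of index size.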
Retrieving opinions from discussion forums BIBAFull-Text 1225-1228
  Laura Dietz; Ziqi Wang; Samuel Huston; W. Bruce Croft
Understanding the landscape of opinions on a given topic or issue is important for policy makers, sociologists, and intelligence analysts. The first step in this process is to retrieve relevant opinions. Discussion forums are potentially a good source of this information, but they come with a unique set of retrieval challenges. In this short paper, we test a range of existing techniques for forum retrieval and develop new retrieval models to differentiate between opinionated and factual forum posts. We are able to demonstrate some significant performance improvements over the baseline retrieval models, demonstrating that this is a promising avenue for further study.
Retrieval of trending keywords in a peer-to-peer micro-blogging OSN BIBAFull-Text 1229-1232
  H. Asthana; Ingemar Cox
We investigate the problem of identifying trending information in a peer-to-peer micro-blogging online social network. In a distributed decentralized environment, the participating nodes do not have access to global statistics such as the frequencies of the keywords and the information creation rate. We propose a two-step solution. First, nodes make a local estimate of the frequency of keywords in the network based on their local information. At each iteration, a subset of nodes collect this information from a small subset of random nodes in the network and aggregate the results. The most frequently occurring keywords are identified. In the second step, a node requests another small random subset of nodes to identify when, in the recent past, the more frequently occurring keywords were seen in micro-blogs. Once again this information is aggregated, and the fraction of time within a consecutive period that keywords were encountered is calculated. If this fraction, referred to as the trending fraction, is close to 1, then the keyword is predicted to be trending. A simulation on a network of 10,000 nodes shows that the solution is capable of detecting multiple trending keywords with a moderate increase in bandwidth.
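The second step of this abstract can be sketched roughly as follows, assuming time is divided into integer slots and each sampled node reports the set of slots in which it saw the keyword. The function names, the slot representation, and the 0.9 threshold are all assumptions made for illustration; the abstract only says the fraction should be "close to 1".

```python
def aggregate(reports):
    """Union the time slots reported by the sampled nodes (assumed representation)."""
    seen = set()
    for slots in reports:
        seen |= set(slots)
    return seen


def trending_fraction(seen_slots, window_start, window_len):
    """Fraction of slots in a consecutive window in which the keyword was seen."""
    window = range(window_start, window_start + window_len)
    hits = sum(1 for t in window if t in seen_slots)
    return hits / window_len


def is_trending(seen_slots, window_start, window_len, threshold=0.9):
    """Predict trending when the fraction is close to 1 (threshold is hypothetical)."""
    return trending_fraction(seen_slots, window_start, window_len) >= threshold


# Three sampled nodes report when they saw the keyword during slots 1..6.
reports = [{1, 2}, {3, 4, 5}, {2, 6}]
seen = aggregate(reports)
fraction = trending_fraction(seen, window_start=1, window_len=6)
```

No single node in this example saw the keyword in every slot; only the aggregated view covers the whole window.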
Trustable aggregation of online ratings BIBAFull-Text 1233-1236
  Hyun-Kyo Oh; Sang-Wook Kim; Sunju Park; Ming Zhou
The average of the customer ratings on a product, which we call reputation, is one of the key factors in online purchasing decisions. There is, however, no guarantee of the trustworthiness of the reputation since it can be manipulated rather easily. In this paper, we define false reputation as the problem of a reputation being manipulated by unfair ratings, and design a general framework that provides trustable reputation. For this purpose, we propose TRUEREPUTATION, an algorithm that iteratively adjusts the reputation based on the confidence of customer ratings.
Exploiting proximity feature in statistical translation models for information retrieval BIBAFull-Text 1237-1240
  Xinhui Tu; Jing Luo; Bo Li; Tingting He; Maofu Liu
A main challenge in applying translation language models to information retrieval is how to estimate the 'true' probability that a query could be generated as a translation of a document. The state-of-the-art methods rely on document-based word co-occurrences to estimate word-word translation probabilities. However, these methods do not take into account the proximity of co-occurrences. Intuitively, the proximity of co-occurrences can be exploited to estimate more accurate translation probabilities, since two words that occur closer together are more likely to be related. In this paper, we study how to explicitly incorporate proximity information into the existing translation language model, and propose a proximity-based translation language model, called TM-P, with three variants. In our TM-P models, a new concept (proximity-based word co-occurrence frequency) is introduced to model the proximity of word co-occurrences, which is then used to estimate translation probabilities. Experimental results on standard TREC collections show that our TM-P models achieve significant improvements over the state-of-the-art translation models.
Position-based contextualization for passage retrieval BIBAFull-Text 1241-1244
  David Carmel; Anna Shtok; Oren Kurland
We present a novel contextualization approach for passage retrieval. The core principle is to let any occurrence of a query term in a document affect the passage retrieval score, whether the occurrence is in the passage or not. This effect is controlled by the distance between the term occurrence and the passage. Empirical evaluation demonstrates the merits of our approach; the resultant retrieval performance substantially transcends that of previously proposed passage retrieval methods, including those that use various contextualization approaches.
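The core principle in this abstract can be sketched as a passage score in which every query-term occurrence in the document contributes, discounted by its distance from the passage. The exponential decay kernel, the `sigma` parameter, and the function name are assumptions chosen for illustration; the abstract does not state which distance function the authors use.

```python
import math


def passage_score(doc_tokens, query_terms, start, end, sigma=10.0):
    """Score the passage doc_tokens[start:end] using all query-term occurrences.

    Occurrences inside the passage count fully; occurrences outside are
    discounted by exp(-distance / sigma), a hypothetical decay kernel.
    """
    score = 0.0
    for pos, tok in enumerate(doc_tokens):
        if tok not in query_terms:
            continue
        if pos < start:
            dist = start - pos            # tokens before the passage
        elif pos >= end:
            dist = pos - (end - 1)        # tokens after the passage
        else:
            dist = 0                      # inside the passage
        score += math.exp(-dist / sigma)
    return score


doc = ["a", "b", "q", "c", "q"]
score = passage_score(doc, {"q"}, start=2, end=3)
```

Here the passage holds one occurrence of the query term, and the second occurrence two tokens later still adds a discounted contribution.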
High throughput filtering using FPGA-acceleration BIBAFull-Text 1245-1248
  Wim Vanderbauwhede; Anton Frolov; Leif Azzopardi; Sai Rahul Chalamalasetti; Martin Margala
With the rise in the amount of information being streamed across networks, there is a growing demand to vet the quality, type and content itself for various purposes such as spam, security and search. In this paper, we develop an energy-efficient high performance information filtering system that is capable of classifying a stream of incoming documents at high speed. The prototype parses a stream of documents using a multicore CPU and then performs classification using Field-Programmable Gate Arrays (FPGAs). On a large TREC data collection, we implemented a Naive Bayes classifier on our prototype and compared it to an optimized CPU-based baseline. Our empirical findings show that we can classify documents at 10Gb/s which is up to 94 times faster than the CPU baseline (and up to 5 times faster than previous FPGA-based implementations). In future work, we aim to increase the throughput by another order of magnitude by implementing both the parser and filter on the FPGA.
On challenges with mobile e-health: lessons from a game-theoretic perspective BIBAFull-Text 1249-1252
  Ann-Marie Eklund
Health portals play an important role in today's health care, and the increased mobility places demands on the portals to provide suggestions that are as accurate and as few as possible. Often the information seekers may be in distress, lacking medical knowledge and expressing themselves in ways that make it difficult for the portal to interpret the seekers' needs. This raises the question of how portal providers may be able both to better model, or describe, the user behaviour and to predict the impact of changes in search algorithms to address these challenges.
   This paper highlights some possibilities and benefits of a theoretic framework, based on existing works on game-theoretic treatments of information retrieval and communication, to allow for both descriptive and predictive analysis of internet-based health communication. This is especially important in the context of increased mobility, demanding more accurate and fewer interactions. We also elaborate on how one of the fundamental results of game theory on equilibria may be used as a basis for improved information search. Perhaps counter-intuitively, this is done not by tweaking the portal, but instead by trying to change the seekers' behaviour towards passing more diversifying queries.
Improving entity search over linked data by modeling latent semantics BIBAFull-Text 1253-1256
  Nikita Zhiltsov; Eugene Agichtein
Entity ranking has become increasingly important, both for retrieving structured entities and for use in general web search applications. The most common format for linked data, RDF graphs, provides extensive semantic structure via predicate links. While the semantic information is potentially valuable for effective search, the resulting adjacency matrices are often sparse, which introduces challenges for representation and ranking. In this paper, we propose a principled and scalable approach for integrating latent semantic information into a learning-to-rank model, by combining compact representation of semantic similarity, achieved by using a modified algorithm for tensor factorization, with explicit entity information. Our experiments show that the resulting ranking model scales well to graphs with millions of entities, and outperforms the state-of-the-art baseline on realistic Yahoo! SemSearch Challenge data sets.

Industry session

Challenges in commerce search BIBAFull-Text 1257-1258
  Hugh Williams
Commerce search engines allow users to discover products, learn about them, and, importantly, make purchases. Commerce search is a challenging problem -- one that is very different to conventional text and web search. In this talk, we discuss what makes commerce search hard, how eBay has solved some of these problems, and what challenges eBay faces in the next generation of its search technologies. We also discuss the recent release of eBay's Cassini engine, share facts and figures about its scale, and outline the progress eBay has made in ranking and relevance for commerce search.
Clustering: probably approximately useless? BIBAFull-Text 1259-1260
  Rich Caruana
Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn't. The problem is that clustering rarely finds the groups you want, or expected, or that are most useful for the task at hand. There are so many good ways to cluster a dataset that the odds of coming up with the clustering that is best for what you are doing now are small. How do we fix this and make clustering more useful in practice? How do we make clustering do what you want, while still giving it the freedom to "do its own thing" and surprise us?

IR track: ranking

Is top-k sufficient for ranking? BIBAFull-Text 1261-1270
  Yanyan Lan; Shuzi Niu; Jiafeng Guo; Xueqi Cheng
Recently, 'top-k learning to rank' has attracted much attention in the community of information retrieval. The motivation comes from the difficulty in obtaining a full-order ranking list for training, when employing reliable pairwise preference judgment. Inspired by the observation that users mainly care about top ranked search results, top-k learning to rank proposes to utilize top-k ground-truth for training, where only the total order of the top k items is provided, instead of a full-order ranking list. However, it is not clear whether the underlying assumption holds, i.e. top-k ground-truth is sufficient for training. In this paper, we propose to study this problem from both empirical and theoretical aspects. Empirically, our experimental results on benchmark datasets LETOR4.0 show that the test performances of both pairwise and listwise ranking algorithms will quickly increase to a stable value, with the growth of k in the top-k ground-truth. Theoretically, we prove that the losses of these typical ranking algorithms in top-k setting are tighter upper bounds of (1 - NDCG@k), compared with that in full-order setting. Therefore, our studies reveal that learning on top-k ground-truth is surely sufficient for ranking, which lays a foundation for the new learning to rank framework.
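For readers unfamiliar with the quantity (1 - NDCG@k) that the losses are said to bound, the following is a minimal sketch of NDCG@k using the commonly used exponential gain and logarithmic discount. This form is an assumption; the paper's exact definition may differ.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top k graded relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the ranker returned the documents.
perfect = ndcg_at_k([2, 1, 0], k=3)
reversed_order = ndcg_at_k([0, 1, 2], k=3)
loss = 1 - reversed_order   # the quantity the surrogate losses upper-bound
```

A ranking that places the most relevant documents first reaches NDCG@k of 1, so (1 - NDCG@k) is zero exactly when the top k positions are ideally ordered.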
How fresh do you want your search results? BIBAFull-Text 1271-1280
  Shiwen Cheng; Anastasios Arvanitis; Vagelis Hristidis
Researchers have recognized the importance of utilizing temporal features for improving the performance of information retrieval systems. Specifically, the timeliness of a web document can be a significant factor for determining whether it is relevant for a search query. Previous works have proposed time-aware retrieval models with particular focus on news queries, where recent web documents related to a real-world event are generally preferable. These queries typically exhibit bursts in the volume of published documents or submitted queries. However, no work has studied the role of time in queries such as "credit card overdraft fees" that have no major spikes in either document or query volumes over time, yet still favor more recently published documents. In this work, we focus on this class of queries that we refer to as "timely queries". We show that the change in the term distribution of results of timely queries over time is strongly correlated with the users' perception of time sensitivity. Based on this observation, we propose a method to estimate the query timeliness requirements and we propose principled ways to incorporate document freshness into the ranking model. Our study shows that our method yields a more accurate estimation of timeliness compared to volume-based approaches. We experimentally compare our ranking strategy with other time-sensitive and non time-sensitive ranking algorithms and we show that it improves the results' retrieval quality for timely queries.
TellMyRelevance!: predicting the relevance of web search results from cursor interactions BIBAFull-Text 1281-1290
  Maximilian Speicher; Andreas Both; Martin Gaedke
It is crucial for the success of a search-driven web application to answer users' queries in the best possible way. A common approach is to use click models for guessing the relevance of search results. However, these models are imprecise and forgo valuable information one can gain from non-click user interactions. We introduce TellMyRelevance! -- a novel automatic end-to-end pipeline for tracking cursor interactions at the client, analyzing these and learning corresponding relevance models. Yet, the models depend on the layout of the search results page involved, which makes them difficult to evaluate and compare. Thus, we use a Random Mouse Cursor as an extension to our pipeline for generating layout-dependent baselines. Based on these, we can perform evaluations of real-world relevance models. A large-scale interaction log analysis showed that we can learn relevance models whose predictions compare favorably to predictions of an existing state-of-the-art click model.
Selection fusion in semi-structured retrieval BIBAFull-Text 1291-1300
  Muhammad Ali Norozi; Paavo Arvola
Semi-structured retrieval aims at providing focused answers to the users' queries. A successful retrieval experience in a semi-structured environment would mean a satisfactory combination of (a) matching or scoring and (b) selection of appropriate and focused fragments of the text. The need to retrieve items of different sizes arises today with users querying the retrieval systems with varied use case, user interface and screen-size requirements. This means that different selection scenarios serve different requirements and constraints. Hence we propose a novel type of fusion, the selection fusion -- a fusion methodology which fuses an all-purpose and comprehensive ranking of elements with a specific selection scheme, and also enables evaluation of the ranking in many selection perspectives. With the standard Wikipedia XML test collection, we are able to demonstrate that a strong and competitive baseline ranking system improves retrieval quality irrespective of the selection criteria. Our baseline ranking system is based on data fusion over the official submitted runs at INEX 2009.
Incorporating user preferences into click models BIBAFull-Text 1301-1310
  Qianli Xing; Yiqun Liu; Jian-Yun Nie; Min Zhang; Shaoping Ma; Kuo Zhang
Click models are developed to interpret clicks by making assumptions on how users browse the search result page. Most existing click models implicitly assume that all users are homogeneous and act in the same way when browsing the search results. However, a number of studies have shown that users have diverse behavioral patterns, which is also observed in this paper by eye-tracking experiments and click-through log analysis. As a uniform click model for all users can hardly capture the diverse click behavior, in this paper we incorporate user preferences into both a variety of existing click models and a novel click model. The experimental results on a large-scale click-through data set show consistent and significant performance improvement of the click models with user preferences integrated.

KM track: learning and applications (1)

Feedback-driven multiclass active learning for data streams BIBAFull-Text 1311-1320
  Yu Cheng; Zhengzhang Chen; Lu Liu; Jiang Wang; Ankit Agrawal; Alok Choudhary
Active learning is a promising way to efficiently build up training sets with minimal supervision. Most existing methods consider the learning problem in a pool-based setting. However, in a lot of real-world learning tasks, such as crowdsourcing, the unlabeled samples arrive sequentially in the form of continuous rapid streams. Thus, preparing a pool of unlabeled data for active learning is impractical. Moreover, performing exhaustive search in a data pool is expensive, and therefore unsuitable for supporting on-the-fly interactive learning on large-scale data. In this paper, we present a systematic framework for stream-based multi-class active learning. Following the reinforcement learning framework, we propose a feedback-driven active learning approach by adaptively combining different criteria in a time-varying manner. Our method is able to balance exploration and exploitation during the learning process. Extensive evaluation on various benchmark and real-world datasets demonstrates the superiority of our framework over existing methods.
Discriminative feature selection for multi-view cross-domain learning BIBAFull-Text 1321-1330
  Zheng Fang; Zhongfei (Mark) Zhang
In many data mining applications, we often face the problem of cross-domain learning, i.e., to transfer the already learned knowledge from a source domain to a target domain. In particular, this problem becomes very challenging when there is no or little labeled training data available in the target domain, which is not an uncommon scenario as it is expensive and in certain cases even impossible to obtain any labeled training data in the target domain in many real world applications. In the literature, though few efforts are reported to attempt to solve this challenging problem, the solutions are all rather limited making this problem still open and challenging. On the other hand, as it is not uncommon to face this problem in many applications, an effective solution to this problem shall generate substantial societal impacts. In this paper, we address this problem and propose a new framework, called DISMUTE, taking advantage of the typically available multiple views of the data in domains. Consequently, DISMUTE is based on discriminative feature selection for multi-view cross-domain learning. Theoretic analysis and extensive evaluations in the specific application of object identification and image classification against several state-of-the-art methods demonstrate the outstanding superiority of DISMUTE.
Functional Dirichlet process BIBAFull-Text 1331-1340
  Lijing Qin; Xiaoyan Zhu
The Dirichlet process mixture (DPM) model is one of the most important Bayesian nonparametric models owing to its efficiency of inference and flexibility for various applications. A fundamental assumption made by the DPM model is that all data items are generated from a single, shared DP. This assumption, however, is restrictive in many practical settings where samples are generated from a collection of dependent DPs, each associated with a point in some covariate space. For example, documents in the proceedings of a conference are organized by year, or photos may be tagged and recorded with GPS locations. We present a general method for constructing dependent Dirichlet processes (DP) on an arbitrary covariate space. The approach is based on restricting and projecting a DP defined on a space of continuous functions with different domains, which results in a collection of dependent random measures, each of which is associated with a point in covariate space and is marginally DP distributed. The constructed collection of dependent DPs can be used as a nonparametric prior of infinite dynamic mixture models, which allow each mixture component to appear/disappear and vary in a subspace of covariate space. Furthermore, we discuss choices of base distributions of functions in a variety of settings as a flexible method to control dependencies. In addition, we develop an efficient Gibbs sampler for model inference where all underlying random measures are integrated out. Finally, experimental results on temporal modeling and spatial modeling datasets demonstrate the effectiveness of the method in modeling dynamic mixture models on different types of covariates.
Spatio-temporal meme prediction: learning what hashtags will be popular where BIBAFull-Text 1341-1350
  Krishna Y. Kamath; James Caverlee
In this paper, we tackle the problem of predicting what online memes will be popular in what locations. Specifically, we develop data-driven approaches building on the global footprint of 755 million geo-tagged hashtags spread via Twitter. Our proposed methods model the geo-spatial propagation of online information spread to identify which hashtags will become popular in specific locations. Concretely, we develop a novel reinforcement learning approach that incrementally updates the best geo-spatial model. In experiments, we find that the proposed method outperforms alternative linear regression based methods.
Cost-sensitive learning for large-scale hierarchical classification BIBAFull-Text 1351-1360
  Jianfu Chen; David Warren
We study hierarchical classification of products in electronic commerce, classifying a text description of a product into one of the leaf classes of a tree-structure taxonomy. In particular, we investigate two essential problems, performance evaluation and learning, in a synergistic way. Unless we know what the appropriate performance evaluation metric for a task is, we are not going to learn a classifier of maximum utility for the task. Given the characteristics of the task of hierarchical product classification, we shed insight into how and why common evaluation metrics such as error rate can be misleading, which is applicable for treating other real world applications. The analysis leads to a new performance evaluation metric that tailors this task to reflect a vendor's business goal of maximizing revenue. The proposed metric has an intuitive meaning as the average revenue loss, which depends on both the value of individual products and the hierarchical distance between the true class and the predicted class. Correspondingly, our learning algorithm uses multi-class SVM with margin re-scaling to optimize the proposed metric, instead of error rate or other common metrics. Margin re-scaling is sensitive to the scaling of loss functions. We propose a loss normalization approach for appropriately calibrating the scaling of loss functions, which is applicable to general classification and structured prediction tasks whenever using structured SVM with margin re-scaling. Experiments on a large dataset show that our approach outperforms standard multi-class SVM in terms of the proposed metric, effectively reducing the average revenue loss.

KM track: similarity, clustering, and outlier mining

Effective measures for inter-document similarity BIBAFull-Text 1361-1370
  John S. Whissell; Charles L. A. Clarke
While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine remains the prevalent method for determining inter-document similarity for clustering and other applications. However, recent research demonstrates that BM25 term weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment we analyze the accuracy of nearest neighborhoods when using our measures. In our second experiment, we analyze using clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome strongly recommends the adoption of these measures, replacing cosine similarity in future work.
Efficient hierarchical clustering of large high dimensional datasets BIBAFull-Text 1371-1380
  Sean Gilpin; Buyue Qian; Ian Davidson
Hierarchical clustering is extensively used to organize high dimensional objects such as documents and images into a structure which can then be used in a multitude of ways. However, existing algorithms are limited in their application since the time complexity of agglomerative style algorithms can be as much as O(n^2 log n) where n is the number of objects. Furthermore, the computation of similarity between such objects is itself time consuming given that they are high dimensional, and even optimized built-in functions found in MATLAB take the best part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to hash objects with similar angular distance to the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that not only is our approach thousands of times faster than regular agglomerative algorithms but surprisingly the accuracy of our results is typically as good and can sometimes be substantially better.
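One standard way to hash objects by angle is sign random projection (random hyperplane hashing), sketched below. Whether the paper uses exactly this scheme is an assumption; the function names and the fixed seed are illustrative only.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals (hypothetical setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def angular_hash(vec, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def bucketize(vectors, hyperplanes):
    """Group vector indices by hash code; small angles tend to share a bucket."""
    buckets = {}
    for idx, v in enumerate(vectors):
        buckets.setdefault(angular_hash(v, hyperplanes), []).append(idx)
    return buckets


planes = make_hyperplanes(dim=3, n_bits=4)
vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-3.0, 1.0, 0.5]]
buckets = bucketize(vectors, planes)
```

Because each bit depends only on the sign of a dot product, vectors that point in the same direction always share a bucket regardless of length. Clustering can then proceed within each bucket and over the buckets themselves, as the abstract describes.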
Flexible and adaptive subspace search for outlier analysis BIBAFull-Text 1381-1390
  Fabian Keller; Emmanuel Müller; Andreas Wixler; Klemens Böhm
There exists a variety of traditional outlier models, which measure the deviation of outliers with respect to the full attribute space. However, these techniques fail to detect outliers that deviate only w.r.t. an attribute subset. To address this problem, recent techniques focus on a selection of subspaces that allow: (1) A clear distinction between clustered objects and outliers; (2) a description of outlier reasons by the selected subspaces. However, depending on the outlier model used, different objects in different subspaces have the highest deviation. It is an open research issue to make subspace selection adaptive to the outlier score of each object and flexible w.r.t. the use of different outlier models.
   In this work we propose such a flexible and adaptive subspace selection scheme. Our generic processing allows instantiations with different outlier models. We utilize the differences of outlier scores in random subspaces to perform a combinatorial refinement of relevant subspaces. Our refinement allows an individual selection of subspaces for each outlier, which is tailored to the underlying outlier model. In the experiments we show the flexibility of our subspace search w.r.t. various outlier models such as distance-based, angle-based, and local-density-based outlier detection.
Query matching for report recommendation 1391-1400
  Veronika Thost; Konrad Voigt; Daniel Schuster
Today, reporting is an essential part of everyday business life. But the preparation of complex Business Intelligence data by formulating relevant queries and presenting them in meaningful visualizations, so-called reports, is a challenging task for non-expert database users. To support these users with report creation, we leverage existing queries and present a system for query recommendation in a reporting environment, which is based on query matching. Targeting large-scale, real-world reporting scenarios, we propose a scalable, index-based query matching approach. Moreover, schema matching is applied for a more fine-grained, structural comparison of the queries. In addition to interactively providing content-based query recommendations of good quality, the system works independently of particular data sources or query languages.
   We evaluate our system with an empirical data set and show that it achieves an F1-Measure of 0.56 and outperforms the approaches applied by state-of-the-art reporting tools (e.g., keyword search) by up to 30%.
Computing term similarity by large probabilistic isA knowledge 1401-1410
  Peipei Li; Haixun Wang; Kenny Q. Zhu; Zhongyuan Wang; Xindong Wu
Computing semantic similarity between two terms is essential for a variety of text analytics and understanding applications. However, existing approaches are more suitable for semantic similarity between words rather than the more general multi-word expressions (MWEs), and they do not scale very well. Therefore, we propose a lightweight and effective approach for semantic similarity using a large scale semantic network automatically acquired from billions of web documents. Given two terms, we map them into the concept space, and compare their similarity there. Furthermore, we introduce a clustering approach to orthogonalize the concept space in order to improve the accuracy of the similarity measure. Extensive studies demonstrate that our approach can accurately compute the semantic similarity between terms with MWEs and ambiguity, and significantly outperforms 12 competing methods.

IR track: applications II

Interactive collaborative filtering 1411-1420
  Xiaoxue Zhao; Weinan Zhang; Jun Wang
In this paper, we study collaborative filtering (CF) in an interactive setting, in which a recommender system continuously recommends items to individual users and receives interactive feedback. Whilst users enjoy sequential recommendations, the recommendation predictions are constantly refined using up-to-date feedback on the recommended items. Bringing the interactive mechanism back to the CF process is fundamental because the ultimate goal for a recommender system is about the discovery of interesting items for individual users and yet users' personal preferences and contexts evolve over time during the interactions with the system. This requires us not to distinguish between the stages of collecting information to construct the user profile and making recommendations, but to seamlessly integrate these stages together during the interactive process, with the goal of maximizing the overall recommendation accuracy throughout the interactions. This mechanism naturally addresses the cold-start problem as any user can immediately receive sequential recommendations without providing ratings beforehand. We formulate the interactive CF with the probabilistic matrix factorization (PMF) framework, and leverage several exploitation-exploration algorithms to select items, including the empirical Thompson sampling and upper confidence bound based algorithms. We conduct our experiment on cold-start users as well as warm-start users with drifting taste. Results show that the proposed methods have significant improvements over several strong baselines for the MovieLens, EachMovie and Netflix datasets.
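The exploration-exploitation idea mentioned in this abstract can be sketched with a plain UCB1 rule over per-item mean feedback. This is a deliberately simplified stand-in and not the PMF-based method of the paper; the function names and the exploration constant `c` are assumptions.

```python
import math

def ucb_select(counts, means, total_pulls, c=2.0):
    """Pick the item with the highest upper confidence bound on its mean reward."""
    # Recommend every item once before comparing bounds.
    for item, n in enumerate(counts):
        if n == 0:
            return item
    scores = [
        mean + math.sqrt(c * math.log(total_pulls) / n)
        for mean, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def update(counts, means, item, reward):
    """Fold one piece of user feedback into the running mean for the item."""
    counts[item] += 1
    means[item] += (reward - means[item]) / counts[item]

counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
first = ucb_select(counts, means, total_pulls=0)
update(counts, means, first, reward=1.0)
```

An item that has been shown rarely keeps a wide confidence bound, so it can still be recommended even when its current mean is lower. This is how a cold-start user can receive recommendations without rating anything beforehand.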
Building optimal information systems automatically: configuration space exploration for biomedical information systems 1421-1430
  Zi Yang; Elmer Garduno; Yan Fang; Avner Maiberg; Collin McCormack; Eric Nyberg
Software frameworks which support integration and scaling of text analysis algorithms make it possible to build complex, high performance information systems for information extraction, information retrieval, and question answering; IBM's Watson is a prominent example. As the complexity and scaling of information systems become ever greater, it is much more challenging to effectively and efficiently determine which toolkits, algorithms, knowledge bases or other resources should be integrated into an information system in order to achieve a desired or optimal level of performance on a given task. This paper presents a formal representation of the space of possible system configurations, given a set of information processing components and their parameters (configuration space) and discusses algorithmic approaches to determine the optimal configuration within a given configuration space (configuration space exploration or CSE). We introduce the CSE framework, an extension to the UIMA framework which provides a general distributed solution for building and exploring configuration spaces for information systems. The CSE framework was used to implement biomedical information systems in case studies involving over a trillion different configuration combinations of components and parameter values operating on question answering tasks from the TREC Genomics. The framework automatically and efficiently evaluated different system configurations, and identified configurations that achieved better results than prior published results.
Learning to handle negated language in medical records search 1431-1440
  Nut Limsopatham; Craig Macdonald; Iadh Ounis
Negated language is frequently used by medical practitioners to indicate that a patient does not have a given medical condition. Traditionally, information retrieval systems do not distinguish between the positive and negative contexts of terms when indexing documents. For example, when searching for patients with angina, a retrieval system might wrongly consider a patient with a medical record stating "no evidence of angina" to be relevant. While it is possible to enhance a retrieval system by taking into account the context of terms within the indexing representation of a document, some non-relevant medical records can still be ranked highly, if they include some of the query terms with the intended context. In this paper, we propose a novel learning framework that effectively handles negated language. Based on features related to the positive and negative contexts of a term, the framework learns how to appropriately weight the occurrences of the opposite context of any query term, thus preventing documents that may not be relevant from being retrieved. We thoroughly evaluate our proposed framework using the TREC 2011 and 2012 Medical Records track test collections. Our results show significant improvements over existing strong baselines. In addition, in combination with a traditional query expansion and a conceptual representation approach, our proposed framework could achieve a retrieval effectiveness comparable to the performance of the best TREC 2011 and 2012 systems, while not addressing other challenges in medical records search, such as the exploitation of semantic relationships between medical terms.
A pattern-based selective recrawling approach for object-level vertical search 1441-1450
  Yaqian Zhou; Qi Zhang; Xuanjing Huang; Lide Wu
Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called "selective recrawling" for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources.
   In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.
Robust models of mouse movement on dynamic web search results pages 1451-1460
  Fernando Diaz; Ryen White; Georg Buscher; Dan Liebling
Understanding how users examine result pages across a broad range of information needs is critical for search engine design. Cursor movements can be used to estimate visual attention on search engine results page (SERP) components, including traditional snippets, aggregated results, and advertisements. However, these signals can only be leveraged for SERPs where cursor tracking was enabled, limiting their utility for informing the design of new SERPs. In this work, we develop robust, log-based mouse movement models capable of estimating searcher attention on novel SERP arrangements. These models can help improve SERP design by anticipating searchers' engagement patterns given a proposed arrangement. We demonstrate the efficacy of our method using a large set of mouse-tracking data collected from two independent commercial search engines.

Poster Session: KM track

Cross-domain sparse coding 1461-1464
  Jim Jing-Yan Wang; Halima Bensmail
Sparse coding has shown its power as an effective data representation method. However, up to now, all sparse coding approaches have been limited to the single-domain learning problem. In this paper, we extend sparse coding to the cross-domain learning problem, which tries to learn from a source domain to a target domain with a significantly different distribution. We impose the Maximum Mean Discrepancy (MMD) criterion to reduce the cross-domain distribution difference of sparse codes, and also regularize the sparse codes by the class labels of the samples from both domains to increase the discriminative ability. The encouraging experimental results of the proposed cross-domain sparse coding algorithm on two challenging tasks -- image classification of photograph and oil painting domains, and multiple user spam detection -- show the advantage of the proposed method over other cross-domain data representation methods.
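As a small illustration of the MMD criterion named in this abstract, the sketch below computes the squared MMD under a linear kernel, which reduces to the squared distance between the domain means. The kernel choice and the function names are assumptions; the abstract does not say which kernel the paper uses.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length code vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def linear_mmd_squared(source_codes, target_codes):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source_codes)
    mu_t = mean_vector(target_codes)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

# Codes from two domains whose means coincide give zero discrepancy.
same = linear_mmd_squared([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0]])
# Codes whose means differ by (3, 4) give a squared discrepancy of 25.
shifted = linear_mmd_squared([[0.0, 0.0]], [[3.0, 4.0]])
```

Adding such a term to a sparse coding objective would penalize codes whose source and target means drift apart.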
Motif discovery in spatial trajectories using grammar inference 1465-1468
  Tim Oates; Arnold P. Boedihardjo; Jessica Lin; Crystal Chen; Susan Frankenstein; Sunil Gandhi
Spatial trajectory analysis is crucial to uncovering insights into the motives and nature of human behavior. In this work, we study the problem of discovering motifs in trajectories based on symbolically transformed representations and context free grammars. First, we propose a fast and robust grammar induction algorithm called mSEQUITUR to infer a grammar rule set from a trajectory for motif generation. Second, we design the Symbolic Trajectory Analysis and VIsualization System (STAVIS), the first trajectory analytical system of its kind that applies grammar inference to derive trajectory signatures and enable mining tasks on the signatures. Third, an empirical evaluation is performed to demonstrate the efficiency and effectiveness of mSEQUITUR for generating trajectory signatures and discovering motifs.
LCMKL: latent-community and multi-kernel learning based image annotation 1469-1472
  Qing Li; Yun Gu; Xueming Qian
Automatic image annotation is an important function for online photo sharing services. The co-occurrence of labels is common in multi-label annotation. In this paper, we propose a novel approach called latent-community and multi-kernel learning (LCMKL). The established graph of labels is regarded as a semantic network. A community detection method is introduced that treats the label set as communities. A multi-kernel learning SVM is adopted for specifying communities and for addressing the difficulty of extracting semantically meaningful entities with simple features. Experiments on the NUS-WIDE database demonstrate that LCMKL outperforms other state-of-the-art approaches.
Random walk-based graphical sampling in unbalanced heterogeneous bipartite social graphs 1473-1476
  Yusheng Xie; Zhengzhang Chen; Ankit Agrawal; Alok Choudhary; Lu Liu
We investigate sampling techniques in unbalanced heterogeneous bipartite graphs (UHBGs), which have wide applications in real world web-scale social networks. We propose random walk-based link sampling and stratified sampling for UHBGs and show that they have advantages over generic random walk samplers. In addition, each sampler's node degree distribution parameter estimator statistic is analytically derived to be used as a quality indicator. In the experiments, we apply the two sampling techniques, with a baseline node sampling method, to both synthetic and real Facebook data. The experimental results show that the random walk-based stratified sampler has a significant advantage over the node sampler and the link sampler on UHBGs.
Modeling information diffusion over social networks for temporal dynamic prediction 1477-1480
  Dong Li; Zhiming Xu; Yishu Luo; Sheng Li; Anika Gupta; Katia Sycara; Shengmei Luo; Lei Hu; Hong Chen
How to model the process of information diffusion in social networks is a critical research task. Although numerous attempts have been made at this problem, few of them can simulate and predict the temporal dynamics of the diffusion process. To address this problem, we propose a novel information diffusion model (GT model), which considers the users in the network as intelligent agents. The agent jointly considers all his interacting neighbors and calculates the payoffs for his different choices to make a strategic decision. We introduce the time factor into the user payoff, enabling the GT model to not only predict the behavior of a user but also to predict when he will perform the behavior. Both the global influence and social influence are explored in the time-dependent payoff calculation, where a new social influence representation method is designed to fully capture the temporal dynamic properties of social influence between users. Experimental results on Sina Weibo and Flickr validate the effectiveness of our methods.
Predicting retweet count using visual cues 1481-1484
  Ethem F. Can; Hüseyin Oktay; R. Manmatha
Social media platforms allow rapid information diffusion, and serve as a source of information to many of the users. Particularly, in Twitter information provided by tweets diffuses over the users through retweets. Hence, being able to predict the retweet count of a given tweet is important for understanding and controlling information diffusion on Twitter. Since the length of a tweet is limited to 140 characters, extracting relevant features to predict the retweet count is a challenging task. However, visual features of images linked in tweets may provide predictive features. In this study, we focus on predicting the expected retweet count of a tweet by using visual cues of an image linked in that tweet in addition to content and structure-based features.
Identifying multilingual Wikipedia articles based on cross language similarity and activity 1485-1488
  Khoi-Nguyen Tran; Peter Christen
Wikipedia is an online free and open access encyclopedia available in many languages. Wikipedia articles across over 280 languages are written by millions of editors. However, the growth of articles and their content is slowing, especially within the largest Wikipedia language: English. The stabilization of articles presents opportunities for multilingual Wikipedia editors to apply their translation skills to add articles and content to smaller Wikipedia languages. In this poster, we propose similarity and activity measures of Wikipedia articles across two languages: English and German. These measures allow us to evaluate the distribution of articles based on their knowledge coverage and their activity across languages. We show the state of Wikipedia articles as of June 2012 and discuss how these measures allow us to develop recommendation and verification models for multilingual editors to enrich articles and content in Wikipedia languages with relatively smaller knowledge coverage.
An efficient algorithm for approximate betweenness centrality computation 1489-1492
  Mostafa Haghir Chehreghani
Betweenness centrality is an important centrality measure widely used in social network analysis, route planning etc. However, even for mid-size networks, it is practically intractable to compute exact betweenness scores. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be adapted with different sampling techniques and give diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method partially satisfying the conditions. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.
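The sampling idea in this abstract can be sketched as follows. The sketch assumes uniform sampling of source nodes with Brandes-style dependency accumulation on an unweighted graph; it illustrates the general framework only, not the specific sampling method proposed in the paper, and all names are hypothetical.

```python
import random
from collections import deque

def source_dependencies(graph, s):
    """Dependency of source s on every node, via BFS and back-propagation."""
    sigma = {v: 0 for v in graph}   # number of shortest paths from s
    sigma[s] = 1
    dist = {s: 0}
    preds = {v: [] for v in graph}
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in graph}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0  # a source is not "between" itself and its targets
    return delta

def approx_betweenness(graph, num_samples, seed=0):
    """Unbiased estimate: average sampled dependencies, scaled by node count."""
    rng = random.Random(seed)
    nodes = list(graph)
    scores = {v: 0.0 for v in nodes}
    for _ in range(num_samples):
        s = rng.choice(nodes)
        for v, d in source_dependencies(graph, s).items():
            scores[v] += d
    scale = len(nodes) / num_samples
    return {v: total * scale for v, total in scores.items()}

# Path a - b - c: only b lies on a shortest path between other nodes.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
estimate = approx_betweenness(path, num_samples=50)
```

The estimate counts ordered source-target pairs, so for an undirected graph it should be halved to match the usual definition. Replacing the uniform `rng.choice` with a non-uniform distribution, together with the matching reweighting, is where different sampling techniques would plug in.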
Exploiting collaborative filtering techniques for automatic assessment of student free-text responses 1493-1496
  Tao Ge; Zhifang Sui; Baobao Chang
The automatic assessment of free-text responses of students is a relatively new task in both computational linguistics and educational technology. The goal of the task is to produce an assessment of student answers to explanation and definition questions typically asked in problems seen in practice exercises or tests. Unlike some conventional methods which assess the student responses based only on information about their corresponding questions, this paper exploits the idea of collaborative filtering to analyze student responses and uses an effective collaborative filtering model -- a feature-based matrix factorization model -- to deal with this challenge. The experimental results show that our feature-based matrix factorization model outperforms the baseline models and that the model with a re-ranking phase can achieve a better and competitive performance -- 63.6% overall accuracy on the Beetle dataset.
Automated probabilistic modeling for relational data 1497-1500
  Sameer Singh; Thore Graepel
Probabilistic graphical model representations of relational data provide a number of desired features, such as inference of missing values, detection of errors, visualization of data, and probabilistic answers to relational queries. However, adoption has been slow due to the high level of expertise expected both in probability and in the domain from the user. Instead of requiring a domain expert to specify the probabilistic dependencies of the data, we present an approach that uses the relational DB schema to automatically construct a Bayesian graphical model for a database. This resulting model contains customized distributions for the attributes, latent variables that cluster the records, and factors that reflect and represent the foreign key links, whilst allowing efficient inference. Experiments demonstrate the accuracy of the model and scalability of inference on synthetic and real-world data.
Semantic discovery from web comparison queries 1501-1504
  Tingting Zhong; Wensheng Wu
Users frequently pose comparison queries (e.g., ibm vs apple) on web search engines. However, little research has been done on understanding these queries. To fill in this gap, this paper describes a first solution to discovering and mining comparison queries. We present a novel snowballing algorithm that "crawls" comparison queries from search engines via their query autocompletion services. We propose a novel modeling approach that represents comparison queries in a comparison graph and develop a novel algorithm that mines closely related concepts from comparison graphs via spectral clustering. Initial experiments indicate that our approach can reveal the inherent semantic relationship among the concepts and discover different senses of a concept, e.g., "toyota" as a car brand or a company name.
Joint learning on sentiment and emotion classification 1505-1508
  Wei Gao; Shoushan Li; Sophia Yat Mei Lee; Guodong Zhou; Chu-Ren Huang
Sentiment and emotion classification have been popularly but separately studied in natural language processing. In this paper, we address joint learning on sentiment and emotion classification where labeled data for both sentiment and emotion classification are available. The objective of this joint learning is for the two tasks to benefit from each other and improve their performance. Specifically, an extra data set that is annotated with both sentiment and emotion labels is employed to estimate the transformation probability between the two kinds of labels. Furthermore, the transformation probability is leveraged to transfer the classification labels so that the two tasks benefit from each other. Empirical studies demonstrate the effectiveness of our approach for the novel joint learning task.
A unified graph model for personalized query-oriented reference paper recommendation 1509-1512
  Fanqi Meng; Dehong Gao; Wenjie Li; Xu Sun; Yuexian Hou
With the tremendous amount of research publications, it has become increasingly important to provide a researcher with a rapid and accurate recommendation of a list of reference papers about a research field or topic. In this paper, we propose a unified graph model that can easily incorporate various types of useful information (e.g., content, authorship, citation and collaboration networks) for efficient recommendation. The proposed model not only allows a thorough exploration of how these types of information can be better combined, but also makes personalized query-oriented reference paper recommendation possible, which as far as we know is a new issue that has not been explicitly addressed in the past. The experiments demonstrate the clear advantages of personalized recommendation over non-personalized recommendation.
Probabilistic latent class models for predicting student performance 1513-1516
  Suleyman Cetintas; Luo Si; Yan Ping Xin; Ron Tzur
Predicting student performance is an important task for many core problems in intelligent tutoring systems. This paper proposes a set of novel probabilistic latent class models for the task. The most effective probabilistic model utilizes all available information about the educational content and users/students to jointly identify hidden classes of students and educational content that share similar characteristics, and to learn a specialized and fine-grained regression model for each latent educational content and student class. Experiments carried out on large-scale real-world datasets demonstrate the advantages of the proposed probabilistic latent class models.
Timeline adaptation for text classification 1517-1520
  Fumiyo Fukumoto; Yoshimi Suzuki; Atsuhiro Takasu
In this paper, we address the text classification problem in which the test data were created in a different period of time from the training data, and present a method for text classification based on temporal adaptation. We first apply lexical chains to the training data to collect semantically related terms and create sets of them (we call these Sem sets). Semantically related terms in the documents are replaced with their representative term. From the results, we identify short terms that are salient for a specific period of time. Finally, we train SVM classifiers by applying a temporal weighting function to each selected short term within the training data, and classify the test data. The temporal weighting function weights each short term in the training data according to the temporal distance between the training and test data. Results on MedLine data show that the method is comparable to the current state-of-the-art biased-SVM method and is especially effective when testing on data far from the training data.
Recommendation via user's personality and social contextual 1521-1524
  He Feng; Xueming Qian
With the advent and popularity of social networks, more and more users like to share their experiences, such as ratings, reviews, and blogs. New social network factors such as interpersonal influence and interest based on circles of friends bring opportunities and challenges for recommender systems (RS) to solve the cold start and sparsity problems of datasets. Some of the social factors have been used in RS, but have not been fully considered. In this paper, three social factors, personal interest, interpersonal interest similarity and interpersonal influence, are fused into a unified personalized recommendation model based on probabilistic matrix factorization. The factor of personal interest can make the RS recommend items that meet users' individualities, especially for experienced users. Moreover, for cold start users, the interpersonal interest similarity and interpersonal influence can enhance the intrinsic link among features in the latent space. We conduct a series of experiments on real rating datasets. Experimental results show that the proposed approach outperforms the existing RS approaches.
A fast convergence clustering algorithm merging MCMC and EM methods 1525-1528
  David Sergio Matusevich; Carlos Ordonez; Veerabhadran Baladandayuthapani
Clustering is a fundamental problem in statistics and machine learning, whose solution is commonly computed by the Expectation-Maximization (EM) method, which finds a locally optimal solution for an objective function called log-likelihood. Since the surface of the log-likelihood function is non convex, a stochastic search with Markov Chain Monte Carlo (MCMC) methods can help escaping locally optimal solutions. In this article, we tackle two fundamental conflicting goals: Finding higher quality solutions and achieving faster convergence. With that motivation in mind, we introduce an efficient algorithm that combines elements of the EM and MCMC methods to find clustering solutions that are qualitatively better than those found by the standard EM method. Moreover, our hybrid algorithm allows tuning model parameters and understanding the uncertainty in their estimation. The main issue with MCMC methods is that they generally require a very large number of iterations to explore the posterior of each model parameter. Convergence is accelerated by several algorithmic improvements which include sufficient statistics, simplified model parameter priors, fixing covariance matrices and iterative sampling from small blocks of the data set. A brief experimental evaluation shows promising results.
Discrimination aware classification for imbalanced datasets 1529-1532
  Goce Ristanoski; Wei Liu; James Bailey
The problem of learning a discrimination aware model has recently received attention in the data mining community. Various methods and improved models have been proposed, with the main approach being the detection of a discrimination sensitive attribute. Once the discrimination sensitive attribute is identified, the methods aim to develop a strategy that will include the useful information from that attribute without causing any additional discrimination. Our work focuses on an aspect often overlooked in the discrimination aware classification -- the scenario of an imbalanced dataset, where the number of samples from one class is disproportionate to the other. We also investigate a strategy that is directly minimizing discrimination and is independent of the class balance. Our empirical results indicate additional concerns that need to be considered when developing discrimination aware classifiers, and our proposed strategy shows promise in overcoming these concerns.
Incremental shared nearest neighbor density-based clustering 1533-1536
  Sumeet Singh; Amit Awekar
Shared Nearest Neighbor Density-based clustering (SNN-DBSCAN) is a robust graph-based clustering algorithm and has wide applications from climate data analysis to network intrusion detection. We propose an incremental extension to this algorithm, IncSNN-DBSCAN, capable of finding clusters on a dataset to which frequent inserts are made. For each data point, the algorithm maintains four properties: nearest neighbor list, strengths of shared links, total connection strength, and topic property. The algorithm only targets points whose properties change. We prove that, to obtain the exact clustering, it is sufficient to re-compute properties for only the targeted points, followed by possible cluster mergers on newly formed links and cluster splits on the deleted links.
   Experiments on the KDD Cup 1999 and Mopsi search engine 2012 datasets demonstrate 75% and 99% reductions, respectively, in the size of the set of points involved in property re-computations. By avoiding most of the redundant property computations, the algorithm achieves speedups of up to 250 and 1000 times, respectively, on those datasets, while generating exactly the same clustering as the non-incremental algorithm. We experimentally verify our claim for up to 2500 inserts on both datasets. However, the speedup comes at the cost of up to 48 times more memory usage.
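Two of the per-point properties listed in this abstract, the nearest neighbor list and the strength of a shared link, can be sketched in their non-incremental form. The sketch assumes the common shared-nearest-neighbor definition, in which two points are linked only when each is in the other's k-nearest-neighbor list and the strength is the number of neighbors they share. It does not implement the incremental updates of the paper, and the names are hypothetical.

```python
def knn_lists(points, k):
    """Brute-force k-nearest-neighbor sets under squared Euclidean distance."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def snn_strength(i, j, neighbors):
    """Shared-link strength: count of common neighbors, 0 if not mutual neighbors."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])

# Two well-separated groups on a line.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
neighbors = knn_lists(points, k=2)
within = snn_strength(0, 1, neighbors)   # same group
across = snn_strength(2, 3, neighbors)   # different groups
```

Inserting a point can change only the neighbor lists it enters, which is why an incremental variant can restrict re-computation to the affected points.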
The essence of knowledge (bases) through entity rankings 1537-1540
  Evica Ilieva; Sebastian Michel; Aleksandar Stupar
We consider the task of automatically phrasing and computing top-k rankings over the information contained in common knowledge bases (KBs), such as YAGO or DBPedia. We assemble the thematic focus and ranking criteria of rankings by inspecting the present Subject, Predicate, Object (SPO) triples. Making use of numerical attributes contained in the KB we are also able to compute the actual ranking content, i.e., entities and their performances. We further discuss the integration of existing rankings into the ranking generation process for increased coverage and ranking quality. We report on first results obtained using the YAGO knowledge base.
Chinese syntactic parsing based on linguistic entity-relationship model 1541-1544
  Dechun Yin
In this paper, we present a new parsing method for Chinese based on a newly proposed linguistic entity relationship model. In the model, we extract and define the linguistic entity relationship modes to describe the most basic syntactic and semantic structures of Chinese, and use the relationship modes as the foundation to implement the parsing algorithm. In contrast to rule-based and corpus-based methods, we neither manually write a large number of rules as in traditional rule-based methods nor use a corpus to train the model. We use only a few meta-rules to describe the grammars in the parsing procedure. Syntactic parsing based on the model outperforms the corpus-based baseline system.
Clustering-based anomaly detection in multi-view data 1545-1548
  Alejandro Marcos Alvarez; Makoto Yamada; Akisato Kimura; Tomoharu Iwata
This paper proposes a simple yet effective anomaly detection method for multi-view data. The proposed approach detects anomalies by comparing the neighborhoods in different views. Specifically, clustering is performed separately in the different views and affinity vectors are derived for each object from the clustering results. Then, the anomalies are detected by comparing affinity vectors in the multiple views. An advantage of the proposed method over existing methods is that the tuning parameters can be determined effectively from the given data. Through experiments on synthetic and benchmark datasets, we show that the proposed method outperforms existing methods.
Discovering relations using matrix factorization methods BIBAFull-Text 1549-1552
  Ervina Cergani; Pauli Miettinen
Traditional relation extraction methods work on manually defined relations and typically expect manually labelled extraction patterns for each relation. This strongly limits the scalability of these systems. In Open Relation Extraction (ORE), the relations are identified automatically based on co-occurrences of "surface relations" (contexts) and entity pairs. Recently proposed methods for ORE use partition clustering to find the relations. In this work we propose the use of matrix factorization methods instead of clustering. Specifically, we study Non-Negative Matrix Factorization (NMF) and Boolean Matrix Factorization (BMF). These methods overcome many problems inherent in clustering and perform better than k-means clustering in our evaluation.
On exploiting content and citations together to compute similarity of scientific papers BIBAFull-Text 1553-1556
  Masoud Reyhani Hamedani; Sang-Wook Kim; Sang-Chul Lee; Dong-Jin Kim
In computing the similarity of scientific papers, previous text-based and link-based similarity measures look at only a single side of the content and citations. In this paper, we propose a novel approach called SimCC that effectively combines the content and citation information to accurately compute the similarity of scientific papers. Unlike previous approaches, SimCC effectively represents both authority and context of a scientific paper simultaneously in computing similarities. Also, we propose SimCC+A to consider recently-published papers. The effectiveness of our proposed method is demonstrated via extensive experiments on a real-world dataset of scientific papers, with more than 100% improvement in accuracy compared with previous methods.
Taxonomy-based regression model for cross-domain sentiment classification BIBAFull-Text 1557-1560
  Cong-Kai Lin; Yang-Yin Lee; Chi-Hsin Yu; Hsin-Hsi Chen
Most cross-domain sentiment classification techniques consider a domain as a whole set of instances for training. However, many online shopping websites organize their data according to a taxonomy. This paper takes the Amazon shopping website as an example, and proposes a tree-structured domain representation scheme in which each node in the tree is encoded as a bit sequence to preserve its relationship with all the other nodes in the tree. To select an appropriate source node for training in the domain taxonomy, we propose a Taxonomy-Based Regression Model (TBRM) which predicts the accuracy loss from multiple source nodes to a target node using the tree-structured domain representation combined with domain similarity and domain complexity. The source node with the smallest accuracy loss is used to train a classifier which makes a prediction on the target node. The results show that our TBRM achieves better performance than regression models that do not consider the taxonomy information.
Reconciliation of categorical opinions from multiple sources BIBAFull-Text 1561-1564
  Adway Mitra; Srujana Merugu
Reconciling opinions from multiple sources on questions of interest to determine the correct answers is an important problem encountered in collaborative information systems such as Q & A forums and prediction markets. Our current work focuses on a widely applicable variant of the above problem where the opinions and answers are categorical-valued with the set of values possibly varying across questions. Most of the existing techniques are tailored only for binary opinions and cannot be effectively adapted for questions with categorical opinions. To address this, we propose a generic Bayesian framework for opinion reconciliation that can readily incorporate latent and observed attributes of sources and subjects. For the scenario of interest, we derive three specific model instantiations of the general approach (CTM, CTM-OSF, CTM-LSG), which respectively capture the latent source behavior, variations of source behavior across subject groups, and inter-source correlations. Empirical results on real-world datasets point to the relative superiority of the proposed models over existing baselines.
An unsupervised transfer learning approach to discover topics for online reputation management BIBAFull-Text 1565-1568
  Tamara Martín-Wanton; Julio Gonzalo; Enrique Amigó
Microblogs play an important role in Online Reputation Management. Companies and organizations in general have an increasing interest in obtaining up-to-the-minute information about the emerging topics that concern their reputation. In this paper, we present a new technique to cluster a collection of tweets emitted within a short time span about a specific entity. Our approach relies on transfer learning by contextualizing a target collection of tweets with a large set of unlabeled "background" tweets that help improve the clustering of the target collection. We include background tweets together with target tweets in a TwitterLDA process, and we set the total number of clusters. In practice, this means that the system can adapt to find the right number of clusters for the target data, overcoming one of the limitations of LDA-based approaches (the need to establish the number of clusters a priori). Our experiments using RepLab 2012 data show that using the background collection gives a 20% improvement over a direct application of TwitterLDA using only the target collection. Our data also confirm that the approach can effectively predict the right number of target clusters in a way that is robust with respect to the total number of clusters established a priori.
Discovering facts with boolean tensor tucker decomposition BIBAFull-Text 1569-1572
  Dora Erdos; Pauli Miettinen
Open Information Extraction (Open IE) has gained increasing research interest in recent years. The first step in Open IE is to extract raw subject -- predicate -- object triples from the data. These raw triples are rarely usable per se, and need additional post-processing. To that end, we propose the use of Boolean Tucker tensor decomposition to simultaneously find the entity and relation synonyms and the facts connecting them from the raw triples. Our method represents the synonym sets and facts using (sparse) binary matrices and tensors that can be efficiently stored and manipulated. We consider the presentation of the problem as a Boolean tensor decomposition as one of this paper's main contributions. To study the validity of this approach, we use a recent algorithm for scalable Boolean Tucker decomposition. We validate the results with empirical evaluation on a new semi-synthetic data set, generated to faithfully reproduce real-world data features, as well as with real-world data from an existing Open IE extractor. We show that our method obtains high precision while the low recall can easily be remedied by considering the original data together with the decomposition.
Intelligent SSD: a turbo for big data mining BIBAFull-Text 1573-1576
  Duck-Ho Bae; Jin-Hyung Kim; Sang-Wook Kim; Hyunok Oh; Chanik Park
This paper introduces the notion of intelligent SSDs. First, we present the design considerations of intelligent SSDs, and then examine their potential benefits under various settings in data mining applications.
Software plagiarism detection: a graph-based approach BIBAFull-Text 1577-1580
  Dong-Kyu Chae; Jiwoon Ha; Sang-Wook Kim; BooJoong Kang; Eul Gyu Im
As plagiarism of software increases rapidly, there are growing needs for software plagiarism detection systems. In this paper, we propose a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program. The A-CFG can reflect both the sequence and the frequency of APIs, while previous work rarely considers both of them together. To perform a scalable comparison of a pair of A-CFGs, we use random walk with restart (RWR) that computes an importance score for each node in a graph. By the RWR, we can generate a single score vector for an A-CFG and can also compare A-CFGs by comparing their score vectors. Extensive evaluations on a set of Windows applications demonstrate the effectiveness and the scalability of our proposed system compared with existing methods.
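The random walk with restart step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the restart probability `c`, the uniform restart distribution, the handling of nodes without successors, and the use of cosine similarity to compare score vectors are all assumptions made for illustration, and the API names in the toy graphs are placeholders.

```python
import math

def random_walk_with_restart(adj, c=0.15, iters=200, tol=1e-12):
    """Return an RWR score per node of a directed graph.

    adj maps each node to a list of successor nodes (every successor must
    also be a key). Assumed here: a uniform restart distribution and a
    restart probability c.
    """
    nodes = sorted(adj)
    restart = {n: 1.0 / len(nodes) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: c * restart[n] for n in nodes}
        for u in nodes:
            walk_mass = (1.0 - c) * score[u]
            if adj[u]:
                share = walk_mass / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                # Node without successors: send its mass back to the restart set.
                for n in nodes:
                    nxt[n] += walk_mass * restart[n]
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return score

def cosine(a, b):
    """Cosine similarity of two score vectors keyed by node label."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Toy API-labeled graphs (labels are illustrative placeholders).
g1 = {"CreateFile": ["WriteFile"], "WriteFile": ["CloseHandle"], "CloseHandle": []}
g2 = {"CreateFile": ["WriteFile"], "WriteFile": ["CloseHandle"], "CloseHandle": []}
g3 = {"CreateFile": ["ReadFile"], "ReadFile": ["CloseHandle"], "CloseHandle": []}

s1, s2, s3 = (random_walk_with_restart(g) for g in (g1, g2, g3))
same = cosine(s1, s2)
different = cosine(s1, s3)
```

In this sketch each graph collapses to one vector, so comparing two programs costs a single vector comparison rather than a graph matching.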
Objectionable content filtering by click-through data BIBAFull-Text 1581-1584
  Lung-Hao Lee; Yen-Cheng Juan; Hsin-Hsi Chen; Yuen-Hsien Tseng
This paper explores users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to objectionable content filtering. A user's access trail represented as a sequence of URLs reveals the contextual information of web browsing behaviors. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for objectionable access identification without requesting the corresponding page content. Error analysis indicates that our proposed model results in a low false positive rate of 0.0571. In real-life filtering simulations, our proposed model achieves a macro-averaged blocking rate of 0.9271 while maintaining a favorably low macro-averaged over-blocking rate of 0.0575 when collaboratively filtering objectionable content as the dynamic web changes over time.

Industry session

Computational advertising: the LinkedIn way BIBAFull-Text 1585-1586
  Deepak Agarwal
LinkedIn is the largest professional social network in the world with more than 238M members. It provides a platform for advertisers to reach out to professionals and target them using rich profile and behavioral data. Thus, online advertising is an important business for LinkedIn. In this talk, I will give an overview of the machine learning and optimization components that power LinkedIn's self-serve display advertising systems. The talk will focus not only on machine learning and optimization methods, but also on various practical challenges that arise when running such components in a real production environment. I will describe how we overcome some of these challenges to bridge the gap between theory and practice.
   The major components that will be described in detail include the following. Response prediction: The goal of this component is to estimate click-through rates (CTR) when an ad is shown to a user in a given context. Given the data sparseness due to low CTR for advertising applications in general and the curse of dimensionality, estimating such interactions is known to be challenging. Furthermore, the goal of the system is to maximize expected revenue, hence this is an explore/exploit problem and not a supervised learning problem. Our approach takes recourse to supervised learning to reduce dimensionality and couples it with classical explore/exploit schemes to balance the explore/exploit tradeoff. In particular, we use a large scale logistic regression to estimate user and ad interactions. Such interactions comprise two additive terms: a) stable interactions captured by using features for both users and ads whose coefficients change slowly over time, and b) ephemeral interactions that capture ad-specific residual idiosyncrasies that are missed by the stable component. Exploration is introduced via Thompson sampling on the ephemeral interactions (sample coefficients from the posterior distribution), since the stable part is estimated using large amounts of data and subject to very little statistical variance. Our model training pipeline estimates the stable part using a scatter and gather approach via the ADMM algorithm; the ephemeral part is estimated more frequently by learning a per-ad correction through an ad-specific logistic regression. Scoring thousands of ads at runtime under tight latency constraints is a formidable challenge when using such models; the talk will describe methods to scale such computations at runtime.
   Automatic Format Selection: The presentation of ads in a given slot on a page has a significant impact on how users interact with them. Web designers are adept at creating good formats to facilitate ad display but selecting the best among those automatically is a machine learning task. I will describe a machine learning approach we use to solve this problem. It is again an explore/exploit problem but the dimensionality of this problem is much lower than that of the ad selection problem. I will also provide a detailed description of how we deal with issues like budget pacing, bid forecasting, supply forecasting and targeting. Throughout, the ML components will be illustrated with real examples from production, and evaluation metrics will be reported from live tests. Offline metrics that can be useful in evaluating methods before launching them on live traffic will also be discussed.
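The exploration step described under Response prediction (a stable logit plus a sampled ephemeral per-ad correction) might look roughly like the sketch below. It is only an illustration under stated assumptions: the posterior of each ad's correction is approximated by a one-dimensional Gaussian, the ad with the highest sampled CTR is served, and all names (`thompson_select`, `stable_logits`, `ephemeral_posteriors`) are hypothetical rather than taken from the talk.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def thompson_select(stable_logits, ephemeral_posteriors, rng):
    """Pick one ad by Thompson sampling on the ephemeral correction.

    stable_logits: ad -> logit from the slowly changing model (held fixed).
    ephemeral_posteriors: ad -> (mean, std) of an assumed Gaussian posterior
        over that ad's additive correction.
    """
    best_ad, best_ctr = None, -1.0
    for ad, stable in stable_logits.items():
        mean, std = ephemeral_posteriors[ad]
        correction = rng.gauss(mean, std)       # one posterior sample per ad
        sampled_ctr = sigmoid(stable + correction)
        if sampled_ctr > best_ctr:
            best_ad, best_ctr = ad, sampled_ctr
    return best_ad

rng = random.Random(0)
stable_logits = {"ad_a": -3.0, "ad_b": -4.0}
# ad_a is well understood (no uncertainty); ad_b is new and uncertain.
posteriors = {"ad_a": (0.0, 0.0), "ad_b": (0.0, 2.0)}

picks = [thompson_select(stable_logits, posteriors, rng) for _ in range(2000)]
explored = picks.count("ad_b")
```

Because only the ephemeral term is sampled, an ad with a narrow posterior is scored almost deterministically, while an uncertain ad is occasionally served and its posterior can tighten as feedback arrives.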
Automatic ad format selection via contextual bandits BIBAFull-Text 1587-1594
  Liang Tang; Romer Rosales; Ajit Singh; Deepak Agarwal
Visual design plays an important role in online display advertising: changing the layout of an online ad can increase or decrease its effectiveness, measured in terms of click-through rate (CTR) or total revenue. The decision of which layout to use for an ad involves a trade-off: using a layout provides feedback about its effectiveness (exploration), but collecting that feedback requires sacrificing the immediate reward of using a layout we already know is effective (exploitation). To balance exploration with exploitation, we pose automatic layout selection as a contextual bandit problem. There are many bandit algorithms, each generating a policy which must be evaluated. It is impractical to test each policy on live traffic. However, we have found that offline replay (a.k.a. exploration scavenging) can be adapted to provide an accurate estimator for the performance of ad layout policies at LinkedIn, using only historical data about the effectiveness of layouts. We describe the development of our offline replayer, and benchmark a number of common bandit algorithms.
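As a rough illustration of the offline replay idea in the abstract above, the sketch below scores a layout policy using only logged events. It is a simplified sketch, not the replayer described in the paper: it assumes the logged layouts were chosen uniformly at random and independently of context (the usual condition under which this kind of estimate is unbiased), and the function names, event format, and layout labels are hypothetical.

```python
def replay_evaluate(policy, logged_events, update=None):
    """Estimate a policy's average reward from logged (context, action, reward).

    An event is used only when the policy would have chosen the same action
    that was logged; all other events are skipped.
    Returns (estimated_reward_or_None, number_of_matched_events).
    """
    matched = 0
    total_reward = 0.0
    for context, logged_action, reward in logged_events:
        if policy(context) == logged_action:
            matched += 1
            total_reward += reward
            if update is not None:
                # Let a learning policy see only the feedback it "earned".
                update(context, logged_action, reward)
    estimate = total_reward / matched if matched else None
    return estimate, matched

# Toy log: (context, layout shown, click).
log = [
    ("mobile", "large", 1),
    ("mobile", "small", 0),
    ("desktop", "large", 0),
    ("desktop", "small", 1),
]

def always_large(context):
    return "large"

def by_device(context):
    return "large" if context == "mobile" else "small"

est_large, n_large = replay_evaluate(always_large, log)
est_device, n_device = replay_evaluate(by_device, log)
```

On this toy log the device-aware policy matches the two clicked events and scores higher than the fixed policy, which is the kind of comparison a replayer makes possible without live traffic.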

DB track: graphs and storage systems

Graph similarity search with edit distance constraint in large graph databases BIBAFull-Text 1595-1600
  Weiguo Zheng; Lei Zou; Xiang Lian; Dong Wang; Dongyan Zhao
Due to many real applications of graph databases, it has become increasingly important to retrieve graphs g (in graph database D) that approximately match with query graph q, rather than exact subgraph matches. In this paper, we study the problem of graph similarity search, which retrieves graphs that are similar to a given query graph under the constraint of the minimum edit distance. Specifically, we derive a lower bound, branch-based bound, which can greatly reduce the search space of the graph similarity search. We also propose a tree index structure, namely b-tree, to facilitate effective pruning and efficient query processing. Extensive experiments confirm that our proposed approach outperforms the existing approaches by orders of magnitude, in terms of both pruning power and query response time.
Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths BIBAFull-Text 1601-1606
  Yosuke Yano; Takuya Akiba; Yoichi Iwata; Yuichi Yoshida
Answering reachability queries on directed graphs is one of the most fundamental and important operations in the many applications that involve graph-shaped data. However, it is still highly challenging to process such queries efficiently on large-scale graphs. Transitive-closure-based methods consume prohibitively large index space, and online-search-based methods answer queries too slowly. Labeling-based methods attain both small index size and fast query time, but previous indexing algorithms do not scale to today's large graphs. In this paper, we propose new labeling-based methods for reachability queries, referred to as pruned landmark labeling and pruned path labeling. They follow the frameworks of 2-hop cover and 3-hop cover, but their indexing algorithms are based on the recent notion of pruned labeling and improve the indexing time by several orders of magnitude, resulting in applicability to large graphs with tens of millions of vertices and edges. Our experimental results show that they attain remarkable trade-offs between fast query time, small index size and scalability, which previous methods have never been able to achieve. Furthermore, we also discuss the ingredients of the efficiency of our methods by a novel theoretical analysis based on the graph minor theory.
Graph hashing and factorization for fast graph stream classification BIBAFull-Text 1607-1612
  Ting Guo; Lianhua Chi; Xingquan Zhu
Graph stream classification concerns building learning models from continuously growing graph data, in which an essential step is to explore subgraph features to represent graphs for effective learning and classification. When representing a graph using subgraph features, all existing methods employ coarse-grained feature representation, which only considers whether or not a subgraph feature appears in the graph. In this paper, we propose a fine-grained graph factorization approach for Fast Graph Stream Classification (FGSC). Our main idea is to find a set of cliques as feature base to represent each graph as a linear combination of the base cliques. To achieve this goal, we decompose each graph into a number of cliques and select discriminative cliques to generate a transfer matrix called Clique Set Matrix (M). By using M as the base for formulating graph factorization, each graph is represented in a vector space with each element denoting the degree of the corresponding subgraph feature related to the graph, so existing supervised learning algorithms can be applied to derive learning models for graph classification.
Efficiently anonymizing social networks with reachability preservation BIBAFull-Text 1613-1618
  Xiangyu Liu; Bin Wang; Xiaochun Yang
The goal of graph anonymization is to avoid disclosure of private information in social networks through graph modifications while preserving the data utility of the anonymized graph for social network analysis. Graph reachability is an important data utility, as reachability queries are not only common on graph databases but also serve as fundamental operations for many other graph queries. However, graph reachability is severely distorted after anonymization. In this paper, we solve this problem by designing a reachability preserving anonymization (RPA for short) algorithm. The main idea of RPA is to organize vertices into groups and greedily anonymize each vertex with low anonymization cost on reachability. We propose the reachable interval to efficiently measure the anonymization cost incurred by an edge addition, which guarantees the high efficiency of RPA. Extensive experiments illustrate that anonymized social networks generated by our methods preserve high utility on reachability.
ImG-complex: graph data model for topology of unstructured meshes BIBAFull-Text 1619-1624
  Alireza Rezaei Mahdiraji; Peter Baumann; Guntram Berti
Although many applications use unstructured meshes, there is no specialized mesh database which supports storing and querying mesh data. Existing mesh libraries do not support declarative querying and are expensive to maintain. A mesh database can benefit these domains in several ways, such as a declarative query language and ease of maintenance. In this paper, we propose the Incidence multi-Graph Complex (ImG-Complex) data model for storing topological aspects of meshes in a database. ImG-Complex extends the incidence graph (IG) model with multi-incidence information to represent a new object class which we call ImG-Complexes. We introduce optional and application-specific constraints to limit the ImG model to smaller object classes and validate mesh structures based on the modeled object class properties. We show how the Neo4j graph database can be used to query mesh topology based on the (possibly constrained) ImG model. Finally, we experimentally compare the performance of Neo4j and PostgreSQL on executing topological mesh queries.
ROU: advanced keyword search on graph BIBAFull-Text 1625-1630
  Yifan Pan; Yuqing Wu
Keyword search, the major means for Internet search engines, has recently been explored in structured and semi-structured data. What is yet to be explored thoroughly is how optional and negative keywords can be expressed, what the results should be and how such search queries can be evaluated efficiently. In this paper, we formally define a new type of keyword search query, ROU-query, which takes as input keywords in three categories: required, optional and unwanted, and returns as output sets of nodes in the data graph whose neighborhood satisfies the keyword requirements. We define multiple semantics, including maximal coverage and minimal footprint, to ensure the meaningfulness of results. We propose the query induced partite graph (QuIP), which can capture the constraints on neighborhood size and unwanted keywords, and propose a family of algorithms for the evaluation of ROU-queries. We conducted extensive experimental evaluations to show that our approaches are able to generate results for ROU-queries efficiently.
Hotness-aware buffer management for flash-based hybrid storage systems BIBAFull-Text 1631-1636
  Yanfei Lv; Bin Cui; Xuexuan Chen; Jing Li
Flash solid-state drives (SSDs) provide much faster access to data compared with traditional hard disk drives (HDDs). The current price and performance of SSDs suggest they can be adopted as a data buffer between main memory and HDD, and buffer management policy in such hybrid systems has recently attracted growing interest from the research community. In this paper, we propose a novel approach to manage the buffer in flash-based hybrid storage systems, named Hotness Aware Hit (HAT). HAT exploits a page reference queue to record the access history as well as the status of accessed pages, i.e., hot, warm and cold. Additionally, the page reference queue is further split into hot and warm regions which correspond to the memory and flash in general. The HAT approach updates the page status and deals with the page migration in the memory hierarchy according to the current page status and hit position in the page reference queue. Our empirical evaluation on benchmark traces demonstrates the superiority of the proposed strategy over state-of-the-art competitors.
Expedited rating of data stores using agile data loading techniques BIBAFull-Text 1637-1642
  Sumita Barahmand; Shahram Ghandeharizadeh
To benchmark and rate a data store, one must repeat experiments that impose a different amount of load on the data store. Workloads that modify the benchmark database may require the same database to be loaded repeatedly. This may constitute a significant portion of the time to rate a data store. This paper presents several agile data loading techniques to expedite the rating process. These techniques include generating the disk image of the database once and re-using it, restoring the updated data items to their original value, maintaining in-memory state of the database across different experiments to avoid repeated loading of the database altogether, and a hybrid of the third technique in combination with the other two. These techniques are general purpose and apply to a variety of cloud benchmarks. We investigate their implementation and evaluation in the context of one, the BG benchmark. Obtained results show a factor of two to twelve speedup in the rating process. As an example, when evaluating MongoDB with a million member BG database, we show these techniques expedite BG's rating from 4 months (123 days) of continuous running to less than 11 days for the first rating experiment. Subsequent ratings of MongoDB with different workloads using the same database are much faster, on the order of hours.

KM track: social networks and media

Social recommendation incorporating topic mining and social trust analysis BIBAFull-Text 1643-1648
  Tong Zhao; Chunping Li; Mengya Li; Qiang Ding; Li Li
We study the problem of social recommendation incorporating topic mining and social trust analysis. Unlike other work on social recommendation, we merge topic mining and social trust analysis techniques into recommender systems for finding topics from the tags of the items and estimating the topic-specific social trust. We propose a probabilistic matrix factorization (TTMF) algorithm and try to enhance the recommendation accuracy by utilizing the estimated topic-specific social trust relations. Moreover, TTMF also conveniently addresses the item cold start problem by inferring the feature (topic) of new items from their tags. Experiments are conducted on three different data sets. The results validate the effectiveness of our method for improving recommendation performance and its applicability to the cold start problem.
Originator or propagator?: incorporating social role theory into topic models for Twitter content analysis BIBAFull-Text 1649-1654
  Xin Wayne Zhao; Jinpeng Wang; Yulan He; Jian-Yun Nie; Xiaoming Li
A large number of studies have been devoted to modeling the contents and interactions between users on Twitter. In this paper, we propose a method inspired by Social Role Theory (SRT), which assumes that a user behaves differently with different roles in the generation process of Twitter content. We consider the two most distinctive social roles on Twitter: originator and propagator, who respectively post original messages and retweet or forward messages from others. In addition, we also consider role-specific social interactions, especially implicit interactions between users who share some common interests. All the above elements are integrated into a novel regularized topic model. We evaluate the proposed method on real Twitter data. The results show that our method is more effective than existing ones that do not distinguish social roles.
An effective latent networks fusion based model for event recommendation in offline ephemeral social networks BIBAFull-Text 1655-1660
  Guoqiong Liao; Yuchen Zhao; Sihong Xie; Philip S. Yu
Offline ephemeral social networks (OffESNs) are networks created ad hoc at a specific location for a specific purpose and lasting for a short period of time, relying on mobile social media such as Radio Frequency Identification (RFID) and Bluetooth devices. The primary purpose of people in OffESNs is to acquire and share information by attending prescheduled events. Event recommendation over this kind of network can help attendees select prescheduled events and help organizers plan resources. However, because users' preference and rating information, as well as explicit social relations, are lacking, existing recommendation methods do not work well for recommending events in OffESNs. To address challenges such as how to derive latent preferences and social relations and how to fuse the latent information in a unified model, we first construct two heterogeneous interaction social networks, an event participation network and a physical proximity network. Then, we use them to derive users' latent preferences and latent networks on social relations, including like-minded peers, co-attendees and friends. Finally, we propose an LNF (Latent Networks Fusion) model under a pairwise factor graph to infer event attendance probabilities for recommendation. Experiments on an RFID-based real conference dataset have demonstrated the effectiveness of the proposed model compared with typical solutions.
Predicting trends in social networks via dynamic activeness model BIBAFull-Text 1661-1666
  Shuyang Lin; Xiangnan Kong; Philip S. Yu
Through the effect of word of mouth, trends in social networks are now playing a significant role in shaping people's lives. Predicting dynamic trends is an important problem with many useful applications. There are three dynamic characteristics of a trend that should be captured by a trend model: intensity, coverage and duration. However, existing approaches to information diffusion are not capable of capturing these three characteristics. In this paper, we study the problem of predicting dynamic trends in social networks. We first define related concepts to quantify the dynamic characteristics of trends in social networks, and formalize the problem of trend prediction. We then propose a Dynamic Activeness (DA) model based on the novel concept of activeness, and design a trend prediction algorithm using the DA model. We examine the prediction algorithm on the DBLP network, and show that it is more accurate than state-of-the-art approaches.
Dyadic event attribution in social networks with mixtures of Hawkes processes BIBAFull-Text 1667-1672
  Liangda Li; Hongyuan Zha
In many applications in social network analysis, it is important to model the interactions and infer the influence between pairs of actors, leading to the problem of dyadic event modeling, which has attracted increasing interest recently. In this paper we focus on the problem of dyadic event attribution, an important missing data problem in dyadic event modeling where one needs to infer the missing actor-pairs of a subset of dyadic events based on their observed timestamps. Existing works either use fixed model parameters and heuristic rules for event attribution, or assume the dyadic events across actor-pairs are independent. To address those shortcomings we propose a probabilistic model based on mixtures of Hawkes processes that simultaneously tackles event attribution and network parameter inference, taking into consideration the dependency among dyadic events that share at least one actor. We also investigate using additive models to incorporate regularization to avoid overfitting. Our experiments on both synthetic data and real-world data sets on international armed conflicts suggest that the proposed method can significantly improve accuracy compared with the state of the art for dyadic event attribution.
Modeling temporal effects of human mobile behavior on location-based social networks BIBAFull-Text 1673-1678
  Huiji Gao; Jiliang Tang; Xia Hu; Huan Liu
The rapid growth of location-based social networks (LBSNs) invigorates an increasing number of LBSN users, providing an unprecedented opportunity to study human mobile behavior from spatial, temporal, and social aspects. Among these aspects, temporal effects offer an essential contextual cue for inferring a user's movement. Strong temporal cyclic patterns have been observed in user movement in LBSNs with their correlated spatial and social effects (i.e., temporal correlations). It is a propitious time to model these temporal effects (patterns and correlations) on a user's mobile behavior. In this paper, we present the first comprehensive study of temporal effects on LBSNs. We propose a general framework to exploit and model temporal cyclic patterns and their relationships with spatial and social data. The experimental results on two real-world LBSN datasets validate the power of temporal effects in capturing user mobile behavior, and demonstrate the ability of our framework to select the most effective location prediction algorithm under various combinations of prediction models.
Social media news communities: gatekeeping, coverage, and statement bias BIBAFull-Text 1679-1684
  Diego Saez-Trumper; Carlos Castillo; Mounia Lalmas
We examine biases in online news sources and social media communities around them. To that end, we introduce unsupervised methods considering three types of biases: selection or "gatekeeping" bias, coverage bias, and statement bias, characterizing each one through a series of metrics. Our results, obtained by analyzing 80 international news sources during a two-week period, show that biases are subtle but observable, and follow geographical boundaries more closely than political ones. We also demonstrate how these biases are to some extent amplified by social media.
Discovering health-related knowledge in social media using ensembles of heterogeneous features BIBAFull-Text 1685-1690
  Suppawong Tuarob; Conrad S. Tucker; Marcel Salathe; Nilam Ram
Social media is emerging as a powerful source of communication, information dissemination and mining. Being colloquial and ubiquitous in nature makes it easier for users to express their opinions and preferences in a seamless, dynamic manner. Epidemic surveillance systems that utilize social media to detect the emergence of diseases have been proposed in the literature. These systems mostly employ traditional document classification techniques that represent a document with a bag of N-grams. However, such techniques are not optimal for social media, where sparsity and noise are the norm. The authors address the limitations posed by the traditional N-gram based methods and propose to use features that represent different semantic aspects of the data in combination with ensemble machine learning techniques to identify health-related messages in a heterogeneous pool of social media data. Furthermore, the results reveal significant improvement in identifying health-related social media content, which can be critical in the emergence of a novel, unknown disease epidemic.
Seeking provenance of information using social media BIBAFull-Text 1691-1696
  Pritam Gundecha; Zhuo Feng; Huan Liu
Social media propagates breaking news and disinformation alike fast and on an unsurpassed scale. Because of its democratizing nature, social media users can easily produce, receive, and propagate a piece of information without necessarily providing traceable information. Thus, there are no means for a user to verify the provenance (aka sources or originators) of information. The disinformation can cause tragic consequences to society and individuals. This work aims to take advantage of characteristics of social media to provide a solution to the problem of lacking traceable information. Such knowledge can provide additional context to received information such that a user can assess how much value, trust, and validity should be placed in it. In this paper, we are studying a novel research problem that facilitates the seeking of the provenance of information for a few known recipients (less than 1% of the total recipients) by recovering the paths it has taken from its originators. The proposed methodology exploits easily computable node centralities of a large social media network. The experimental results with Facebook and Twitter datasets show that the proposed mechanism is effective in correctly identifying the additional recipients and seeking the provenance of information.

KM track: text

Compact explanatory opinion summarization BIBAFull-Text 1697-1702
  Hyun Duk Kim; Malu Castellanos; Meichun Hsu; ChengXiang Zhai; Umeshwar Dayal; Riddhiman Ghosh
In this paper, we propose a novel opinion summarization problem called compact explanatory opinion summarization (CEOS) which aims to extract within-sentence explanatory text segments from input opinionated texts to help users better understand the detailed reasons of sentiments. We propose and study general methods for identifying candidate boundaries and scoring the explanatoriness of text segments using Hidden Markov Models. We create new data sets and use a new evaluation measure to evaluate CEOS. Experimental results show that the proposed methods are effective for generating an explanatory opinion summary, outperforming a standard text summarization method.
Towards an enhanced and adaptable ontology by distilling and assembling online encyclopedias BIBAFull-Text 1703-1708
  Shan Jiang; Lidong Bing; Yan Zhang
In this paper, we investigate the problem of making better use of semantic knowledge obtained from different encyclopedia sources. We propose a framework to integrate different encyclopedias and reorganize the information. We also utilize Learning to Rank models to distill out more functional knowledge from the encyclopedic information and then align the knowledge with a WordNet-like ontology. Finally as a demonstration, a Chinese semantic knowledge repository named JNet is constructed based on this framework. Experiments show that the proposed methods work well and the three steps reinforce each other towards a more powerful ontology.
Assessing sparse information extraction using semantic contexts BIBAFull-Text 1709-1714
  Peipei Li; Haixun Wang; Hongsong Li; Xindong Wu
One important assumption of information extraction is that extractions occurring more frequently are more likely to be correct. Sparse information extraction is challenging because no matter how big a corpus is, there are extractions supported by only a small amount of evidence in the corpus. A pioneering work known as REALM learns HMMs to model the context of a semantic relationship for assessing the extractions. This is quite costly and the semantics revealed for the context are not explicit. In this work, we introduce a lightweight, explicit semantic approach for sparse information extraction. We use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of semantic relationships. Experiments show that our approach improves the F-score of extraction by at least 11.2% over state-of-the-art, HMM based approaches while maintaining more efficiency.
Studying from electronic textbooks BIBAFull-Text 1715-1720
  Rakesh Agrawal; Sreenivas Gollapudi; Anitha Kannan; Krishnaram Kenthapadi
We present study navigator, an algorithmically-generated aid for enhancing the experience of studying from electronic textbooks. The study navigator for a section of the book consists of helpful concept references for understanding this section. Each concept reference is a pair consisting of a concept phrase explained elsewhere and the link to the section in which it has been explained. We propose a novel reader model for textbooks and an algorithm for generating the study navigator based on this model. We also present the results of an extensive user study that demonstrates the efficacy of the proposed system across textbooks on different subjects from different grades.
Generating informative snippet to maximize item visibility BIBAFull-Text 1721-1726
  Mahashweta Das; Habibur Rahman; Gautam Das; Vagelis Hristidis
The widespread use and growing popularity of online collaborative content sites has created rich resources for users to consult in order to make purchasing decisions on various items such as e-commerce products, restaurants, etc. Ideally, a user wants to quickly decide whether an item is desirable, from the list of items returned as a result of her search query. This has created new challenges for producers/manufacturers (e.g., Dell) or retailers (e.g., Amazon, eBay) of such items to compose succinct summarizations of web item descriptions, henceforth referred to as snippets, that are likely to maximize the items' visibility among users. We exploit the availability of user feedback in collaborative content sites in the form of tags to identify the most important item attributes that must be highlighted in an item snippet. We investigate the problem of finding the top-k best snippets for an item that are likely to maximize the probability that the user preference (available in the form of search query) is satisfied. Since a search query returns multiple relevant items, we also study the problem of finding the best diverse set of snippets for the items in order to maximize the probability of a user liking at least one of the top items. We develop an exact top-k algorithm for each of the problems and perform detailed experiments on synthetic and real data crawled from the web to demonstrate the utility of our problems and effectiveness of our solutions.
Assessing quality score of Wikipedia article using mutual evaluation of editors and texts BIBAFull-Text 1727-1732
  Yu Suzuki; Masatoshi Yoshikawa
In this paper, we propose a method for assessing quality scores of Wikipedia articles by mutually evaluating editors and texts. The survival-ratio-based approach is a major approach to assessing article quality. In this approach, when a text survives beyond multiple edits, the text is assessed as good quality, because poor quality texts have a high probability of being deleted by editors. However, many vandals, that is, low quality editors, frequently delete good quality texts, which improperly decreases the survival ratios of good quality texts. As a result, many good quality texts are unfairly assessed as poor quality. In our method, we consider the editor quality score when calculating the text quality score, and decrease the impact of vandals on text quality. Using this improvement, the accuracy of the text quality score should be improved. However, an inherent problem with this idea is that the editor quality scores are calculated from the text quality scores. To solve this problem, we mutually calculate the editor and text quality scores until they converge. In this paper, we prove that the text quality score converges. We conducted an experimental evaluation and confirmed that our proposed method could accurately assess the text quality scores.
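The mutual calculation described in the abstract above can be pictured as a fixed-point iteration. The sketch below is only an illustration under assumed update rules (a text's score is the quality-weighted share of editors who kept it; an editor's score is the mean score of the texts they authored). These formulas, the input layout, and the function name are hypothetical and are not taken from the paper, and this toy version carries no convergence guarantee.

```python
def mutual_scores(texts, max_iter=100, tol=1e-9):
    """Alternate between text and editor quality scores until they stabilise.

    texts maps a text id to a dict with keys "author", "kept_by", "deleted_by"
    (hypothetical layout). All editors start with a neutral score of 0.5.
    """
    editors = set()
    for t in texts.values():
        editors.add(t["author"])
        editors.update(t["kept_by"])
        editors.update(t["deleted_by"])
    editor_q = {e: 0.5 for e in editors}
    text_q = {}
    for _ in range(max_iter):
        # Text score: votes to keep, weighted by the voters' current quality.
        new_text_q = {}
        for tid, t in texts.items():
            kept = sum(editor_q[e] for e in t["kept_by"])
            deleted = sum(editor_q[e] for e in t["deleted_by"])
            total = kept + deleted
            new_text_q[tid] = kept / total if total > 0 else 0.5
        # Editor score: mean score of authored texts (unchanged if none).
        new_editor_q = {}
        for e in editors:
            authored = [new_text_q[tid] for tid, t in texts.items() if t["author"] == e]
            new_editor_q[e] = sum(authored) / len(authored) if authored else editor_q[e]
        delta = max((abs(new_editor_q[e] - editor_q[e]) for e in editors), default=0.0)
        editor_q, text_q = new_editor_q, new_text_q
        if delta < tol:
            break
    return text_q, editor_q

texts = {
    "a": {"author": "alice", "kept_by": ["bob"], "deleted_by": ["vandal"]},
    "b": {"author": "bob", "kept_by": ["alice"], "deleted_by": []},
    "v": {"author": "vandal", "kept_by": [], "deleted_by": ["alice", "bob"]},
}
text_q, editor_q = mutual_scores(texts)
```

In this toy input the deletion of text "a" by the low-scoring editor ends up carrying no weight, which is the effect the abstract describes.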
Concept-based analysis of scientific literature BIBAFull-Text 1733-1738
  Chen-Tse Tsai; Gourab Kundu; Dan Roth
This paper studies the importance of identifying and categorizing scientific concepts as a way to achieve a deeper understanding of the research literature of a scientific community. To reach this goal, we propose an unsupervised bootstrapping algorithm for identifying and categorizing mentions of concepts. We then propose a new clustering algorithm that uses citations' context as a way to cluster the extracted mentions into coherent concepts. Our evaluation of the algorithms against gold standards shows significant improvement over state-of-the-art results. More importantly, we analyze the computational linguistic literature using the proposed algorithms and show four different ways to summarize and understand the research community which are difficult to obtain using existing techniques.
On sampling the wisdom of crowds: random vs. expert sampling of the Twitter stream BIBAFull-Text 1739-1744
  Saptarshi Ghosh; Muhammad Bilal Zafar; Parantapa Bhattacharya; Naveen Sharma; Niloy Ganguly; Krishna Gummadi
Several applications today rely upon content streams crowd-sourced from online social networks. Since real-time processing of large amounts of data generated on these sites is difficult, analytics companies and researchers are increasingly resorting to sampling. In this paper, we investigate the crucial question of how to sample the data generated by users in social networks. The traditional method is to randomly sample all the data. We analyze a different sampling methodology, where content is gathered only from a relatively small subset (< 1%) of the user population namely, the expert users. Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the diversity, timeliness, and trustworthiness of the information contained within them, and find important differences between the datasets. Our observations have major implications for applications such as topical search, trustworthy content recommendations, and breaking news detection.
Can back-of-the-book indexes be automatically created? BIBAFull-Text 1745-1750
  Zhaohui Wu; Zhenhui Li; Prasenjit Mitra; C. Lee Giles
Automatic creation of back-of-the-book indexes remains one of the few manual tasks related to publishing. Inspired by how human indexers work on back-of-the-book index creation, we present a new domain-independent, corpus-free and training-free automation approach. Given a book, the index terms will be sequentially selected according to an indexability score encoded by the structure information residing in a book as well as a novel context-aware term informativeness measurement utilizing the power of a web knowledge base such as Wikipedia. By extensive experiments on books from various domains, we show our approach to be more effective and practical than ones that use previous keyword extraction and supervised learning.

IR track

Directing exploratory search with interactive intent modeling BIBAFull-Text 1759-1764
  Tuukka Ruotsalo; Jaakko Peltonen; Manuel Eugster; Dorota Glowacka; Ksenia Konyushkova; Kumaripaba Athukorala; Ilkka Kosunen; Aki Reijonen; Petri Myllymäki; Giulio Jacucci; Samuel Kaski
We introduce interactive intent modeling, where the user directs exploratory search by providing feedback for estimates of search intents. The estimated intents are visualized for interaction on an Intent Radar, a novel visual interface that organizes intents onto a radial layout where relevant intents are close to the center of the visualization and similar intents have similar angles. The user can give feedback on the visualized intents, from which the system learns and visualizes improved intent estimates. We systematically evaluated the effect of the interactive intent modeling in a mixed-method task-based information seeking setting with 30 users, where we compared two interface variants for interactive intent modeling, namely intent radar and a simpler list-based interface, to a conventional search system. The results show that interactive intent modeling significantly improves users' task performance and the quality of retrieved information.
FRec: a novel framework of recommending users and communities in social media BIBAFull-Text 1765-1770
  Lei Li; Wei Peng; Saurabh Kataria; Tong Sun; Tao Li
In this paper, we propose a framework of recommending users and communities in social media. Given a user's profile, our framework is capable of recommending influential users and topic-cohesive interactive communities that are most relevant to the given user. In our framework, we present a generative topic model to discover user-oriented and community-oriented topics simultaneously, which enables us to capture the exact topic interests of users, as well as the focuses of communities. Extensive evaluation on a data set obtained from Twitter has demonstrated the effectiveness of our proposed framework compared with other probabilistic topic model based recommendation methods.
Permutation indexing: fast approximate retrieval from large corpora BIBAFull-Text 1771-1776
  Maxim Gurevich; Tamás Sarlós
Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback -- query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching a few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index, the higher the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
Clustering-based transduction for learning a ranking model with limited human labels BIBAFull-Text 1777-1782
  Xin Zhang; Ben He; Tiejian Luo; Dongxing Li; Jungang Xu
Transductive learning is a semi-supervised learning paradigm that can leverage unlabeled data by creating pseudo labels for learning a ranking model, when there are only limited or no training examples available. However, the effectiveness of transductive learning in information retrieval (IR) can be hindered by low quality pseudo labels. To this end, we propose to incorporate a two-step k-means clustering algorithm to select the high quality training queries for generating the pseudo labels. In particular, the first step selects the high-quality queries for which the relevant documents are highly coherent as indicated by the clustering results. The second step then selects the initial training examples for the transductive learning that iteratively aggregates the pseudo examples. Finally, the learning to rank (LTR) algorithms are applied to learn the ranking model using the pseudo training examples created by the transductive learning process. Our proposed approach is particularly suitable for applications where there are few or no human labels available, as it does not necessarily involve the use of relevance assessment information or human effort. Experimental results on the standard TREC Tweets11 collection show that our proposed approach outperforms strong baselines, namely the conventional applications of learning to rank algorithms using human labels for the training and transductive learning using all the queries available.
Exploiting ranking factorization machines for microblog retrieval BIBAFull-Text 1783-1788
  Runwei Qiang; Feng Liang; Jianwu Yang
Learning to rank methods have been proposed for practical application in the field of information retrieval. When employing them in microblog retrieval, the significant interactions of the various features involved are rarely considered. In this paper, we propose a Ranking Factorization Machine (Ranking FM) model, which applies the Factorization Machine model to microblog ranking on the basis of pairwise classification. In this way, our proposed model combines the generality of the learning to rank framework with the advantages of factorization models in estimating interactions between features, leading to better retrieval performance. Moreover, three groups of features (content relevance features, semantic expansion features and quality features) and their interactions are utilized in the Ranking FM model with the methods of stochastic gradient descent and adaptive regularization for optimization. Experimental results demonstrate its superiority over several baseline systems on a real Twitter dataset in terms of P@30 and MAP metrics. Furthermore, it outperforms the best performing results in the TREC'12 Real-Time Search Task.
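For readers unfamiliar with factorization machines, the sketch below computes the standard second-order FM score, using the well-known identity that reduces the pairwise interaction term to a sum over latent factors, and then compares two candidates pairwise. The parameter values and the helper `prefers` are assumptions for illustration; training with stochastic gradient descent and adaptive regularization, as mentioned in the abstract above, is not shown.

```python
def fm_score(x, w0, w, V):
    """Second-order factorization machine score.

    x: feature vector, w0: bias, w: linear weights,
    V: one latent vector (list of k floats) per feature.
    The interaction term sum_{i<j} <V[i], V[j]> x[i] x[j] is computed as
    0.5 * sum_f ((sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2).
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    interactions = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        interactions += 0.5 * (s * s - s_sq)
    return linear + interactions

def prefers(x_a, x_b, w0, w, V):
    """Hypothetical pairwise decision: rank item a above item b if it scores higher."""
    return fm_score(x_a, w0, w, V) > fm_score(x_b, w0, w, V)

V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.0, 0.0, 0.0]
score = fm_score([1.0, 2.0, 3.0], 0.0, w, V)
```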
Learning compact hashing codes for efficient tag completion and prediction BIBAFull-Text 1789-1794
  Qifan Wang; Lingyun Ruan; Zhiwei Zhang; Luo Si
Tags have been popularly utilized in many applications with image and text data for better managing, organizing and searching for useful information. Tag completion provides missing tag information for a set of existing images or text documents while tag prediction recommends tag information for any new image or text document. Valuable prior research has focused on improving the accuracy of tag completion and prediction, but limited research has been conducted for the efficiency issue in tag completion and prediction, which is a critical problem in many large scale real world applications.
   This paper proposes a novel efficient Hashing approach for Tag Completion and Prediction (HashTCP). In particular, we construct compact hashing codes for both data examples and tags such that the observed tags are consistent with the constructed hashing codes and the similarities between data examples are also preserved. We then formulate the problem of learning binary hashing codes as a discrete optimization problem. An efficient coordinate descent method is developed as the optimization procedure for the relaxation problem. A novel binarization method based on orthogonal transformation is proposed to obtain the binary codes from the relaxed solution. Experimental results on four datasets demonstrate that the proposed approach can achieve accuracy similar to or even better than state-of-the-art methods and can be much more efficient, which is important for large scale applications.
How do users grow up along with search engines?: a study of long-term users' behavior BIBAFull-Text 1795-1800
  Jian Liu; Yiqun Liu; Min Zhang; Shaoping Ma
With a stronger reliance on search engines in our daily life, a large number of studies have investigated user behavior characteristics in Web search. However, previous studies mainly focus on large-scale query log data and analyze temporal changes based on all users without differentiating different user groups; few have really traced a fixed and long-term group of users and have distinguished the behavior of long-term users from ordinary users to analyze long-term temporal changes unbiasedly. In this paper we look into the interaction logs of these two user groups to analyze differences between these two user groups and to better understand how users grow up along with Web search engines. Statistical and experimental results show that there exist temporal changes of both user groups. There are also significant differences between these two user groups in the frequency of interaction, complexity of search tasks, and query formulation conventions. The findings have implications for how Web search engines should better support users' information seeking process by tackling complex search tasks and complicated query formulations.
LR-PPR: locality-sensitive, re-use promoting, approximate personalized pagerank computation BIBAFull-Text 1801-1806
  Jung Hyun Kim; K. Selçuk Candan; Maria Luisa Sapino
Personalized PageRank (PPR) based measures of node proximity have been shown to be highly effective in many prediction and recommendation applications. The use of personalized PageRank for large graphs, however, is difficult due to its high computation cost. In this paper, we propose a Locality-sensitive, Re-use promoting, approximate personalized PageRank (LR-PPR) algorithm for efficiently computing the PPR values relying on the localities of the given seed nodes on the graph: (a) The LR-PPR algorithm is locality sensitive in the sense that it reduces the computational cost of the PPR computation process by focusing on the local neighborhoods of the seed nodes. (b) LR-PPR is re-use promoting in that instead of performing a monolithic computation for the given seed node set using the entire graph, LR-PPR divides the work into localities of the seeds and caches the intermediary results obtained during the computation. These cached results are then reused for future queries sharing seed nodes. Experimental results for different data sets and under different scenarios show that the LR-PPR algorithm is highly efficient and accurate.
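For context, the sketch below shows the plain, monolithic personalized PageRank computation by power iteration over the whole graph, which is the costly baseline the abstract above sets out to avoid. It is not the LR-PPR algorithm. The restart probability, iteration count, and toy graph are assumptions, and every neighbor is assumed to appear as a key of the adjacency dictionary.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=100):
    """Power iteration for PPR with restart probability alpha.

    graph maps each node to a list of out-neighbors; seeds is a list of
    distinct seed nodes that share the restart mass uniformly.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = (1.0 - alpha) * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - alpha) * rank[n] / len(seeds)
        rank = nxt
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ppr = personalized_pagerank(graph, ["a"])
```

Each iteration touches every edge of the graph, which illustrates why restricting the work to the localities of the seed nodes and caching intermediate results, as the abstract proposes, can pay off on large graphs.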
Multimedia summarization for trending topics in microblogs BIBAFull-Text 1807-1812
  Jingwen Bian; Yang Yang; Tat-Seng Chua
Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of microblogs with multimedia contents and trending topics, it is desirable to provide visualized summarization to help users to quickly grasp the essence of topics. While existing works mostly focus on text-based methods only, summarization of multiple media types (e.g., text and image) is scarcely explored. In this paper, we propose a multimedia microblog summarization framework to automatically generate visualized summaries for trending topics. Specifically, a novel generative probabilistic model, termed multimodal-LDA (MMLDA), is proposed to discover subtopics from microblogs by exploring the correlations among different media types. Based on the information obtained from MMLDA, a multimedia summarizer is designed to separately identify representative textual and visual samples and then form a comprehensive visualized summary. We conduct extensive experiments on a real-world Sina Weibo microblog dataset to demonstrate the superiority of our proposed method against the state-of-the-art approaches.

Poster session: IR track

Semi-supervised discriminative preference elicitation for cold-start recommendation BIBAFull-Text 1813-1816
  Xi Zhang; Jian Cheng; Ting Yuan; Biao Niu; Hanqing Lu
Recommendation for cold users is fairly challenging because no prior rating can be used in preference prediction. To tackle this cold-start scenario, rating elicitation is usually employed through an initial interview in which users are queried with some carefully selected items. In this paper, we propose a novel framework to mine the most valuable items to construct the query set using a semi-supervised discriminative selection (SSDS) model. To learn a low dimensional representation for users in item space which can reflect their tastes to a large extent, the model incorporates category labels as discriminative information. To ensure that the labels used are reliable and that all users are considered, the model utilizes a semi-supervised scheme leveraging expert guidance with graph regularization. Experimental results on the real-world MovieLens dataset demonstrate that the proposed SSDS model outperforms traditional preference elicitation methods on top-N measures for cold-start recommendation.
Exploiting query term correlation for list caching in web search engines BIBAFull-Text 1817-1820
  Jiancong Tong; Gang Wang; Douglas S. Stones; Shizhao Sun; Xiaoguang Liu; Fan Zhang
Caching technologies have been widely employed to boost the performance of Web search engines. Motivated by the correlation between terms in query logs from a commercial search engine, we explore the idea of a caching scheme based on pairs of terms, rather than individual terms (which is the typical approach used by search engines today). We propose an inverted list caching policy, based on the Least Recently Used method, in which the co-occurring correlation between terms in the query stream is accounted for when deciding on which terms to keep in the cache. We consider not only the term co-occurrence within the same query but also the co-occurrence between separate queries. Experimental results show that the proposed approach can improve not only the cache hit ratio but also the overall throughput of the system when compared to existing list caching algorithms.
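The Least Recently Used policy that the abstract above takes as its starting point can be sketched as follows. This is only the plain per-term LRU baseline for inverted lists; the paper's adjustment for term co-occurrence is not reproduced here. The class name, the `load` callback, and the hit and miss counters are assumptions made for the example.

```python
from collections import OrderedDict

class LRUListCache:
    """Per-term LRU cache for inverted (posting) lists."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lists = OrderedDict()  # term -> posting list, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, term, load):
        """Return the posting list for term, calling load(term) on a miss."""
        if term in self.lists:
            self.lists.move_to_end(term)  # mark as most recently used
            self.hits += 1
            return self.lists[term]
        self.misses += 1
        postings = load(term)
        self.lists[term] = postings
        if len(self.lists) > self.capacity:
            self.lists.popitem(last=False)  # evict the least recently used term
        return postings

index = {"a": [1, 2], "b": [2, 3], "c": [3]}
cache = LRUListCache(capacity=2)
for term in ["a", "b", "a", "c", "b"]:
    cache.get(term, index.__getitem__)
```

A pair-aware policy along the lines of the abstract would change only the eviction decision, for example by protecting a term whose frequent query partner is still cached; that variant is left out because the abstract does not specify it.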
Speller performance prediction for query autocorrection BIBAFull-Text 1821-1824
  Alexey Baytin; Irina Galinskaya; Marina Panina; Pavel Serdyukov
A query speller is an indispensable part of any modern search engine. In this paper we define the problem of speller performance prediction and apply it to the task of query spelling autocorrection. As candidates for query autocorrection we used the suggestions generated by a query speller. To determine their reliability we used a binary classifier trained on manually labeled data. In addition to the basic standard lexical and statistical features we utilized a number of new click-based features, which allowed us to significantly outperform the algorithm trained on the basic set of features.
Predicting the impact of expansion terms using semantic and user interaction features BIBAFull-Text 1825-1828
  Anton Bakhtin; Yury Ustinovskiy; Pavel Serdyukov
Query expansion for Information Retrieval is a challenging task. On the one hand, low quality expansion may hurt either recall, due to vocabulary mismatch, or precision, due to topic drift, and therefore reduce user satisfaction. On the other hand, utilizing a large number of expansion terms for a query may easily lead to resource consumption overhead. As web search engines apply strict constraints on response time, it is essential to estimate the impact of each expansion term on query performance at the pre-retrieval time. Our experimental results confirm that a significant part of expansions do not improve query performance, and it is possible to detect such expansions at the pre-retrieval time.
QBEES: query by entity examples BIBAFull-Text 1829-1832
  Steffen Metzger; Ralf Schenkel; Marcin Sydow
Structured knowledge bases are an increasingly important way for storing and retrieving information. Within such knowledge bases, an important search task is finding similar entities based on one or more example entities. We present QBEES, a novel framework for defining entity similarity based only on structural features, so-called aspects, of the entities, that includes query-dependent and query-independent entity ranking components. We present evaluation results with a number of existing entity list completion benchmarks, comparing to several state-of-the-art baselines.
Learning to selectively rank patients' medical history BIBAFull-Text 1833-1836
  Nut Limsopatham; Craig Macdonald; Iadh Ounis
Two main approaches have emerged in the literature for the effective deployment of a search system to rank patients having a medical history relevant to a query. The first approach is to directly rank patients based on the relevance of their medical history, represented as a concatenation of their associated medical records. Instead, the second approach initially retrieves the relevant medical records of patients, and then ranks the patients based on the relevance of their retrieved medical records. However, these two approaches may be useful for different queries. In this work, we propose a novel supervised approach that can effectively identify when to use either of the two aforementioned patient ranking approaches to attain effective retrieval performance. In particular, our approach deploys a classifier to learn to select a ranking approach when ranking patients, by using query difficulty measures, such as query performance predictors and the number of medical concepts detected in a query, as learning features. We thoroughly evaluate our approach using the standard test collections provided by the TREC Medical Records track. Our results show significant improvements over existing strong baselines.
A belief propagation approach for detecting shilling attacks in collaborative filtering BIBAFull-Text 1837-1840
  Jun Zou; Faramarz Fekri
Recommender systems have been widely used in e-commerce websites to suggest items that meet users' preferences. Collaborative filtering, which is the most popular recommendation algorithm, is vulnerable to shilling attacks, where a group of spam users collaborate to manipulate the recommendations. Several attack detection algorithms have been developed to detect spam users and remove them from the system. However, the existing algorithms focus mostly on rating patterns of users. In this paper, we develop a probabilistic inference framework that further exploits the target items for attack detection. In addition, the user features can also be conveniently incorporated in this framework. We utilize the Belief Propagation (BP) algorithm to perform inference efficiently. Experimental results verify that the proposed algorithm significantly improves detection performance as the number of target items increases.
Automated snippet generation for online advertising BIBAFull-Text 1841-1844
  Stamatina Thomaidou; Ismini Lourentzou; Panagiotis Katsivelis-Perakis; Michalis Vazirgiannis
Products, services or brands can be advertised alongside the search results in major search engines, while recently smaller displays on devices like tablets and smartphones have imposed the need for smaller ad texts. In this paper, we propose a method that produces in an automated manner compact text ads (promotional text snippets), given as input a product description webpage (landing page). The challenge is to produce a small comprehensive ad while maintaining at the same time relevance, clarity, and attractiveness. Our method includes the following phases. Initially, it extracts relevant and important n-grams (keywords) given the landing page. The keywords reserved must have a positive meaning in order to have a call-to-action style, thus we attempt sentiment analysis on them. Next, we build an Advertising Language Model to evaluate phrases in terms of their marketing appeal. We experiment with two variations of our method and we show that they outperform all the baseline approaches.
Detecting controversy on the web BIBAFull-Text 1845-1848
  Shiri Dori-Hacohen; James Allan
A useful feature to facilitate critical literacy would alert users when they are reading a controversial web page. This requires solving a binary classification problem: does a given web page discuss a controversial topic? We explore the feasibility of solving the problem by treating it as supervised k-nearest-neighbor classification. Our approach (1) maps a webpage to a set of neighboring Wikipedia articles which were labeled on a controversiality metric; (2) coalesces those labels into an estimate of the webpage's controversiality; and finally (3) converts the estimate to a binary value using a threshold. We demonstrate the applicability of our approach by validating it on a set of webpages drawn from seed queries. We show absolute gains of 22% in F0.5 on our test set over a sentiment-based approach, highlighting that detecting controversy is more complex than simply detecting opinions.
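The three steps in the abstract above (map to labeled neighbors, coalesce, threshold) can be illustrated with a minimal sketch. It assumes the neighboring Wikipedia articles have already been retrieved and that each carries a controversy score in [0, 1]. The function names, the mean and max aggregators, and the default threshold are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Callable, Sequence


def aggregate_mean(scores: Sequence[float]) -> float:
    """Coalesce neighbor labels by averaging (one possible choice)."""
    return sum(scores) / len(scores) if scores else 0.0


def aggregate_max(scores: Sequence[float]) -> float:
    """Coalesce neighbor labels by taking the most controversial neighbor."""
    return max(scores) if scores else 0.0


def is_controversial(
    neighbor_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical default, would be tuned on held-out data
    aggregate: Callable[[Sequence[float]], float] = aggregate_mean,
) -> bool:
    """Steps (2) and (3): estimate the page's controversiality, then binarize it."""
    return aggregate(neighbor_scores) >= threshold
```

For example, `is_controversial([0.9, 0.2, 0.7])` averages the neighbor scores to roughly 0.6 and so flags the page, while swapping in `aggregate_max` makes a single highly controversial neighbor sufficient.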
Mining user interest from search tasks and annotations BIBAFull-Text 1849-1852
  Sampath Jayarathna; Atish Patra; Frank Shipman
Interactive web search involves selecting which documents to read further and locating the parts of the documents that are relevant to the user's current activity. In this paper, we introduce UIMaP: User Interest Modeling and Personalization, a search task based personal user interest model to support users' information gathering tasks. The novelty of our approach lies in the use of topic modeling to generate fine-grained models of user interest and visualizations that direct the user's attention to documents or parts of documents that match the user's inferred interests. User annotations are used to help generate personalized visualizations for the user's search tasks. Based on 1267 user annotations from 17 users, we show the performance comparisons of four different topic models: LDA+H, LDA+KL, LDA+JSD, and LDA+TopN.
Generating comparative summaries from reviews BIBAFull-Text 1853-1856
  Ruben Sipos; Thorsten Joachims
To facilitate direct comparisons between different products, we present an approach to constructing short and comparative summaries based on product reviews. In particular, the user can view automatically aligned pairs of snippets describing reviewers' opinions on different features (also selected automatically by our approach) for two selected products. We propose a submodular objective function that avoids redundancy, that is efficient to optimize, and that aligns the snippets into pairs. Snippets are chosen from product reviews and thus easy to obtain. In our experiments, we show that the method constructs qualitatively good summaries, and that it can be tuned via supervised learning.
Zero-shot video retrieval using content and concepts BIBAFull-Text 1857-1860
  Jeffrey Dalton; James Allan; Pranav Mirajkar
Recent research in video retrieval has been successful at finding videos when the query consists of tens or hundreds of sample relevant videos for training supervised models. Instead, we investigate unsupervised zero-shot retrieval where no training videos are provided: a query consists only of a text statement. For retrieval, we use text extracted from images in the videos, text recognized in the speech of its audio track, as well as automatically detected semantically meaningful visual video concepts identified with widely varying confidence in the videos. In this work we introduce a new method for automatically identifying relevant concepts given a text query using the Markov Random Field (MRF) retrieval framework. We use source expansion to build rich textual representations of semantic video concepts from large external sources such as the web. We find that concept-based retrieval significantly outperforms text based approaches in recall. Using an evaluation derived from the TRECVID MED'11 track, we present early results that an approach using multi-modal fusion can compensate for inadequacies in each modality, resulting in substantial effectiveness gains. With relevance feedback, our approach provides additional improvements of over 50%.
Diversified query expansion using conceptnet BIBAFull-Text 1861-1864
  Arbi Bouchoucha; Jing He; Jian-Yun Nie
Search result diversification (SRD) aims to select diverse documents from the search results in order to cover as many search intents as possible. A prerequisite is that the search results contain diverse documents. For this purpose, we investigate a new approach to SRD by diversifying the query. Expansion terms are selected from ConceptNet so as to cover as diverse aspects as possible. The experimental results on several TREC data sets show that our method can outperform the existing state-of-the-art approaches that do not diversify the query.
An empirical study of top-n recommendation for venture finance BIBAFull-Text 1865-1868
  Thomas Stone; Weinan Zhang; Xiaoxue Zhao
This paper concerns the task of top-N investment opportunity recommendation in the domain of venture finance. By venture finance, specifically, we are interested in the investment activity of venture capital (VC) firms and their investment partners. We have access to a dataset of recorded venture financings (i.e., investments) by VCs and their investment partners in private US companies. This research was undertaken in partnership with Correlation Ventures, a venture capital firm who are pioneering the use of predictive analytics in order to better inform investment decision making. This paper undertakes a detailed empirical study and data analysis then demonstrates the efficacy of recommender systems in this novel application domain.
Interest mining from user tweets BIBAFull-Text 1869-1872
  Thuy Vu; Victor Perez
We build a system to extract user interests from Twitter messages. Specifically, we extract interest candidates using linguistic patterns and rank them using four different keyphrase ranking techniques: TFIDF, TextRank, LDA-TextRank, and Relevance-Interestingness-Rank (RI-Rank). We also explore the complementary relation between TFIDF and TextRank in ranking interest candidates. Top ranked interests are evaluated with user feedback gathered from an online survey. The results show that TFIDF and TextRank are both suitable for extracting user interests from tweets. Moreover, the combination of TFIDF and TextRank consistently yields the highest user positive feedback.
An analysis of crowd workers mistakes for specific and complex relevance assessment task BIBAFull-Text 1873-1876
  Jesse Anderton; Maryam Bashir; Virgil Pavlu; Javed A. Aslam
The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments with the goal of replicating costly expert judgements with relatively fast, inexpensive, but less reliable judgements from anonymous online workers. The track used 10 "ad-hoc" queries, highly specific and complex (as compared to web search). The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets nowhere close to the expert judgements, we decided to analyze crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for nominal relevance grades for each document, and the other that asks for preferences on many (not all) pairs of documents.
Combining prestige and relevance ranking for personalized recommendation BIBAFull-Text 1877-1880
  Xiao Yang; Zhaoxin Zhang
In this paper, we present an adaptive graph-based personalized recommendation method based on combining prestige and relevance ranking. By utilizing the unique network structure of an n-partite heterogeneous graph, we attempt to address the problem of personalized recommendation in a two-layer ranking process with the help of a reasonable measure of high- and low-order relationships, obtained by analyzing the representation of the user's preference in the graph. With different initialization and surfing strategies, this graph-based ranking model can take different types of data into account to capture personal interests from multiple perspectives. The experiments show that this algorithm can achieve better performance than the traditional CF methods and some graph-based recommendation methods.
Strategies for setting time-to-live values in result caches BIBAFull-Text 1881-1884
  Fethi Burak Sazoglu; B. Barla Cambazoglu; Rifat Ozcan; Ismail Sengor Altingovde; Özgür Ulusoy
In web query result caching, the staleness of queries is often bounded via a time-to-live (TTL) mechanism, which expires the validity of cached query results at some point in time. In this work, we evaluate the performance of three alternative TTL mechanisms: time-based TTL, frequency-based TTL, and click-based TTL. Moreover, we propose hybrid approaches obtained by pair-wise combination of these mechanisms. Our results indicate that combining time-based TTL with frequency-based TTL yields better performance (i.e., lower stale query traffic and less redundant computation) than using a particular mechanism in isolation.
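A hybrid of time-based and frequency-based TTL, as described in the abstract above, might be sketched as follows. The abstract does not define the mechanisms in detail, so this sketch assumes that time-based TTL expires an entry after a fixed age and that frequency-based TTL expires it after it has been served a fixed number of times; the class name, the parameter names, and the rule "expire when either limit is reached" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheEntry:
    results: List[str]
    stored_at: float  # time at which the results were computed
    hits: int = 0     # how many times the entry has been served


class HybridTTLCache:
    """Result cache whose entries expire by age OR by number of hits (assumed rule)."""

    def __init__(self, max_age: float, max_hits: int) -> None:
        self.max_age = max_age
        self.max_hits = max_hits
        self._entries: Dict[str, CacheEntry] = {}

    def put(self, query: str, results: List[str], now: float) -> None:
        self._entries[query] = CacheEntry(results=results, stored_at=now)

    def get(self, query: str, now: float) -> Optional[List[str]]:
        entry = self._entries.get(query)
        if entry is None:
            return None
        too_old = (now - entry.stored_at) >= self.max_age   # time-based TTL
        too_frequent = entry.hits >= self.max_hits          # frequency-based TTL
        if too_old or too_frequent:
            del self._entries[query]  # caller must recompute at the backend
            return None
        entry.hits += 1
        return entry.results
```

Passing `now` explicitly keeps the sketch deterministic; a production cache would more likely read a monotonic clock internally.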
Learning to detect task boundaries of query session BIBAFull-Text 1885-1888
  Zhenzhong Zhang; Le Sun; Xianpei Han
To accomplish a search task and satisfy a single information need, users usually submit a series of queries to web search engines. It is useful for web search engines to detect the task boundaries in a series of successive queries. Traditional task boundary detection methods are based on time gaps and lexical comparisons, and often suffer from the vocabulary gap problem: topically related queries may not share any common words. In this paper we learn hidden topics from a query log and leverage them to resolve the vocabulary gap problem. Unlike other external knowledge resources, such as WordNet and Wikipedia, the hidden topics discovered from a query log cover long tail queries, which is useful for detecting task boundaries. Experimental results on a dataset from a real-world query log demonstrate that the proposed method achieves significant quality enhancement.
Early prediction on imbalanced multivariate time series BIBAFull-Text 1889-1892
  Guoliang He; Yong Duan; Tieyun Qian; Xu Chen
Multivariate time series (MTS) classification is an important topic in time series data mining, and many efficient models and techniques have been introduced to cope with it. However, early classification on imbalanced MTS data largely remains an open problem. To deal with this issue, we adopt a multiple under-sampling and dynamical subspace generation method to obtain initial training sets, and each training set is used to learn a base learner. Finally, an ensemble classifier is introduced for early classification on imbalanced MTS data. Experimental results show that our proposed methods can achieve effective early prediction on imbalanced MTS data.
Exploiting trustors as well as trustees in trust-based recommendation BIBAFull-Text 1893-1896
  Won-Seok Hwang; Shaoyu Li; Sang-Wook Kim; Ho Jin Choi
In a trust network, two users who are connected by a trust relationship tend to have similar interests. Based on this observation, existing trust-aware recommendation methods predict ratings for a target user on unseen items by referring to the ratings of those users who are reachable from the target user in the forward direction of the trustor-trustee relationship through the trust network. However, these methods have overlooked the possibility of utilizing the ratings of those users reachable in the backward direction, who may also have similar interests. In this paper, we investigate this possibility by identifying and adding these users to the existing methods when predicting ratings for the target user. We perform a series of experiments and observe that our approach improves the coverage while preserving the accuracy.
Through-the-looking glass: utilizing rich post-search trail statistics for web search BIBAFull-Text 1897-1900
  Alexey Tolstikov; Mikhail Shakhray; Gleb Gusev; Pavel Serdyukov
With the increasing popularity of browser toolbars, the challenge of employing user behavior data stored in their logs is growing in importance. The analysis of post-click search trails was shown to provide important knowledge about user experience, helpful for improving existing search systems. However, the utility of different trail properties for improving existing ranking models is still underexplored. We conduct a large-scale study and evaluation of a rich set of search trail features in realistic settings and conclude that a deeper investigation of a user's experience far beyond her click on the result page has the potential to improve the existing ranking models.
Topical authority propagation on microblogs BIBAFull-Text 1901-1904
  Juan Hu; Yi Fang; Archana Godavarthy
With a huge number of active users on microblogs, it becomes increasingly important to identify authoritative users on specific topics. This paper tackles the task of finding authorities on Twitter given any query topic. Although there exists much work on identifying influential users on Twitter, most of it focuses on global authority regardless of the topic. We propose a novel Topical Authority Propagation (TAP) model by utilizing the fact that topical authority can be propagated through retweeting, i.e., if a user's tweet on a given topic is retweeted by a topical authority, that user is likely to be an authority on the topic as well. Topical relevance of candidate authorities can be seamlessly integrated into the model. Link analysis algorithms such as PageRank can then be utilized to characterize how topical authority is propagated through retweeting. We conduct a set of experiments on Twitter and demonstrate the effectiveness of the proposed approach.
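The propagation idea in the abstract above can be illustrated with a small personalized-PageRank sketch over a retweet graph. This is not the TAP model itself: it assumes an edge from each retweeter to the author of the retweeted tweet, uses normalized topical relevance as the teleport distribution, and the function name, damping factor, and iteration count are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def topical_authority(
    retweets: Iterable[Tuple[str, str]],   # (retweeter, original_author) pairs on one topic
    relevance: Dict[str, float],           # assumed topical relevance score per user
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    edges = list(retweets)
    users = sorted(set(relevance) | {u for edge in edges for u in edge})

    # Teleport distribution: proportional to topical relevance, uniform as a fallback.
    total = sum(relevance.get(u, 0.0) for u in users)
    if total > 0:
        teleport = {u: relevance.get(u, 0.0) / total for u in users}
    else:
        teleport = {u: 1.0 / len(users) for u in users}

    # Authority flows from the retweeter to the author being retweeted.
    out: Dict[str, List[str]] = {u: [] for u in users}
    for retweeter, author in edges:
        out[retweeter].append(author)

    score = dict(teleport)
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) * teleport[u] for u in users}
        for u in users:
            if out[u]:
                share = damping * score[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Users who retweet nobody spread their mass via the teleport vector.
                for v in users:
                    nxt[v] += damping * score[u] * teleport[v]
        score = nxt
    return score
```

With this edge direction, a user whose tweets are retweeted by high-scoring users accumulates a high score, which matches the intuition stated in the abstract.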
The importance of being socially-savvy: quantifying the influence of social networks on microblog retrieval BIBAFull-Text 1905-1908
  Alexander Kotov; Eugene Agichtein
Social media users create virtual connections for various reasons: personal and professional. While significant research efforts have been spent on exploring the dynamics of creation of social network connections, little is known about how those connections influence the content generated by social media users. In this work, we quantitatively evaluate the influence of social networks on social media content providers. Additionally, we propose several document expansion methods, which leverage the content generated by the social networks of the authors of social media documents and compare their effectiveness. Experimental results on a large sample of Twitter data indicate that retrieval models discriminatively leveraging social network content for document expansion outperform both traditional, socially-unaware retrieval models and retrieval models that indiscriminatively utilize all social connections.
Flexible and dynamic compromises for effective recommendations BIBAFull-Text 1909-1912
  Saurabh Gupta; Sutanu Chakraborti
Conversational Recommendation mimics the kind of dialog that takes place between a customer and a shopkeeper, involving multiple interactions where the user can give feedback at every interaction, as opposed to Single Shot Retrieval, which corresponds to a scheme where the system retrieves a set of items in response to a user query in a single interaction. Compromise refers to a particular user preference which the recommender system failed to satisfy. But in the context of conversational systems, where the user's preferences keep on evolving as she interacts with the system, what constitutes a compromise for her also keeps on changing. Typically, in Single Shot retrieval, the notion of compromise is characterized by the assignment of a particular feature to a particular dominance group such as MIB (higher value is better) or LIB (lower value is better) and this assignment remains true for all the users who use the system.
   In this paper, we propose a way to realize the notion of compromise in a conversational setting. Our approach, Flexi-Comp, introduces the notion of dynamically assigning a feature to two dominance groups simultaneously which is then used to redefine the notion of compromise. We show experimentally that a utility function based on this notion of compromise outperforms the existing conversational recommenders in terms of recommendation efficiency.

Industry session

The online revolution: education for everyone BIBAFull-Text 1913-1914
  Andrew Ng
In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed. Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback, using both auto-grading and peer-grading, and a rich peer-to-peer interaction around the course materials. Currently, Coursera has over 80 university and other partners, and 4 million students enrolled in its nearly 400 courses. These courses span a range of topics including computer science, business, medicine, science, humanities, social sciences, and more. In this talk, I'll report on this far-reaching experiment in education, and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, as well as a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.
Online learning from streaming data BIBAFull-Text 1915-1916
  Jeff Hawkins
High velocity machine-generated data is growing rapidly. To act on this data in real time requires models that learn continuously and discover the temporal patterns in noisy data streams. The brain is also an online learning system that builds models from streaming data. In this talk I will describe recent advances in brain theory and how we have applied those advances to machine-generated streaming data.
   At the heart of our work are new insights into how layers of cells in the neocortex infer and make predictions from fast changing sensory data. This theory, called the Cortical Learning Algorithm, has been tested extensively. We have embedded these learning algorithms into a product called Grok which is being applied to numerous problems such as energy load forecasting and anomaly detection. I will give an introduction to the Cortical Learning Algorithm including how it uses sparse distributed representations and then show how Grok makes predictions and detects anomalies in streaming data.
   The Cortical Learning Algorithm is now an open source project (www.numenta.org) and I will give a brief introduction to the project.
From big data to big knowledge BIBAFull-Text 1917-1918
  Kevin Murphy
We are drowning in big data, but a lot of it is hard to interpret. For example, Google indexes about 40B webpages, but these are just represented as bags of words, which don't mean much to a computer. To get from "strings to things", Google introduced the Knowledge Graph (KG), which is a database of facts about entities (people, places, movies, etc.) and their relations (nationality, geo-containment, actor roles, etc.). KG is based on Freebase, but supplements it with various other structured data sources. Although KG is very large (about 500M nodes/entities, and 30B edges/relations), it is still very incomplete. For example, 94% of the people are missing their place of birth, and 78% have no known nationality -- these are examples of missing links in the graph. In addition, we are missing many nodes (corresponding to new entities), as well as new types of nodes and edges (corresponding to extensions to the schema). In this talk, I will survey some of the efforts we are engaged in to try to "grow" KG automatically using machine learning methods. In particular, I will summarize our work on the problems of entity linkage, relation extraction, and link prediction, using data extracted from natural language text as well as tabular data found on the web.

DB track: miscellaneous

"All roads lead to Rome": optimistic recovery for distributed iterative data processing BIBAFull-Text 1919-1928
  Sebastian Schelter; Stephan Ewen; Kostas Tzoumas; Volker Markl
Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point.
   We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance.
   We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs.
   In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures, our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In the presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.
Optimizing plurality for human intelligence tasks BIBAFull-Text 1929-1938
  Luyi Mo; Reynold Cheng; Ben Kao; Xuan S. Yang; Chenghui Ren; Siyu Lei; David W. Cheung; Eric Lo
In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may have different capabilities, some difficult HITs may not be satisfactorily performed by a single worker. If more workers are employed to perform a HIT, the quality of the HIT's answer could be statistically improved. Given a set of HITs and a fixed "budget", we address the important problem of determining the number of workers (or plurality) of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT's answer increases monotonically at a decreasing rate with its plurality. We show that for HITs that satisfy the two properties (e.g., multiple-choice-question HITs), the PAP is approximable. We propose an efficient greedy algorithm for such cases. We conduct extensive experiments on synthetic and real datasets to evaluate our algorithms. Our experiments show that our greedy algorithm provides close-to-optimal solutions in practice.
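A greedy allocation of the kind mentioned in the abstract above can be sketched as follows: repeatedly assign the next worker to the HIT with the largest marginal quality gain. The sketch assumes every worker costs one budget unit and that each HIT exposes a quality function of its plurality that is monotone with diminishing returns. The quality model `1 - (1 - p)^k` and all names are hypothetical stand-ins, not the paper's definitions.

```python
import heapq
from typing import Callable, List, Sequence


def make_quality(p: float) -> Callable[[int], float]:
    """Hypothetical quality model: chance that at least one of k workers is right."""
    return lambda k: 1.0 - (1.0 - p) ** k


def greedy_plurality(
    quality_fns: Sequence[Callable[[int], float]],
    budget: int,  # assumed to be counted in worker assignments, one unit each
) -> List[int]:
    plurality = [0] * len(quality_fns)
    # Max-heap (via negated gains) of the marginal gain of adding one worker to each HIT.
    heap = [(-(q(1) - q(0)), i) for i, q in enumerate(quality_fns)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        plurality[i] += 1
        k = plurality[i]
        q = quality_fns[i]
        # Diminishing returns mean the next gain for this HIT is no larger than the last.
        heapq.heappush(heap, (-(q(k + 1) - q(k)), i))
    return plurality
```

With an easy HIT (`p = 0.9`) and a hard one (`p = 0.5`) and a budget of four workers, the easy HIT receives one worker and the hard HIT three, because after the first assignment the hard HIT's marginal gains stay larger.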
Entropy-based histograms for selectivity estimation BIBAFull-Text 1939-1948
  Hien To; Kuorong Chiang; Cyrus Shahabi
Histograms have been extensively used for selectivity estimation by academics and have successfully been adopted by the database industry. However, the estimation error is usually large for skewed distributions and biased attributes, which are typical in real-world data. Therefore, we propose effective models to quantitatively measure bias and selectivity based on information entropy. These models, together with the principle of maximum entropy, are then used to develop a class of entropy-based histograms. Moreover, since entropy can be computed incrementally, we present incremental variations of our algorithms that reduce the complexity of histogram construction from quadratic to linear. We conducted an extensive set of experiments with both synthetic and real-world datasets to compare the accuracy and efficiency of our proposed techniques with many other histogram-based techniques, showing the superiority of the entropy-based approaches for both equality and range queries.
Efficient two-party private blocking based on sorted nearest neighborhood clustering BIBAFull-Text 1949-1958
  Dinusha Vatsalan; Peter Christen; Vassilios S. Verykios
Integrating data from diverse sources with the aim to identify similar records that refer to the same real-world entities without compromising privacy of these entities is an emerging research problem in various domains. This problem is known as privacy-preserving record linkage (PPRL). Scalability of PPRL is a main challenge due to growing data size in real-world applications. Private blocking techniques have been used in PPRL to address this challenge by reducing the number of record pair comparisons that need to be conducted. Many of these private blocking techniques require a trusted third party to perform the blocking. One main threat with three-party solutions is the collusion between parties to identify the private data of another party.
   We introduce a novel two-party private blocking technique for PPRL based on sorted nearest neighborhood clustering. Privacy is addressed by a combination of the privacy techniques k-anonymous clustering and public reference values. Experiments conducted on two real-world databases validate that our approach is scalable to large databases and effective in generating candidate record pairs that correspond to true matches, while preserving k-anonymous privacy characteristics. Our approach also performs as well as or better than three other state-of-the-art private blocking techniques in terms of scalability, blocking quality, and privacy. It can achieve private blocking up to two orders of magnitude faster than other state-of-the-art private blocking approaches.
Context-aware top-K processing using views BIBAFull-Text 1959-1968
  Silviu Maniu; Bogdan Cautis
Search applications where queries are dependent on their context are becoming increasingly relevant in today's online applications. For example, the context may be the location of the user in location-aware search or the social network of the query initiator in social-aware search. Processing such queries efficiently is inherently difficult, and requires techniques that go beyond the existing, context-agnostic ones. A promising direction for efficient, online answering -- especially in the case of top-k queries -- is to materialize and exploit previous query results (views).
   We consider context-aware query optimization based on views, focusing on two important sub-problems. First, handling the possible differences in context between the various views and an input query leads to view results having uncertain scores, i.e., score ranges valid for the new context. As a consequence, current top-k algorithms are no longer directly applicable and need to be adapted to handle such uncertainty in object scores. Second, adapted view selection techniques are needed, which can leverage both the descriptions of queries and statistics over their results. We present algorithms that address these two problems, and illustrate their practical use in two important application scenarios: location-aware search and social-aware search. We validate our approaches via extensive experiments, using both synthetic and real-world datasets.
Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis BIBAFull-Text 1969-1978
  Hongya Wang; Jiao Cao; LihChyun Shu; Davood Rafiei
Locality Sensitive Hashing (LSH) is widely recognized as one of the most promising approaches to similarity search in high-dimensional spaces. Based on LSH, a considerable number of nearest neighbor search algorithms have been proposed in the past, with some of them having been used in many real-life applications. Apart from their demonstrated superior performance in practice, the popularity of the LSH algorithms is mainly due to their provable performance bounds on query cost, space consumption and failure probability.
   In this paper, we show that a surprising gap exists between the LSH theory and widely practiced algorithm analysis techniques. In particular, we discover that a critical assumption made in the classical LSH algorithm analysis does not hold in practice, which suggests that using the existing methods to analyze the performance of practical LSH algorithms is a conceptual mismatch. To address this problem, a novel analysis model is developed that bridges the gap between the LSH theory and the method for analyzing the LSH algorithm performance. With the help of this model, we identify some important flaws in the commonly used analysis methods in the LSH literature. The validity of this model is verified through extensive experiments with real datasets.

IR track: users

Personalization of web-search using short-term browsing context BIBAFull-Text 1979-1988
  Yury Ustinovskiy; Pavel Serdyukov
Search and browsing activity is known to be a valuable source of information about a user's search intent. It is extensively utilized by most modern search engines to improve ranking by constructing certain ranking features as well as by personalizing search. Personalization aims at two major goals: extraction of stable preferences of a user and specification and disambiguation of the current query. The common way to approach these problems is to extract information from the user's long-term search and browsing history and to utilize short-term history to determine the context of a given query. Personalization of the web search for the first queries in new search sessions of new users is more difficult due to the lack of both long- and short-term data.
   In this paper we study the problem of short-term personalization. To be more precise, we restrict our attention to the set of initial queries of search sessions. These, with the lack of contextual information, are known to be the most challenging for short-term personalization and are not covered by previous studies on the subject. To approach this problem in the absence of the search context, we employ short-term browsing context. We apply a widespread framework for personalization of search results based on the re-ranking approach and evaluate our methods on large-scale data. The proposed methods are shown to significantly improve the non-personalized ranking of one of the major commercial search engines. To the best of our knowledge this is the first study addressing the problem of short-term personalization based on recent browsing history. We find that the performance of this re-ranking approach can be reasonably predicted given a query. When we restrict the use of our method to the queries with the largest expected gain, the resulting benefit of personalization increases significantly.
Factors affecting aggregated search coherence and search behavior BIBAFull-Text 1989-1998
  Jaime Arguello; Robert Capra; Wan-Ching Wu
Aggregated search is the task of incorporating results from different search services, or verticals, into the web search results. Aggregated search coherence refers to the extent to which results from different sources focus on similar senses of a given query. Prior research investigated aggregated search coherence between images and web results. A user study showed that users are more likely to interact with the web results when the images are more consistent with the intended query-sense. We build upon this work and address three outstanding research questions about aggregated search coherence: (1) Does the same "spill-over" effect generalize to other verticals besides images? (2) Is the effect stronger when the vertical results include image thumbnails? and (3) What factors influence if and when a spill-over occurs from a user's perspective? We investigate these questions using a large-scale crowdsourcing study and a smaller-scale laboratory study. Results suggest that the spill-over effect occurs for some verticals (images, shopping, video), but not others (news), and that including thumbnails in the vertical results has little effect. Qualitative data from our laboratory study provides insights about participants' actions and thought-processes when faced with (in)coherent results.
Improving passage ranking with user behavior information BIBAFull-Text 1999-2008
  Weize Kong; Elif Aktolga; James Allan
User behavior information has proved valuable for inferring document relevance, but its role in deducing relevance at the passage/section level is not well explored. In this paper, we study how user behavior information implies section relevance, and use this information to improve section ranking. More specifically, we focus on four types of user search behavior that occur while browsing a document -- dwell time, highlighting, copying and clicks at the section level. Experimental results based on a commercial query log show that user behavior information can significantly improve section ranking. While section-level click information is a very powerful signal of relevance, it depends on an interface supporting section-level links. We find comparable levels of gain using other behavior information that does not depend upon such an interface.
Personalized models of search satisfaction BIBAFull-Text 2009-2018
  Ahmed Hassan; Ryen W. White
Search engines need to model user satisfaction to improve their services. Since it is not practical to request feedback on searchers' perceptions and search outcomes directly from users, search engines must estimate satisfaction from behavioral signals such as query refinement, result clicks, and dwell times. This analysis of behavior in the aggregate leads to the development of global metrics such as satisfied result clickthrough (typically operationalized as result-page clicks with dwell time exceeding a particular threshold) that are then applied to all searchers' behavior to estimate satisfaction levels. However, satisfaction is a personal belief and how users behave when they are satisfied can also differ. In this paper we verify that searcher behavior when satisfied and dissatisfied is indeed different among individual searchers along a number of dimensions. As a result, we introduce and evaluate learned models of satisfaction for individual searchers and searcher cohorts. Through experimentation via logs from a large commercial Web search engine, we show that our proposed models can predict search satisfaction more accurately than a global baseline that applies the same satisfaction model across all users. Our findings have implications for the study and application of user satisfaction in search systems.
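The contrast this abstract draws between a single global satisfaction rule and per-searcher models can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 30-second global threshold, the `(user, dwell_seconds, satisfied)` record layout, and the midpoint-of-medians rule for fitting a per-user threshold are hypothetical and are not the models evaluated in the paper.

```python
from collections import defaultdict
from statistics import median

# Assumed global dwell-time threshold in seconds (not taken from the paper).
GLOBAL_THRESHOLD = 30.0


def fit_user_thresholds(labeled_clicks, fallback=GLOBAL_THRESHOLD):
    """Fit one dwell-time threshold per user from labeled clicks.

    labeled_clicks: iterable of (user, dwell_seconds, satisfied) tuples.
    The threshold is the midpoint between the user's median satisfied and
    median dissatisfied dwell times (a hypothetical rule for illustration).
    """
    sat, dsat = defaultdict(list), defaultdict(list)
    for user, dwell, satisfied in labeled_clicks:
        (sat if satisfied else dsat)[user].append(dwell)

    thresholds = {}
    for user in set(sat) | set(dsat):
        if sat[user] and dsat[user]:
            thresholds[user] = (median(sat[user]) + median(dsat[user])) / 2
        else:
            # Not enough evidence for this user: fall back to the global rule.
            thresholds[user] = fallback
    return thresholds


def is_satisfied(user, dwell, thresholds, fallback=GLOBAL_THRESHOLD):
    """Classify a click as satisfied using the user's own threshold if known."""
    return dwell >= thresholds.get(user, fallback)


clicks = [
    ("fast", 5, False), ("fast", 15, True), ("fast", 20, True),
    ("slow", 40, False), ("slow", 90, True),
]
thresholds = fit_user_thresholds(clicks)
```

In this toy data a 20-second click by the quick reader counts as satisfied and a 45-second click by the slow reader does not, while the global rule would label both the other way around.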
Beyond clicks: query reformulation as a predictor of search satisfaction BIBAFull-Text 2019-2028
  Ahmed Hassan; Xiaolin Shi; Nick Craswell; Bill Ramsey
To understand whether a user is satisfied with the current search results, implicit behavior is a useful data source, with clicks being the best-known implicit signal. However, it is possible for a non-clicking user to be satisfied and a clicking user to be dissatisfied. Here we study additional implicit signals based on the relationship between the user's current query and the next query, such as their textual similarity and the inter-query time. Using a large unlabeled dataset, a labeled dataset of queries and a labeled dataset of user tasks, we analyze the relationship between these signals. We identify an easily-implemented rule that indicates dissatisfaction: that a similar query issued within a time interval that is short enough (such as five minutes) implies dissatisfaction. By incorporating additional query-based features in the model, we show that a query-based model (with no click information) can indicate satisfaction more accurately than click-based models. The best model uses both query and click features. In addition, by comparing query sequences in successful tasks and unsuccessful tasks, we observe that search success is an incremental process for successful tasks with multiple queries.
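The dissatisfaction rule stated in this abstract (a similar query issued within a short interval, such as five minutes) might be sketched as follows. The abstract does not say how similarity is measured, so the word-level Jaccard measure and the 0.5 cutoff are assumptions chosen only for illustration; the 300-second default mirrors the five-minute example.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two queries (assumed measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def looks_dissatisfied(query, next_query, gap_seconds,
                       min_similarity=0.5, max_gap_seconds=300):
    """Flag `query` as likely unsatisfied if a similar query follows quickly.

    min_similarity is a hypothetical cutoff; max_gap_seconds defaults to the
    five-minute interval mentioned as an example in the abstract.
    """
    return (gap_seconds <= max_gap_seconds
            and jaccard(query, next_query) >= min_similarity)
```

For example, `looks_dissatisfied("cheap flights paris", "cheap flights to paris", 40)` is true under these assumptions, while the same pair 15 minutes apart, or an unrelated follow-up query, is not flagged.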
Unsupervised identification of synonymous query intent templates for attribute intents BIBAFull-Text 2029-2038
  Yanen Li; Bo-June Paul Hsu; ChengXiang Zhai
Among all web search queries there is an important subset of queries containing entity mentions. In these queries, it is observed that users are most interested in requesting some attribute of an entity, such as "Obama age" for the intent of age, which we refer to as the attribute intent. In this work we address the problem of identifying synonymous query intent templates for the attribute intent. For example, "how old is [Person]" and "[Person]'s age" are both synonymous templates for the age intent. Successful identification of the synonymous query intent templates not only can improve the performance of all existing query annotation approaches, but also could benefit applications such as instant answers and intent-based query suggestion. In this work we propose a clustering framework with multiple kernel functions to identify synonymous query intent templates for a set of canonical templates jointly. Furthermore, signals from multiple sources of information are integrated into a kernel function between templates, where the weights of these signals are tuned in an unsupervised manner. We have conducted extensive experiments across multiple domains in FreeBase, and results demonstrate the effectiveness of our clustering framework for finding synonymous query intent templates for attribute intents.

KM track: extraction and text mining

Toward advice mining: conditional random fields for extracting advice-revealing text units BIBAFull-Text 2039-2048
  Alfan Farizki Wicaksono; Sung-Hyon Myaeng
Web forums are platforms for personal communication and for sharing information with others. Such information is often expressed in the form of advice. In this paper, we address the problem of advice-revealing text unit (ATU) extraction from online forums due to its usefulness in the travel domain. We represent advice as a two-tuple comprising an advice-revealing sentence and its context sentences. To extract the advice-revealing sentences, we propose to define the task as a sequence labeling problem, using three different types of features: syntactic, contextual, and semantic features. To extract the context sentences, we propose to use a 2 Dimensional CRF (2D-CRF) model, which gives the best performance compared to traditional machine learning models. Finally, we present a solution to the integrated problem of extracting both advice-revealing sentences and their respective context sentences at the same time using our proposed models, i.e., Multiple Linear CRF (ML-CRF) and 2 Dimensional CRF Plus (2D-CRF+). The experimental results show that ML-CRF performs better than any other model studied in this paper for extracting advice-revealing sentences and context sentences.
Information extraction as a filtering task BIBAFull-Text 2049-2058
  Henning Wachsmuth; Benno Stein; Gregor Engels
Information extraction is usually approached as an annotation task: Input texts run through several analysis steps of an extraction process in which different semantic concepts are annotated and matched against the slots of templates. We argue that such an approach lacks an efficient control of the input of the analysis steps. In this paper, we hence propose and evaluate a model and a formal approach that consistently put the filtering view in the focus: Before spending annotation effort, filter those portions of the input texts that may contain relevant information for filling a template and discard the others. We model all dependencies between the semantic concepts sought for with a truth maintenance system, which then efficiently infers the portions of text to be annotated in each analysis step. The filtering view enables an information extraction system (1) to annotate only relevant portions of input texts and (2) to easily trade its run-time efficiency for its recall. We provide our approach as an open-source extension of Apache UIMA and we show the potential of our approach in a number of experiments.
Web news extraction via path ratios BIBAFull-Text 2059-2068
  Gongqing Wu; Li Li; Xuegang Hu; Xindong Wu
In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links, etc. These non-news items not only exist outside the news region, but are also present in the news content region. Effectively extracting the news content and filtering the noise have important effects on the follow-up activities of content management and analysis. Our extensive case studies have indicated that there exists potential relevance between web content layouts and their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes: Text to tag Path Ratio (TPR) and Extended Text to tag Path Ratio (ETPR), and describe the calculation process of TPR by traversing the parsing tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR) -- a fast, accurate and general on-line method for effectively distinguishing news content from non-news content using the TPR/ETPR histogram. In order to improve the ability of CEPR in extracting short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This approach can enhance the importance of internal-link nodes but ignore noise nodes existing in news content. Experimental results on the CleanEval datasets and web news pages randomly selected from well-known websites show that CEPR can extract across multiple resources, styles, and languages. The average F and average score with CEPR are 8.69% and 14.25% higher, respectively, than those of CETR, which demonstrates better web news extraction performance than most existing methods.
Lead-lag analysis via sparse co-projection in correlated text streams BIBAFull-Text 2069-2078
  Fangzhao Wu; Yangqiu Song; Shixia Liu; Yongfeng Huang; Zhenyu Liu
Correlated topical trend detection is very useful in analyzing public and social media influence. In this paper, we propose an algorithm that can both detect the correlation and discover the corresponding keywords that trigger the correlation. To detect the correlation, we use a projection vector to project two text streams onto the same space, and then use a least square cost function to regress one text stream over the other with different time lags. To extract the corresponding keywords, we impose the non-negative sparsity constraints over the projection parameters. In addition, we present an accelerated algorithm based on Nesterov's method to efficiently solve the optimization problem. In our experiments, we use both synthetic and real data sets to demonstrate the advantages and capabilities of the proposed algorithm over CCA on the follower link prediction problem.
Adaptive co-training SVM for sentiment classification on tweets BIBAFull-Text 2079-2088
  Shenghua Liu; Fuxin Li; Fangtao Li; Xueqi Cheng; Huawei Shen
Sentiment classification is an important problem in tweet mining. Twitter lacks labeled data and a rating mechanism for generating them. Moreover, topics in Twitter are more diverse, while sentiment classifiers are usually dedicated to a specific domain or topic. It is thus a challenge to make sentiment classification adaptive to diverse topics without sufficient labeled data. We therefore formally propose an adaptive multiclass SVM model which transfers an initial common sentiment classifier to a topic-adaptive one. To tackle tweet sparsity, non-text features are explored in addition to the conventional text features, and the features are intuitively split into two views. An iterative algorithm is proposed for solving this model by alternating among three steps: optimization, unlabeled data selection, and adaptive feature expansion. The algorithm alternately minimizes the margins of two independent objectives on different views to learn coefficient matrices, which are collaboratively used to select unlabeled tweets from the topic that the algorithm is adapting to. Topic-adaptive sentiment words are then expanded based on this selection, which in turn helps the first two steps find more confident unlabeled tweets and boosts the final performance. Compared with well-known supervised sentiment classifiers and semi-supervised approaches, our algorithm achieves promising increases in average accuracy on the 6 topics from a public tweet corpus.
On handling textual errors in latent document modeling BIBAFull-Text 2089-2098
  Tao Yang; Dongwon Lee
As large-scale text data become available on the Web, textual errors in a corpus are often inevitable (e.g., digitizing historic documents). Due to the calculation of frequencies of words, however, such textual errors can significantly impact the accuracy of statistical models such as the popular Latent Dirichlet Allocation (LDA) model. To address such an issue, in this paper, we propose two novel extensions to LDA (i.e., TE-LDA and TDE-LDA): (1) The TE-LDA model incorporates textual errors into the term generation process; and (2) The TDE-LDA model extends TE-LDA further by taking into account topic dependency to leverage semantic connections among consecutive words even if parts are typos. Using both real and synthetic data sets with varying degrees of "errors", our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic).

KM track: community and web mining

Overlapping community detection using seed set expansion BIBAFull-Text 2099-2108
  Joyce Jiyoung Whang; David F. Gleich; Inderjit S. Dhillon
Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. One of the most successful techniques for finding overlapping communities is based on local optimization and expansion of a community metric around a seed set of vertices. In this paper, we propose an efficient overlapping community detection algorithm using a seed set expansion approach. In particular, we develop new seeding strategies for a personalized PageRank scheme that optimizes the conductance community score. The key idea of our algorithm is to find good seeds, and then expand these seed sets using the personalized PageRank clustering procedure. Experimental results show that this seed set expansion approach outperforms other state-of-the-art overlapping community detection methods. We also show that our new seeding strategies are better than previous strategies, and are thus effective in finding good overlapping clusters in a graph.
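The expansion step named in this abstract (growing a seed set with personalized PageRank and scoring candidates by conductance) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses plain power iteration, takes the seed set as given rather than applying the paper's seeding strategies, and assumes an undirected graph stored as a hypothetical adjacency dict `adj` with no isolated vertices.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for PageRank whose restarts go to the seed set."""
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * restart[v] for v in adj}
        for u, nbrs in adj.items():
            share = alpha * rank[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        rank = nxt
    return rank


def conductance(adj, community):
    """Cut edges divided by the smaller of the two side volumes."""
    cut = sum(1 for u in community for v in adj[u] if v not in community)
    vol_in = sum(len(adj[u]) for u in community)
    vol_out = sum(len(nbrs) for nbrs in adj.values()) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 1.0


def expand_seed(adj, seeds, alpha=0.85):
    """Sweep over vertices by degree-normalized rank; keep the best prefix."""
    rank = personalized_pagerank(adj, seeds, alpha)
    order = sorted(adj, key=lambda v: rank[v] / len(adj[v]), reverse=True)
    best, best_phi = None, float("inf")
    prefix = set()
    for v in order[:-1]:  # skip the trivial prefix containing every vertex
        prefix.add(v)
        phi = conductance(adj, prefix)
        if phi < best_phi:
            best, best_phi = set(prefix), phi
    return best, best_phi


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community, phi = expand_seed(adj, {0})
```

On this toy graph the sweep from seed vertex 0 recovers the first triangle, whose only outgoing edge is the bridge. Running the expansion from several seeds and keeping all resulting sets is what allows communities to overlap.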
TODMIS: mining communities from trajectories BIBAFull-Text 2109-2118
  Siyuan Liu; Shuhui Wang; Kasthuri Jayarajah; Archan Misra; Ramayya Krishnan
Existing algorithms for trajectory-based clustering usually rely on simplex representation and a single proximity-related distance (or similarity) measure. Consequently, additional information markers (e.g., social interactions or the semantics of the spatial layout) are usually ignored, leading to the inability to fully discover the communities in the trajectory database. This is especially true for human-generated trajectories, where additional fine-grained markers (e.g., movement velocity at certain locations, or the sequence of semantic spaces visited) can help capture latent relationships between cluster members. To address this limitation, we propose TODMIS: a general framework for Trajectory cOmmunity Discovery using Multiple Information Sources. TODMIS combines additional information with raw trajectory data and creates multiple similarity metrics. In our proposed approach, we first develop a novel approach for computing semantic level similarity by constructing a Markov Random Walk model from the semantically-labeled trajectory data, and then measuring similarity at the distribution level. In addition, we also extract and compute pair-wise similarity measures related to three additional markers, namely trajectory level spatial alignment (proximity), temporal patterns and multi-scale velocity statistics. Finally, after creating a single similarity metric from the weighted combination of these multiple measures, we apply dense sub-graph detection to discover the set of distinct communities. We evaluated TODMIS extensively using traces of (i) student movement data in a campus, (ii) customer trajectories in a shopping mall, and (iii) city-scale taxi movement data. Experimental results demonstrate that TODMIS correctly and efficiently discovers the real grouping behaviors in these diverse settings.
Archiving the relaxed consistency web BIBAFull-Text 2119-2128
  Zhiwu Xie; Herbert Van de Sompel; Jinyang Liu; Johann van Reenen; Ramiro Jordan
The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with a high Internet penetration rate have established high-profile archiving initiatives to crawl and archive the fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of the relaxed consistency web design on crawler-driven web archiving. Relaxed consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in the web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed consistency web archive may contain observable inconsistency, and the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies.
Programming with personalized pagerank: a locally groundable first-order probabilistic logic BIBAFull-Text 2129-2138
  William Yang Wang; Kathryn Mazaitis; William W. Cohen
Many information-management tasks (including classification, retrieval, information extraction, and information integration) can be formalized as inference in an appropriate probabilistic first-order logic. However, most probabilistic first-order logics are not efficient enough for realistically-sized instances of these tasks. One key problem is that queries are typically answered by "grounding" the query -- i.e., mapping it to a propositional representation, and then performing propositional inference -- and with a large database of facts, groundings can be very large, making inference and learning computationally expensive. Here we present a first-order probabilistic language which is well-suited to approximate "local" grounding: in particular, every query Q can be approximately grounded with a small graph. The language is an extension of stochastic logic programs where inference is performed by a variant of personalized PageRank. Experimentally, we show that the approach performs well on an entity resolution task, a classification task, and a joint inference task; that the cost of inference is independent of database size; and that speedup in learning is possible by multi-threading.
Towards faster and better retrieval models for question search BIBAFull-Text 2139-2148
  Guangyou Zhou; Yubo Chen; Daojian Zeng; Jun Zhao
Community question answering (cQA) has become an important service due to the popularity of cQA archives on the web. This paper is concerned with the problem of question search. Question search in cQA aims to find the historical questions that are semantically equivalent or similar to the queried questions. In this paper, we propose a faster and better retrieval model for question search by leveraging the user-chosen category. After introducing the question category, we can filter out a certain number of irrelevant historical questions under a wide range of leaf categories. Experimental results on a real cQA data set demonstrate that the proposed techniques are more effective and efficient than a variety of baseline methods.

KM track: learning and applications (2)

Nonparametric Bayesian multitask collaborative filtering BIBAFull-Text 2149-2158
  Sotirios Chatzis
The dramatic rate at which new digital content becomes available has brought collaborative filtering systems to the epicenter of computer science research in the last decade. One of the greatest challenges collaborative filtering systems are confronted with is the data sparsity problem: users typically rate only very few items; thus, availability of historical data is not adequate to effectively perform prediction. To alleviate these issues, in this paper we propose a novel multitask collaborative filtering approach. Our approach is based on a coupled latent factor model of the users' rating functions, which allows for an agile information sharing mechanism that extracts much richer task-correlation information compared to existing approaches. Formulation of our method is based on concepts from the field of Bayesian nonparametrics, specifically Indian Buffet Process priors, which allow for data-driven determination of the optimal number of underlying latent features (item characteristics and user traits) assumed in the context of the model. We experiment on several real-world datasets, demonstrating both the efficacy of our method, and its superiority over existing approaches.
Local-to-global semi-supervised feature selection BIBAFull-Text 2159-2168
  Mohammed Hindawi; Khalid Benabdeslem
Variable-weighting approaches are well-known in the context of embedded feature selection. Generally, this task is performed in a global way, when the algorithm selects a single cluster-independent subset of features (global feature selection). However, there exist other approaches that aim to select cluster-specific subsets of features (local feature selection). Global and local feature selection have different objectives; nevertheless, in this paper we propose a novel embedded approach which locally weights the variables towards a global feature selection. The proposed approach is presented in the semi-supervised paradigm. Experiments on some known data sets are presented to validate our model and compare it with some representative methods.
Intelligently querying incomplete instances for improving classification performance BIBAFull-Text 2169-2178
  Karthik Sankaranarayanan; Amit Dhurandhar
The problem of intelligently acquiring missing input information given a limited number of queries to enhance classification performance has gained substantial interest in the last decade or so. This is primarily due to the emergence of the targeted advertising industry which is trying to best match products to its potential consumer base in the absence of complete consumer profile information. In this paper, we propose a novel active feature acquisition technique to tackle this problem of instance completion prevalent in these domains. We show theoretically that our technique is optimal given the current classifier and derive a probabilistic lower bound on the error reduction achieved with our technique. We also show that a simplification of our technique is equivalent to the Expected Utility approach which is one of the most sophisticated solutions for this problem in existing literature. We then demonstrate the efficacy of our approach through experiments on real data. Finally, we show that our technique can be easily extended to the scenario where we have a cost matrix associated with acquiring missing information for each instance or instance-feature combinations.
A probabilistic mixture model for mining and analyzing product search log BIBAFull-Text 2179-2188
  Huizhong Duan; ChengXiang Zhai; Jinxing Cheng; Abhishek Gattani
The boom in e-commerce in recent years has led to the generation of large amounts of product search log data. Product search logs are a unique new kind of data with much valuable information and knowledge about user preferences over product attributes that is often hard to obtain from other sources. While regular search logs (e.g., Web search logs) contain click-throughs for unstructured text documents (e.g., web pages), product search logs contain click-throughs for structured entities defined by a set of attributes and their values. For instance, a laptop can be defined by its size, color, cpu, ram, etc. Such structures in product entities offer us opportunities to mine and discover detailed useful knowledge about user preferences at the attribute level, but they also raise significant challenges for mining due to the lack of attribute-level observations.
   In this paper, we propose a novel probabilistic mixture model for attribute-level analysis of product search logs. The model is based on a generative process where queries are generated by a mixture of unigram language models defined by each attribute-value pair of a clicked entity. The model can be efficiently estimated using the Expectation-Maximization (EM) algorithm. The estimated parameters, including the attribute-value language models and attribute-value preference models, can be directly used to improve product search accuracy, or aggregated to reveal knowledge for understanding user intent and supporting business intelligence. Evaluation of the proposed model on a commercial product search log shows that the model is effective for mining and analyzing product search logs to discover various kinds of useful knowledge.
Eigenvalues perturbation of integral operator for kernel selection BIBAFull-Text 2189-2198
  Yong Liu; Shali Jiang; Shizhong Liao
Kernel selection is one of the key issues both in recent research and application of kernel methods. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. It is well known that a kernel matrix can be interpreted as an empirical version of a continuous integral operator, and its eigenvalues converge to the eigenvalues of the integral operator. In this paper, we introduce new kernel selection criteria based on the eigenvalues perturbation of the integral operator. This perturbation quantifies the difference between the eigenvalues of the kernel matrix and those of the integral operator. We establish the connection between eigenvalues perturbation and generalization error. By minimizing the derived generalization error bounds, we propose the kernel selection criteria. Therefore the kernel chosen by our proposed criteria can guarantee good generalization performance. To compute the values of our criteria, we present a method to obtain the eigenvalues of the integral operator via the Fourier transform. Experiments on benchmark datasets demonstrate that our kernel selection criteria are sound and effective.

Industry session

Beyond data: from user information to business value through personalized recommendations and consumer science BIBAFull-Text 2199-2200
  Xavier Amatriain
Since the Netflix $1 million Prize, announced in 2006, Netflix has been known for having personalization at the core of our product. Our current product offering is nowadays focused around instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search. In this invited talk I will discuss the different approaches we follow to deal with these large streams of user data in order to extract information for personalizing our service. I will describe some of the machine learning models used, and their application in the service. I will also describe our data-driven approach to innovation that combines rapid offline explorations as well as online A/B testing. This approach enables us to convert user information into real and measurable business value.
Beyond data: from user information to business value through personalized recommendations and consumer science BIBAFull-Text 2201-2208
  Xavier Amatriain
Since the Netflix $1 million Prize, announced in 2006, Netflix has been known for having personalization at the core of our product. Our current product offering is nowadays focused around instant video streaming, and our data is now many orders of magnitude larger. Not only do we have many more users in many more countries, but we also receive many more streams of data. Besides the ratings, we now also use information such as what our members play, browse, or search.
   In this paper I will discuss the different approaches we follow to deal with these large streams of user data in order to extract information for personalizing our service. I will describe some of the machine learning models used, and their application in the service. I will also describe our data-driven approach to innovation that combines rapid offline explorations as well as online A/B testing. This approach enables us to convert user information into real and measurable business value.
Leveraging data to change industry paradigms BIBAFull-Text 2209-2210
  Chris Farmer
Much of the conversation on "big data" is centered on data technologies and analytics platforms and how established companies apply them. While those technologies and platforms are certainly very important for industry incumbents, data analytics is also often a key building block for new start-up entrants looking to disrupt industry verticals. In many cases, the best examples of novel applications of data to create new services and competitive advantage require a complete rethinking of organizational design in order to create feedback loops and rethink cost structures. The company I founded, SignalFire, is applying data for competitive advantage in my own industry, venture capital, but there are myriad examples of this trend across industries such as transportation, financial services, retail, media and many other markets. In this talk, I will discuss how we analyze these trends as venture capitalists and will look at a few case studies of specific companies leveraging data to innovate in their industries.
Large-scale deep learning at Baidu BIBAFull-Text 2211-2212
  Kai Yu
In the past 30 years, tremendous progress has been achieved in building effective shallow classification models. Despite the success, we come to realize that, for many applications, the key bottleneck is not the quality of classifiers but that of features. Not being able to automatically get useful features has become the main limitation for shallow models. Since 2006, learning high-level features using deep architectures from raw data has become a huge wave of new learning paradigms. In the past two years, deep learning has made many performance breakthroughs, for example, in the areas of image understanding and speech recognition. In this talk, I will walk through some of the latest technology advances of deep learning within Baidu, and discuss the main challenges, e.g., developing effective models for various applications, and scaling up the model training using many GPUs. At the end of the talk I will discuss what might be interesting future directions.

DB track: query processing and privacy

Wondering why data are missing from query results?: ask conseil why-not BIBAFull-Text 2213-2218
  Melanie Herschel
In analyzing and debugging data transformations, or more specifically relational queries, a subproblem is to understand why some data are not part of the query result. This problem has recently been addressed from different perspectives for various fragments of relational queries. The different perspectives yield different, yet complementary explanations of such missing-answers.
   This paper first aims at unifying the different approaches by defining a new type of explanation, called hybrid explanation, that encompasses the variety of previously defined types of explanations. This solution goes beyond simply forming the union of explanations produced by different algorithms and is shown to be able to explain a larger set of missing-answers. Second, we present Conseil, an algorithm to generate hybrid explanations. Conseil is also the first algorithm to handle non-monotonic queries. Experiments on efficiency and explanation quality show that Conseil is comparable to and even outperforms previous algorithms.
Fast evaluation of iceberg pattern-based aggregate queries BIBAFull-Text 2219-2224
  Zhian He; Petrie Wong; Ben Kao; Eric Lo; Reynold Cheng
A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. For example, the template T = (X,Y,Y,X) consists of two variables X and Y. Each variable is instantiated with all possible values in its corresponding domain to derive all possible patterns of the template. Sequences are grouped based on the patterns they possess. The answer to a PBA query is a sequence cuboid (s-cuboid), which is a multidimensional array of cells. Each cell is associated with a pattern instantiated from the query's pattern template. The value of each s-cuboid cell is obtained by applying the aggregate function F to the set of data sequences that belong to that cell. Since a pattern template can involve many variables and can be arbitrarily long, the induced s-cuboid for a PBA query can be huge. For most analytical tasks, however, only iceberg cells with very large aggregate values are of interest. This paper proposes an efficient approach to identify and evaluate iceberg cells of s-cuboids. Experimental results show that our algorithms are orders of magnitude faster than existing approaches.
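As an illustration of the definitions in this abstract, the following is a minimal brute-force sketch of a PBA query with a COUNT aggregate and an iceberg threshold. It is not the paper's algorithm. The function names, the choice of contiguous matching, and the decision to let two variables bind to the same value are assumptions made for the example.

```python
from collections import Counter


def matches_at(seq, start, template):
    """Return the variable binding if `template` matches `seq` at `start`, else None."""
    binding = {}
    for offset, var in enumerate(template):
        symbol = seq[start + offset]
        # The first occurrence of a variable fixes its value; later ones must agree.
        if binding.setdefault(var, symbol) != symbol:
            return None
    return binding


def iceberg_cells(sequences, template, min_count):
    """Count, per instantiated pattern, the sequences that contain it (COUNT aggregate).

    Assumption: a sequence "possesses" a pattern when the pattern occurs as a
    contiguous run. Only cells whose count reaches `min_count` are returned.
    """
    variables = sorted(set(template))
    counts = Counter()
    for seq in sequences:
        cells_in_seq = set()
        for start in range(len(seq) - len(template) + 1):
            binding = matches_at(seq, start, template)
            if binding is not None:
                cells_in_seq.add(tuple(binding[v] for v in variables))
        counts.update(cells_in_seq)  # each sequence contributes once per cell
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Template (X, Y, Y, X) over four toy sequences of single-character symbols.
result = iceberg_cells(["abba", "xabbay", "abab", "cddc"], "XYYX", min_count=2)
```

Here each cell is keyed by the values of (X, Y). This naive version enumerates every match in every sequence, which is exactly the cost that an iceberg-aware method would try to avoid.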
Top-down keyword query processing on XML data BIBAFull-Text 2225-2230
  Junfeng Zhou; Xingmin Zhao; Wei Wang; Ziyang Chen; Jeffrey Xu Yu
Efficiently answering XML keyword queries has attracted much research effort in the last decade. Two key factors behind the inefficiency of existing methods are the common-ancestor-repetition (CAR) and visiting-useless-nodes (VUN) problems. In this paper, we propose a generic top-down processing strategy to answer a given keyword query w.r.t. LCA/SLCA/ELCA semantics. By top-down, we mean that we visit all common ancestor (CA) nodes in a depth-first, left-to-right order, thus avoiding the CAR problem; by generic, we mean that our method is independent of the labeling schemes and query semantics. We show that the satisfiability of a node v w.r.t. the given semantics can be determined by v's child nodes, based on which our methods avoid the VUN problem. We propose two algorithms that are based on either traditional inverted lists or our newly proposed LLists to improve the overall performance. The experimental results verify the benefits of our methods according to various evaluation metrics.
Efficient pruning algorithm for top-K ranking on dataset with value uncertainty BIBAFull-Text 2231-2236
  Jianwen Chen; Ling Feng
Top-K ranking queries in uncertain databases aim to find the top-K tuples according to a ranking function. The interplay between score and uncertainty makes top-K ranking in uncertain databases an intriguing issue, leading to rich query semantics. Recently, a unified ranking framework based on parameterized ranking functions (PRFs) has been formulated, which generalizes many previously proposed ranking semantics. Under the PRF-based ranking framework, efficient pruning approaches for top-K ranking on datasets with tuple uncertainty have been well studied in the literature. However, they cannot be applied to top-K ranking on datasets with value uncertainty (described through the attribute-level uncertain data model), which is often natural and useful in analyzing uncertain data in many applications. This paper aims to develop efficient pruning techniques for top-K ranking on datasets with value uncertainty under the PRF-based ranking framework, which has not been well studied in the literature. We present the mathematics of deriving the pruning techniques and the corresponding algorithms. The experimental results on both real and synthetic data demonstrate the effectiveness and efficiency of the proposed pruning techniques.
Query execution timing: taming real-time anytime queries on multicore processors BIBAFull-Text 2237-2242
  Chunyao Song; Zheng Li; Tingjian Ge; Jie Wang
Answering real-time queries, especially over probabilistic data, is becoming increasingly important for service providers. We study anytime query processing algorithms, and extend the traditional query execution plan with a timing component. Our focus is how to determine this timing component, given the queries' deadline constraints. We consider the common multicore processors. Specifically, we propose two query optimization modes: offline periodic optimization and online optimization. We devise efficient algorithms for both offline and online cases followed by a competitive analysis to show the power of our online optimization. Finally, we perform a systematic experimental evaluation using real-world datasets to verify our approaches.
Merged aggregate nearest neighbor query processing in road networks BIBAFull-Text 2243-2248
  Weiwei Sun; Chong Chen; Baihua Zheng; Chunan Chen; Liang Zhu; Weimo Liu; Yan Huang
The aggregate nearest neighbor query, which returns a common interesting point that minimizes the aggregate distance for a given query point set, is one of the most important operations in spatial databases and their application domains. This paper addresses the problem of finding the aggregate nearest neighbor for a merged set that consists of the given query point set and multiple points that must be selected from a candidate set, which we call the merged aggregate nearest neighbor (MANN) query. This paper proposes an effective algorithm to process MANN queries in road networks based on our pruning strategies. Extensive experiments are conducted to examine the behaviors of the solutions, and the overall experiments show that our strategies to minimize the response time are effective and achieve several orders of magnitude speedup compared with the baseline methods.
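To make the basic (non-merged) aggregate nearest neighbor query concrete, here is a minimal sketch. It assumes straight-line Euclidean distance as a stand-in for road-network distance and scans all candidates, so it shows only the query semantics, not the MANN algorithm or its pruning strategies. The function and parameter names are hypothetical.

```python
import math


def aggregate_nearest_neighbor(query_points, points_of_interest, aggregate=sum):
    """Return the point of interest with the smallest aggregate distance to all query points.

    `aggregate` is typically `sum` (total travel distance) or `max`
    (distance of the farthest query point).
    """
    return min(
        points_of_interest,
        key=lambda p: aggregate(math.dist(p, q) for q in query_points),
    )


# Two users at (0, 0) and (4, 0) looking for a common meeting place.
users = [(0, 0), (4, 0)]
places = [(2, 0), (10, 10), (0, 3)]
best = aggregate_nearest_neighbor(users, places)
```

In a road network the `math.dist` call would be replaced by a shortest-path distance, which is far more expensive and is what makes pruning worthwhile.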
SkyView: a user evaluation of the skyline operator BIBAFull-Text 2249-2254
  Matteo Magnani; Ira Assent; Kasper Hornbæk; Mikkel R. Jakobsen; Ken Friis Larsen
The skyline operator has recently emerged as an alternative to ranking queries. It retrieves a number of potential best options for arbitrary monotone preference functions. The success of this operator in the database community is based on the belief that users benefit from the limited effort required to specify skyline queries compared to, for instance, ranking. While application examples of the skyline operator exist, there is no principled analysis of its benefits and limitations in data retrieval tasks. Our study investigates the degree to which users understand skyline queries, how they specify query parameters and how they interact with skyline results made available in listings or map-based interfaces.
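For readers unfamiliar with the operator being evaluated, the following is a minimal sketch of a skyline computation using a quadratic pairwise comparison. It assumes that smaller values are preferred in every dimension; the hotel data and function names are hypothetical and are not taken from the study.

```python
def dominates(p, q):
    """True if p is at least as good as q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that no other point dominates (smaller is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km).
hotels = [(50, 8), (60, 3), (70, 5), (90, 1), (55, 9)]
best_options = skyline(hotels)
```

Every hotel in `best_options` is the best choice for at least one monotone preference function over price and distance, which is why the user does not need to specify weights as in a ranking query.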
UMicS: from anonymized data to usable microdata BIBAFull-Text 2255-2260
  Graham Cormode; Entong Shen; Xi Gong; Ting Yu; Cecilia M. Procopiuc; Divesh Srivastava
There is currently a tug-of-war going on surrounding data releases. On one side, there are many strong reasons pulling to release data to other parties: business factors, freedom of information rules, and scientific sharing agreements. On the other side, concerns about individual privacy pull back, and seek to limit releases. Privacy technologies such as differential privacy have been proposed to resolve this deadlock, and there has been much study of how to perform private release of data in various forms. The focus of such works has been largely on the data owner: what process should they apply to ensure that the released data preserves privacy whilst still capturing the input data distribution accurately? Almost no attention has been paid to the needs of the data user, who wants to make use of the released data within their existing suite of tools and data. The difficulty of making use of data releases is a major stumbling block for the widespread adoption of data privacy technologies.
   In this paper, instead of proposing new privacy mechanisms for data publishing, we consider the whole data release process, from the data owner to the data user. We lay out a set of principles for privacy tool design that highlights the requirements for interoperability, extensibility and scalability. We put these into practice with UMicS, an end-to-end prototype system to control the release and use of private data. An overarching tenet is that it should be possible to integrate the released data into the data user's systems with the minimum of change and cost. We describe how to instantiate UMicS in a variety of usage scenarios. We show how using data modeling techniques from machine learning can improve the utility, in particular when combined with background knowledge that the data user may possess. We implement UMicS, and evaluate it over a selection of data sets and release cases. We see that UMicS allows for very effective use of released data, while upholding our privacy principles.

IR Track

GAPfm: optimal top-n recommendations for graded relevance domains BIBAFull-Text 2261-2266
  Yue Shi; Alexandros Karatzoglou; Linas Baltrunas; Martha Larson; Alan Hanjalic
Recommender systems are frequently used in domains in which users express their preferences in the form of graded judgments, such as ratings. Current ranking techniques are based on one of two sub-optimal approaches: either they optimize for a binary metric such as Average Precision, which discards information on relevance levels, or they optimize for Normalized Discounted Cumulative Gain (NDCG), which ignores the dependence of an item's contribution on the relevance of more highly ranked items. We address the shortcomings of existing approaches by proposing GAPfm, the Graded Average Precision factor model, which is a latent factor model for top-N recommendation in domains with graded relevance data. The model optimizes the Graded Average Precision metric that has been proposed recently for assessing the quality of ranked results lists for graded relevance. GAPfm's advantages are twofold: it maintains full information about graded relevance and also addresses the limitations of models that optimize NDCG. Experimental results show that GAPfm achieves substantial improvements on the top-N recommendation task, compared to several state-of-the-art approaches.
URL tree: efficient unsupervised content extraction from streams of web documents BIBAFull-Text 2267-2272
  Borut Sluban; Miha Grcar
The Web represents the largest, and an increasingly growing, source of information. Extracting meaningful content from Web pages presents a challenging problem, already extensively addressed in the offline setting. In this work, we focus on content extraction from streams of HTML documents. We present an infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The presented pipeline consists of RSS readers for data acquisition from different Web sites, a duplicate removal component, and a novel content extraction algorithm which is efficient, unsupervised, and language-independent. Our content extraction approach is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped semi-automatically annotated dataset which was made publicly available. We compared the performance of URL Tree with that of several open source content extraction algorithms. The evaluation results show that our stream-based algorithm already starts outperforming the other algorithms after only 10 to 100 documents from a specific domain.
Estimating document focus time BIBAFull-Text 2273-2278
  Adam Jatowt; Ching-Man Au Yeung; Katsumi Tanaka
Temporality is an important characteristic of text documents. While some documents are clearly atemporal, many have temporal character and can be mapped to certain time periods. In this paper, we introduce the problem of estimating the focus time of documents. Document focus time is defined as the time to which the content of a document refers and is considered a complementary dimension to its creation time or timestamp. We propose several estimators of focus time by utilizing external knowledge bases such as news article collections which contain explicit temporal references. We then evaluate the effectiveness of our methods on diverse datasets of documents about historical events in five countries.
Faceted models of blog feeds BIBAFull-Text 2279-2284
  Lifeng Jia; Clement Yu; Weiyi Meng
Faceted blog distillation aims at retrieving the blogs that are not only relevant to a query but also exhibit a facet of interest. In this paper we consider personal and official facets. Personal blogs depict various topics related to the personal experiences of bloggers while official blogs deliver content with bloggers' commercial influences. We observe that some terms, such as nouns, usually describe the topics of posts in blogs while other terms, such as pronouns and adverbs, normally reflect the facets of posts. Thus we present a model that estimates the probabilistic distributions of topics and those of facets in posts. It leverages a classifier to separate facet terms from topical terms in the posterior inference. We also observe that the posts from a blog are likely to exhibit the same facet. So we propose another model that constrains the posts from a blog to have the same facet distributions in its generative process. Experimental results using the TREC 2009-2010 queries over the TREC Blogs08 collection show the effectiveness of both models. Our results outperform the best known results for personal and official distillation.
SRbench -- a benchmark for soundtrack recommendation systems BIBAFull-Text 2285-2290
  Aleksandar Stupar; Sebastian Michel
In this work, a benchmark to evaluate the retrieval performance of soundtrack recommendation systems is proposed. Such systems aim at finding songs that are played as background music for a given set of images. The proposed benchmark is based on preference judgments, where relevance is considered a continuous ordinal variable and judgments are collected for pairs of songs with respect to a query (i.e., set of images). To capture a wide variety of songs and images, we use a large space of possible music genres, different emotions expressed through music, and various query-image themes. The benchmark consists of two types of relevance assessments: (i) judgments obtained from a user study, that serve as a "gold standard" for (ii) relevance judgments gathered through Amazon's Mechanical Turk. We report on the performance of two state-of-the-art soundtrack recommendation systems using the proposed benchmark.
CV-PCR: a context-guided value-driven framework for patent citation recommendation BIBAFull-Text 2291-2296
  Sooyoung Oh; Zhen Lei; Wang-Chien Lee; Prasenjit Mitra; John Yen
Patent citation recommendation and prior patent search, critical for patent filing and patent examination, have become increasingly difficult due to the rapidly growing number of patents. Unlike paper citations that focus on reference comprehensiveness, patent citations tend to be more parsimonious and refer only to those prior patents bearing significant technological and/or economic value, as they define the scope of the citing patent and thus have significant legal and economic implications. Based on the insight that patent citations are important information reflecting the value of cited patents to the citing patent, we propose a heterogeneous patent citation-bibliographic network that combines patent citations (reflecting value relation) and bibliographic information (reflecting similarity relation) together. From this network, we extract various features that reflect the value of a prior patent to a query patent with regard to the context of the query patent such as its assignee, classifications, etc. We then propose a two-stage framework for patent citation recommendation. Our idea is that by exploiting those context-specific value measures of candidate patents to the query patent, the proposed framework is able to make effective patent citation recommendations. We evaluate the proposed context-guided value-driven framework using a collection of 1.8M U.S. patents. Experimental results validate our ideas and show that those value-driven features are very effective and significantly outperform two state-of-the-art methods in terms of both the precision and recall rates.
Modeling behavioral factors in interactive information retrieval BIBAFull-Text 2297-2302
  Feza Baskaya; Heikki Keskustalo; Kalervo Järvelin
In real life, information retrieval consists of sessions of one or more query iterations. Each iteration has several subtasks like query formulation, result scanning, document link clicking, document reading and judgment, and stopping. Each of the subtasks has behavioral factors associated with it. These factors include search goals and cost constraints, query formulation strategies, scanning and stopping strategies, and relevance assessment behavior. Traditional IR evaluation focuses on retrieval and result presentation methods, and interaction within a single-query session. In the present study we aim at assessing the effects of the behavioral factors on retrieval effectiveness. Our research questions include how effective human behavior employing search strategies is compared to various baselines under various search goals and time constraints. We examine both ideal as well as fallible human behavior and wish to identify robust behaviors, if any. Methodologically, we use extensive simulation of human behavior in a test collection. Our findings include that (a) human behavior using multi-query sessions may exceed comparable single-query sessions in effectiveness, (b) the same empirically observed behavioral patterns are reasonably effective under various search goals and constraints, but (c) remain on average clearly below the best possible ones. Moreover, there is no behavioral pattern for sessions that would be even close to winning in most cases; the information need (or topic) in relation to the test collection is a determining factor.
Intent models for contextualising and diversifying query suggestions BIBAFull-Text 2303-2308
  Eugene Kharitonov; Craig Macdonald; Pavel Serdyukov; Iadh Ounis
The query suggestion or auto-completion mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions to improve the mechanisms' quality can be in opposition: while the latter aims to promote suggestions that address search intents that a user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce a contextualisation framework that utilises a short-term context using the user's behaviour within the current search session, such as the previous query, the documents examined, and the candidate query suggestions that the user has discarded. This short-term context is used to contextualise and diversify the ranking of query suggestions, by modelling the user's information need as a mixture of intent-specific user models. The evaluation is performed offline on a set of approximately 1.0M test user sessions. Our results suggest that the proposed approach significantly improves query suggestions compared to the baseline approach.
Building user profiles from topic models for personalised search BIBAFull-Text 2309-2314
  Morgan Harvey; Fabio Crestani; Mark J. Carman
Personalisation is an important area in the field of IR that attempts to adapt ranking algorithms so that the results returned are tuned towards the searcher's interests. In this work we use query logs to build personalised ranking models in which user profiles are constructed based on the representation of clicked documents over a topic space. Instead of employing a human-generated ontology, we use novel latent topic models to determine these topics. Our experiments show that by subtly introducing user profiles as part of the ranking algorithm, rather than by re-ranking an existing list, we can provide personalised ranked lists of documents which improve significantly over a non-personalised baseline. Further examination shows that the performance of the personalised system is particularly good in cases where prior knowledge of the search query is limited.
Transferring knowledge with source selection to learn IR functions on unlabeled collections BIBAFull-Text 2315-2320
  Parantapa Goswami; Massih R. Amini; Eric Gaussier
We investigate the problem of learning an IR function on a collection without relevance judgements (called the target collection) by transferring knowledge from a selected source collection with relevance judgements. To do so, we first construct, for each query in the target collection, relative relevance judgment pairs using information from the source collection closest to the query (selection and transfer steps), and then learn an IR function from the obtained pairs in the target collection (self-learning step). For the transfer step, the relevance information in the source collection is summarized as a grid that provides, for each pair of term frequency and document frequency values of a word in a document, an empirical estimate of the relevance of the document. The self-learning step iteratively assigns pairwise preferences to documents in the target collection using the scores of the previously learned function. We show the effectiveness of our approach through a series of extensive experiments on CLEF and several collections from TREC used either as target or source datasets. Our experiments show the importance of selecting the source collection prior to transferring information to the target collection, and demonstrate that the proposed approach yields results consistently and significantly above state-of-the-art IR functions.
Understanding how people interact with web search results that change in real-time using implicit feedback BIBAFull-Text 2321-2326
  Jin Young Kim; Mark Cramer; Jaime Teevan; Dmitry Lagun
The way a searcher interacts with query results can reveal a lot about what is being sought. Considerable research has gone into using implicit relevance feedback to identify relevant content in real-time, but little is known about how to best present this newly identified relevant content to users. In this paper we compare a traditional search interface with one that dynamically re-ranks and recommends search results as the user interacts with it in order to build a picture of how and when users should be offered dynamically identified relevant content. We present several studies that compare logged behavior for hundreds of thousands of users and millions of queries as well as self-reported measures of success across the two interaction models. Compared to traditional web search, users presented with dynamically ranked results exhibit higher engagement and find information faster, particularly during exploratory tasks. These findings have implications for how search engines might best exploit implicit feedback in real-time in order to help users identify the most relevant results as quickly as possible.
Facet selection algorithms for web product search BIBAFull-Text 2327-2332
  Damir Vandic; Flavius Frasincar; Uzay Kaymak
Multifaceted search is a commonly used interaction paradigm in e-commerce applications, such as Web shops. Because of the large amount of possible product attributes, Web shops usually make use of static information to determine which facets should be displayed. Unfortunately, this approach does not take into account the user query, leading to a non-optimal facet drill down process. In this paper, we focus on automatic facet selection, with the goal of minimizing the number of steps needed to find the desired product. We propose several algorithms for facet selection, which we evaluate against the state-of-the-art algorithms from the literature. We implement our approach in a Web application called faccy.net. The evaluation is based on simulations employing 1000 queries, 980 products, 487 facets, and three drill down strategies. As evaluation metrics we use the average number of clicks, the average utility, and the top-10 promotion percentage. The results show that the Probabilistic Entropy algorithm significantly outperforms the other considered algorithms.
Learning deep structured semantic models for web search using clickthrough data BIBAFull-Text 2333-2338
  Po-Sen Huang; Xiaodong He; Jianfeng Gao; Li Deng; Alex Acero; Larry Heck
Latent semantic models, such as LSA, intend to map a query to its relevant documents at the semantic level where keyword-based matching often fails. In this study we strive to develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search applications, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle large vocabularies which are common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models, which were considered state-of-the-art in the performance prior to the work presented in this paper.
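The abstract does not spell out how word hashing works. The sketch below assumes the letter-trigram form in which the technique is commonly described: each word is wrapped in boundary markers and broken into overlapping three-character pieces, so an open-ended vocabulary maps onto a bounded set of trigrams. The deep network itself is omitted; the cosine similarity here is applied directly to raw trigram counts purely to show the input representation, and all function names are hypothetical.

```python
import math
from collections import Counter


def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hash_text(text):
    """Represent a query or document as a bag of letter trigrams."""
    return Counter(tri for word in text.split() for tri in letter_trigrams(word))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


query_vec = hash_text("deep learning")
doc_vec = hash_text("Learning deep models")
similarity = cosine(query_vec, doc_vec)
```

In a model of the kind the abstract describes, vectors like `query_vec` and `doc_vec` would be fed through learned nonlinear layers, and the similarity would be computed between the resulting low-dimensional projections rather than between the raw counts.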
Learning open-domain comparable entity graphs from user search queries BIBAFull-Text 2339-2344
  Ziheng Jiang; Lei Ji; Jianwen Zhang; Jun Yan; Ping Guo; Ning Liu
A frequent behavior of internet users is to compare various comparable entities for decision making. For instance, a user may compare products such as the iPhone 5 and the Lumia 920 before deciding which cellphone to buy. However, it is a challenging problem to know which entities are generally comparable from the users' viewpoints in the open-domain Web. In this paper, we propose a novel solution, called Comparable Entity Graph Mining (CEGM), to learn an open-domain comparable entity graph from user search queries. CEGM first mines seed comparable entity pairs from user search queries automatically using predefined query patterns. Next, it discovers more entity pairs with a confidence classifier in a bootstrapping fashion. Newly discovered entity pairs are organized into an open-domain comparable entity graph. Based on our empirical study over 1 billion queries of a commercial search engine, we build a comparable entity graph which covers 73.4% of the queries in the top 50 million unique queries of a commercial search engine. Through manual labeling of sampled sub-graphs, the average precision of comparable entities is 89.4%. As an application of the learned entity graph, entity recommendation in Web search is empirically studied.
RAProp: ranking tweets by exploiting the tweet/user/web ecosystem and inter-tweet agreement BIBAFull-Text 2345-2350
  Srijith Ravikumar; Kartik Talamadupula; Raju Balakrishnan; Subbarao Kambhampati
The increasing popularity of Twitter renders improved trustworthiness and relevance assessment of tweets much more important for search. However, given the limitations on the size of tweets, it is hard to extract measures for ranking from the tweets' content alone. We present a novel ranking method called RAProp, which combines two orthogonal measures of relevance and trustworthiness of a tweet. The first, called Feature Score, measures the trustworthiness of the source of the tweet by extracting features from a 3-layer Twitter ecosystem consisting of users, tweets and webpages. The second measure, called agreement analysis, estimates the trustworthiness of the content of a tweet by analyzing whether the content is independently corroborated by other tweets. We view the candidate result set of tweets as the vertices of a graph, with the edges measuring the estimated agreement between each pair of tweets. The feature score is propagated over this agreement graph to compute the top-k tweets that have both trustworthy sources and independent corroboration. The evaluation of our method on 16 million tweets from the TREC 2011 Microblog Dataset shows that for top-30 precision, we achieve 53% better precision than the current best performing method on the data set, and an improvement of 300% over current Twitter Search.
Incorporating the surfing behavior of web users into pagerank BIBAFull-Text 2351-2356
  Shatlyk Ashyralyyev; B. Barla Cambazoglu; Cevdet Aykanat
In large-scale commercial web search engines, estimating the importance of a web page is a crucial ingredient in ranking web search results. So far, to assess the importance of web pages, two different types of feedback have been taken into account, independent of each other: the feedback obtained from the hyperlink structure among the web pages (e.g., PageRank) or the web browsing patterns of users (e.g., BrowseRank). Unfortunately, both types of feedback have certain drawbacks. While the former lacks the user preferences and is vulnerable to malicious intent, the latter suffers from sparsity and hence low web coverage. In this work, we combine these two types of feedback under a hybrid page ranking model in order to alleviate the above-mentioned drawbacks. Our empirical results indicate that the proposed model leads to better estimation of page importance according to an evaluation metric that relies on user click feedback obtained from web search query logs. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits).
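As background for the link-based feedback mentioned in this abstract, here is a minimal power-iteration sketch of plain PageRank. It is not the hybrid model proposed in the paper, and it uses no browsing data. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outgoing links are conventional choices assumed for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    `links` maps each page to the list of pages it links to. Rank held by
    pages with no outgoing links is spread evenly over all pages.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        # Every page receives the teleport share plus its share of dangling rank.
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# A three-page toy graph: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

A hybrid model of the kind the abstract describes would additionally draw on observed page visits; how the two signals are combined is specific to the paper and is not reproduced here.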
Question routing to user communities BIBAFull-Text 2357-2362
  Aditya Pal; Fei Wang; Michelle X. Zhou; Jeffrey Nichols; Barton A. Smith
An online community consists of a group of users who share a common interest, background, or experience and whose collective goal is to contribute towards the welfare of the community members. Question answering is an important feature that enables community members to exchange knowledge within the community boundary. The overwhelming number of communities necessitates a good question routing strategy so that new questions get routed to the appropriately focused community and thus get resolved. In this paper, we consider the novel problem of routing questions to the right community and propose a framework to select the right set of communities for a question. We begin by using several previously proposed features for users and add some additional features, namely language attributes and inclination to respond, for community modeling. Then we introduce two k-nearest-neighbor-based aggregation algorithms for computing community scores. We show how these scores can be combined to recommend communities and test the effectiveness of the recommendations over a large real-world dataset.
Learning to rank for question routing in community question answering BIBAFull-Text 2363-2368
  Zongcheng Ji; Bin Wang
This paper focuses on the problem of Question Routing (QR) in Community Question Answering (CQA), which aims to route newly posted questions to the potential answerers who are most likely to answer them. Traditional methods to solve this problem only consider the text similarity features between the newly posted question and the user profile, while ignoring the important statistical features, including the question-specific statistical feature and the user-specific statistical features. Moreover, traditional methods are based on unsupervised learning, which makes it difficult to incorporate such rich features. This paper proposes a general framework for QR based on learning-to-rank concepts. Training sets consisting of triples (q, asker, answerers) are first collected. Then, by introducing the intrinsic relationships between the asker and the answerers in each CQA session to capture the intrinsic labels/orders of the users with respect to their degree of expertise on the question q, two different methods, SVM-based and RankingSVM-based, are presented to learn the models with different example creation processes from the training set. Finally, the potential answerers are ranked using the trained models. Extensive experiments conducted on a real-world CQA dataset from Stack Overflow show that both proposed methods outperform the traditional query likelihood language model (QLLM) as well as the state-of-the-art Latent Dirichlet Allocation based model (LDA). Specifically, the RankingSVM-based method achieves statistically significant improvements over the SVM-based method and attains the best performance.

KM track: entities, tags, and time series

Re-ranking for joint named-entity recognition and linking BIBAFull-Text 2369-2374
  Avirup Sil; Alexander Yates
Recognizing names and linking them to structured data is a fundamental task in text analysis. Existing approaches typically perform these two steps using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an Entity Linking (EL) system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We present a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL provides a 60% reduction in error over the next-best NER system and a 68% reduction in error over the next-best EL system.
Identifying salient entities in web pages BIBAFull-Text 2375-2380
  Michael Gamon; Tae Yano; Xinying Song; Johnson Apacible; Patrick Pantel
We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of entities are central to the document, which can lead to degraded relevance for entity triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To cost-effectively train the models, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.
Recommending tags with a model of human categorization BIBAFull-Text 2381-2386
  Paul Seitlinger; Dominik Kowald; Christoph Trattner; Tobias Ley
When interacting with social tagging systems, humans exercise complex processes of categorization that have been the topic of much research in cognitive science. In this paper we present a recommender approach for social tags derived from ALCOVE, a model of human category learning. The basic architecture is a simple three-layer connectionist model. The input layer encodes patterns of semantic features of a user-specific resource, such as latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We attribute this to the fact that our approach processes semantic information (either latent topics or external categories) across the three different layers. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes.
Automatically generating descriptions for resources by tag modeling BIBAFull-Text 2387-2392
  Bin Bi; Junghoo Cho
We have been witnessing an increasing number of social tagging systems on the web. Tags help users understand a resource readily and accurately. In a social tagging system, however, there is typically a fairly large number of resources, each associated with a long list of tags. When browsing resources, users are reluctant to read these tags one by one. Instead, users prefer a shorter list of tags as a compact description of a resource. Such a tag description helps users understand the resource accurately and effortlessly. This calls for a generator for a tag description, which selects a set of high-quality tags for a given resource. The tag description condenses the original tag list by retaining the most important tags of the long list.
   We propose that a good generator should go beyond pure tag popularity and towards diversifying a tag description. In this paper, we present a general framework of selecting a set of k tags as the description for a given resource. In addition, a generative model BTM is proposed to model users' tagging process. The experimental results on real-world tagging data confirm the effectiveness of the proposed approach in social tagging systems, showing significant improvement over the other baselines.
Mining characteristic multi-scale motifs in sensor-based time series BIBAFull-Text 2393-2398
  Ugo Vespier; Siegfried Nijssen; Arno Knobbe
More and more, physical systems are being fitted with various kinds of sensors in order to monitor their behavior, health or intensity of use. The large quantities of time series data collected from these complex systems often exhibit two important characteristics: the data is a combination of various superimposed effects operating at different time scales, and each effect shows a fair degree of repetition. Each of these effects can be described by a small collection of motifs: recurring temporal patterns in the data. We propose a method to discover characteristic and potentially overlapping motifs at multiple time scales, taking into account systemic deformations and temporal warping. Our method is based on a combination of scale-space theory and the Minimum Description Length principle. We show its effectiveness on two time series datasets from real world applications.
Efficient forecasting for hierarchical time series BIBAFull-Text 2399-2404
  Lars Dannecker; Robert Lorenz; Philipp Rösch; Wolfgang Lehner; Gregor Hackenbroich
Forecasting is used as the basis for business planning in many application areas such as energy, sales and traffic management. Time series data used in these areas is often hierarchically organized and thus aggregated along the hierarchy levels based on their dimensional features. Calculating forecasts in these environments is very time consuming, because forecasting consistency must be ensured between hierarchy levels. To increase the forecasting efficiency for hierarchically organized time series, we introduce a novel forecasting approach that takes advantage of the hierarchical organization. Specifically, we reuse the forecast models maintained on the lowest level of the hierarchy to almost instantly create already estimated forecast models on higher hierarchical levels. In addition, we define a hierarchical communication framework, increasing the communication flexibility and efficiency. Our experiments show significant runtime improvements for creating a forecast model at higher hierarchical levels, while still providing a very high accuracy.
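To make the notion of forecasting consistency between hierarchy levels concrete, here is a minimal sketch of plain bottom-up aggregation, in which every parent forecast is the sum of its children's forecasts. It is a generic illustration rather than the model-reuse approach of the paper; the function name `aggregate_bottom_up`, the `parent_of` mapping, and the sample numbers are assumptions.

```python
def aggregate_bottom_up(base_forecasts, parent_of):
    """Sum leaf-level forecasts upward so each parent equals the sum of its children.

    base_forecasts maps a leaf node to a list of forecast values (one per period).
    parent_of maps a node to its parent; the root has no entry.
    """
    totals = {node: list(values) for node, values in base_forecasts.items()}
    for leaf, values in base_forecasts.items():
        node = parent_of.get(leaf)
        # Walk from the leaf to the root, adding the leaf forecast to each ancestor.
        while node is not None:
            if node not in totals:
                totals[node] = [0.0] * len(values)
            totals[node] = [a + b for a, b in zip(totals[node], values)]
            node = parent_of.get(node)
    return totals


# Hypothetical two-level sales hierarchy: stores -> regions -> total.
base = {
    "store_a": [10.0, 12.0],
    "store_b": [5.0, 6.0],
    "store_c": [1.0, 2.0],
}
parents = {
    "store_a": "north",
    "store_b": "north",
    "store_c": "south",
    "north": "total",
    "south": "total",
}
forecasts = aggregate_bottom_up(base, parents)
```

Because higher levels are derived purely from the lowest level, the forecasts are consistent across levels by construction, at the cost of ignoring any information available only at the aggregate levels.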
Extraction and integration of web data by end-users BIBAFull-Text 2405-2410
  Sudhir Agarwal; Michael Genesereth
For increasingly sophisticated use cases end users often need to extract, combine, and aggregate information from various (often dynamically generated) web pages from multiple websites. Current search engines do not focus on combining information from various web pages in order to answer the overall information need of the user. Semantic Web and Linked Data usually take a static view on the data and rely on providers' cooperation. In this paper, we present a novel approach that enables end users to easily extract data from web pages while they browse, store it locally in their browser as well as structure, integrate and search such data. We propose Datalog rules for integrating and searching the extracted data. We show how cleaning steps and integration rules can be reused to accelerate the cleaning and integration of extracted data. The proposed approach is implemented as a browser plugin. We present its implementation details and report on our evaluation of the plugin concerning user experience and browsing time saving.

KM track: mining and learning

pEDM: online-forecasting for smart energy analytics BIBAFull-Text 2411-2416
  Lars Dannecker; Philipp Rösch; Ulrike Fischer; Gordon Gaumnitz; Wolfgang Lehner; Gregor Hackenbroich
Continuous balancing of energy demand and supply is a fundamental prerequisite for the stability of energy grids and requires accurate forecasts of electricity consumption and production at any point in time. Today's Energy Data Management (EDM) systems already provide accurate predictions, but typically employ a very time-consuming and inflexible forecasting process. However, emerging trends such as intra-day trading and an increasing share of renewable energy sources need a higher forecasting efficiency. Additionally, the wide variety of applications in the energy domain pose different requirements with respect to runtime and accuracy and thus, require flexible control of the forecasting process. To solve this issue, we introduce our novel online forecasting process as part of our EDM system called pEDM. The online forecasting process rapidly provides forecasting results and iteratively refines them over time. Thus, we avoid long calculation times and allow applications to adapt the process to their needs. Our evaluation shows that our online forecasting process offers a very efficient and flexible way of providing forecasts to the requesting applications.
An efficient probabilistic framework for multi-dimensional classification BIBAFull-Text 2417-2422
  Iyad Batal; Charmgil Hong; Milos Hauskrecht
The objective of multi-dimensional classification is to learn a function that accurately maps each data instance to a vector of class labels. Multi-dimensional classification appears in a wide range of applications including text categorization, gene functionality classification, semantic image labeling, etc. Usually, in such problems, the class variables are not independent, but rather exhibit conditional dependence relations among them. Hence, the key to the success of multi-dimensional classification is to effectively model such dependencies and use them to facilitate the learning. In this paper, we propose a new probabilistic approach that represents class conditional dependencies in an effective yet computationally efficient way. Our approach uses a special tree-structured Bayesian network model to represent the conditional joint distribution of the class variables given the feature variables. We develop and present efficient algorithms for learning the model from data and for performing exact probabilistic inferences on the model. Extensive experiments on multiple datasets demonstrate that our approach achieves highly competitive results when it is compared to existing state-of-the-art methods.
OMS-TL: a framework of online multiple source transfer learning BIBAFull-Text 2423-2428
  Liang Ge; Jing Gao; Aidong Zhang
Transfer learning has benefitted many real-world applications where labeled data are abundant in source domains but scarce on the target domain. As there are usually multiple relevant domains where knowledge can be transferred, Multiple Source Transfer Learning (MSTL) has recently attracted much attention. Most existing MSTL methods work in an offline fashion in that they have to store all the data on the target domain before learning. However, in some time-critical applications where the data arrive sequentially in large volume, a fast and scalable online method that can transfer knowledge from multiple source domains is much needed. To this end, we propose a new framework of Online Multiple Source Transfer Learning (OMS-TL). The framework is based on a convex optimization problem where knowledge transferred from multiple source domains is guided by the information on the target domain. The proposed method is fast, scalable and enjoys the theoretical guarantees of standard online algorithms. Extensive experiments are conducted on three real-life data sets. The results show that the performance of OMS-TL is close to that of its offline counterpart, which bears comparable performance to existing baseline methods. Furthermore, the proposed method has great scalability and fast response time.
Discovering and managing quantitative association rules BIBAFull-Text 2429-2434
  Chunyao Song; Tingjian Ge
Although association rule mining has been studied in the literature for quite a while and numerical attributes are prevalent, perhaps surprisingly, the state-of-the-art quantitative association rule mining is rather inefficient and ineffective in discovering all useful rules. In this paper, we propose a novel divide and conquer two-phase algorithm, which is guaranteed to find all good rules efficiently. We further devise an optimization technique for performance. Moreover, we discuss a few issues with managing and using the discovered quantitative association rules. We perform a comprehensive experimental study which shows that our algorithm is one to two orders of magnitude faster than the state-of-the-art one. In addition, we discover significantly more rules that are useful for prediction.
Combining one-class classifiers via meta learning BIBAFull-Text 2435-2440
  Eitan Menahem; Lior Rokach; Yuval Elovici
Selecting the best classifier among the available ones is a difficult task, especially when only instances of one class exist. In this work we examine the notion of combining one-class classifiers as an alternative for selecting the best classifier. In particular, we propose two one-class classification performance measures to weigh classifiers and show that a simple ensemble that implements these measures can outperform the most popular one-class ensembles. Furthermore, we propose a new one-class ensemble scheme, TUPSO, which uses meta-learning to combine one-class classifiers. Our experiments demonstrate the superiority of TUPSO over all other tested ensembles and show that the TUPSO performance is statistically indistinguishable from that of the hypothetical best classifier.
Scalable bootstrapping for Python BIBAFull-Text 2441-2446
  Peter Birsinger; Richard Xia; Armando Fox
High-level productivity languages such as Python, Matlab, and R are popular choices for scientists doing data analysis. However, for today's increasingly large datasets, applications written in these languages may run too slowly, if at all. In such cases, an experienced programmer must typically rewrite the application in a less-productive performant language such as C or C++, but this work is intricate, tedious, and often non-reusable. To bridge this gap between programmer productivity and performance, we extend an existing framework that uses just-in-time code generation and compilation. This framework uses the SEJITS methodology (Selective Embedded Just-In-Time Specialization [11]), converting programs written in domain specific embedded languages (DSELs) to programs in languages suitable for high performance or parallel computation.
   We present a Python DSEL for a recently developed, scalable bootstrapping method; the DSEL executes efficiently in a distributed cluster. In previous work [18], Prasad et al. created a DSEL compiler for the same DSEL (with minor differences) to generate OpenMP or Cilk code. In this work, we create a new DSEL compiler which instead emits code to run on Spark [16], a distributed processing framework. Using two example applications of bootstrapping, we show that the resulting distributed code achieves near-perfect strong scaling from 4 to 32 eight-core computers (32 to 256 cores) on datasets up to hundreds of gigabytes in size. With our DSEL, a data scientist can write a single program in serial Python that can run "toy" problems in plain Python, non-toy problems fitting on a single computer in OpenMP or Cilk, and non-toy problems with large datasets on a multi-computer Spark installation.
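For readers unfamiliar with bootstrapping, the following serial sketch shows the classic procedure of resampling with replacement to estimate the standard error of a statistic. It is plain standard-library Python and is not the DSEL or the scalable method described in the abstract; the function name `bootstrap_std_error` and its parameters are assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_std_error(data, estimator=statistics.mean, num_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` via the classic bootstrap.

    Each resample draws len(data) points from data with replacement; the
    standard deviation of the estimator across resamples is returned.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same answer
    n = len(data)
    estimates = []
    for _ in range(num_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
std_error = bootstrap_std_error(sample, num_resamples=200)
```

Every resample touches as many points as the full dataset, which is why the serial version becomes slow on large inputs and why distributing the work is attractive.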
FIRE: interactive visual support for parameter space-driven rule mining BIBAFull-Text 2447-2452
  Abhishek Mukherji; Xika Lin; Jason Whitehouse; Christopher R. Botaish; Elke A. Rundensteiner; Matthew O. Ward
While significant strides have been made on efficient association rule mining, the usability of mining systems woefully lags behind. In particular, the usability of rule mining systems is limited by the lack of support for interactive exploration of the relationships among rule results produced with various parameter settings. Based on a novel parameter space-driven approach, our proposed Framework for Interactive Rule Exploration (FIRE) addresses the usability shortcoming. FIRE features innovative visual displays and effective interactions that enable analysts to conduct rule exploration at the speed of thought. In particular, the parameter space view (PSpace) displays the distribution of rules produced for diverse parameter settings. This not only facilitates user parameter selection but also empowers analysts to understand rule relationships in the parameter space context. Our user study with 22 subjects establishes the usability and effectiveness of the proposed features and interactions of FIRE using benchmark datasets. Overall, this research encompasses significant contributions at the intersection of data mining, knowledge management and visual analytics.

Demo session

Consumer-centric SLA manager for cloud-hosted databases BIBAFull-Text 2453-2456
  Liang Zhao; Sherif Sakr; Anna Liu
We present an end-to-end framework for consumer-centric SLA management of virtualized database servers. The framework facilitates adaptive and dynamic provisioning of the database tier of the software applications based on application-defined policies for satisfying their own SLA performance requirements, avoiding the cost of any SLA violation and controlling the monetary cost of the allocated computing resources. In this framework, the SLAs of the consumer applications are declaratively defined in terms of goals which are subject to a number of constraints that are specific to the application requirements. The framework continuously monitors the application-defined SLA and automatically triggers the execution of necessary corrective actions (scaling out/in the database tier) when required. The framework is database platform-agnostic, uses virtualization-based database replication mechanisms and requires zero source code changes of the cloud-hosted application.
TerraFly GeoCloud: online spatial data analysis system BIBAFull-Text 2457-2460
  Yun Lu; Mingjin Zhang; Tao Li; Chang Liu; Erik Edrosa; Naphtali Rishe
With the exponential growth of the usage of web map services, geo data analysis has become more and more popular. This paper develops an online Spatial Data Analysis System, TerraFly GeoCloud, which enables end users to visualize and analyze spatial data, and to share the analysis results. Built on the TerraFly Geo spatial database, TerraFly GeoCloud is an extra layer running on top of the TerraFly map, supporting many different visualization functions and spatial data analysis models. TerraFly GeoCloud also enables the MapQL technology to create maps using SQL-like statements. The TerraFly GeoCloud system is available at http://terrafly.fiu.edu/GeoCloud/.
MetKB: enriching RDF knowledge bases with web entity-attribute tables BIBAFull-Text 2461-2464
  Haoqiong Bian; Yueguo Chen; Xiaoyong Du; Xiaolu Zhang
There are many entity-attribute tables on the Web that can be utilized for enriching the entities of an RDF knowledge base. This requires schema mapping (matching) between the Web tables and the RDF knowledge base. In this paper, we propose a feasible solution that is able to automatically search and rank entity-attribute tables from the Web, and effectively map the extracted tables to the RDF knowledge base with very little manual effort.
READFAST: high-relevance search-engine for big text BIBAFull-Text 2465-2468
  Michael Gubanov; Anna Pyayt
Relevance of search-results is a key factor for any search engine. In order to return and rank the Web-pages that are most relevant to the query, contemporary search engines use complex ranking functions that depend on hundreds of features. Examples include the presence or absence of the query keywords on the page, their proximity and frequencies, and HTML markup, to name just a few. Additional features might include fonts, tags, hyperlinks, metadata, and parts of the Web-page description. All this information is used by the search-engine to rank HTML Web pages returned to the user, but is unfortunately absent in free text that has no HTML markup, tags, hyperlinks, or any other metadata, except implicit natural language structure.
   Here we demonstrate one of the first Big text search engines that leverages hidden structure of the natural language sentences in order to process user queries and return more relevant search-results than a standard keyword-search. It provides a structured index extracted from the text using Natural Language Processing (NLP) that can be used to browse and query free text.
FusionDB: conflict management system for small-science databases BIBAFull-Text 2469-2472
  Karim Ibrahim; Nathaniel Selvo; Mohamad El-Rifai; Mohamed Eltabakh
In this paper, we demonstrate the FusionDB system, an extended relational database engine for managing conflicts in small-science databases. In small sciences, groups, each consisting of a few scientists, may share and exchange parts of their own databases among each other to foster collaboration. The goal of such sharing, especially when done at early stages of the discovery process, is not to build a warehouse or a unified schema; instead, the goal is to compare and verify results, detect and assess conflicts, and possibly modify or re-design the discovery process. FusionDB is designed to meet the requirements and address the challenges of such a sharing model. We will demonstrate the key functionalities of FusionDB including: (1) Detecting conflicts using a rule-based model over heterogeneous schemas, (2) Assessing conflicts and providing probabilistic estimates for values' correctness, (3) Extended querying capabilities in the presence of conflicts, and (4) Providing curation operations to help scientists resolve and investigate conflicts according to different priorities. FusionDB is realized on top of the PostgreSQL DBMS.
GeCo: an online personal data generator and corruptor BIBAFull-Text 2473-2476
  Khoi-Nguyen Tran; Dinusha Vatsalan; Peter Christen
We demonstrate GeCo, an online personal data GEnerator and COrruptor that facilitates the creation of realistic personal data ranging from names, addresses, and dates, to social security and credit card numbers, as well as numerical values such as salary or blood pressure. Using an intuitive Web interface, a user can create records containing such data according to their needs, and apply various corruption functions to generate duplicates of these records. Synthetic personal data are increasingly required in areas such as record de-duplication, fraud detection, cloud computing, and health informatics, where data quality issues can significantly affect the outcomes of data integration, processing, and mining projects. Privacy concerns, however, often make it difficult for researchers to obtain real data that contain personal details. Compared to other data generators that have to be downloaded, installed and customized, GeCo allows the creation of personal data with much less effort. In this demonstration we show (1) how different types of attributes, and dependencies between them, can be specified; (2) how the generated data can be modified using various types of corruption functions; and (3) how a user can contribute to GeCo by providing attribute generation functions and look-up files. We believe GeCo will be a valuable tool for researchers that require realistic personal data to evaluate their algorithms with regard to efficiency and effectiveness.
DeExcelerator: a framework for extracting relational data from partially structured documents BIBAFull-Text 2477-2480
  Julian Eberius; Christoper Werner; Maik Thiele; Katrin Braunschweig; Lars Dannecker; Wolfgang Lehner
Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
Demonstrating intelligent crawling and archiving of web applications BIBAFull-Text 2481-2484
  Muhammad Faheem; Pierre Senellart
We demonstrate here a new approach to Web archival crawling, based on an application-aware helper that drives crawls of Web applications according to their types (especially, according to their content management systems). By adapting the crawling strategy to the Web application type, one is able to crawl a given Web application (say, a given forum or blog) with fewer requests than traditional crawling techniques. Additionally, the application-aware helper is able to extract semantic content from the Web pages crawled, which results in a Web archive of richer value to an archive user. In our demonstration scenario, we invite a user to compare application-aware crawling to regular Web crawling on the Web site of their choice, both in terms of efficiency and of experience in browsing and searching the archive.
iNewsBox: modeling and exploiting implicit feedback for building personalized news radio BIBAFull-Text 2485-2488
  Yanan Xie; Liang Chen; Kunyang Jia; Lichuan Ji; Jian Wu
Online news reading has become the major way to learn about the world, as the web provides more information than other media such as TV and radio. However, the traditional online news reading interface is inconvenient for many people, especially those who are disabled or taking a bus. This paper presents a mobile application, iNewsBox, that enables users to listen to news collected from the Internet. In order to simplify the interactions needed to obtain valuable news, we also propose a framework for using implicit feedback to recommend news. Our experiment shows that the algorithms in iNewsBox are effective.
SportSense: using motion queries to find scenes in sports videos BIBAFull-Text 2489-2492
  Ihab Al Kabary; Heiko Schuldt
We present SportSense, a system for interactive sports video retrieval using sketch-based motion queries. SportSense is based on sports videos of games, enriched with an overlay of metadata that incorporates spatio-temporal information about various events and movements. We present how sketch-based motion queries are formulated and executed, as well as the use of various intuitive input interfaces to acquire the query object. The system uses spatio-temporal index structures to facilitate interactive response times.
PredictionIO: a distributed machine learning server for practical software development BIBAFull-Text 2493-2496
  Simon Chan; Thomas Stone; Kit Pang Szeto; Ka Hou Chan
One of the biggest challenges for software developers to build real-world predictive applications with machine learning is the steep learning curve of data processing frameworks, learning algorithms and scalable system infrastructure. We present PredictionIO, an open source machine learning server that comes with a step-by-step graphical user interface for developers to (i) evaluate, compare and deploy scalable learning algorithms, (ii) tune hyperparameters of algorithms manually or automatically and (iii) evaluate model training status. The system also comes with an Application Programming Interface (API) to communicate with software applications for data collection and prediction retrieval. The whole infrastructure of PredictionIO is horizontally scalable with a distributed computing component based on Hadoop. The demonstration shows a live example and workflows of building real-world predictive applications with the graphical user interface of PredictionIO, from data collection, algorithm tuning and selection, model training and re-training to real-time prediction querying.
Exploring XML data is as easy as using maps BIBAFull-Text 2497-2500
  Yong Zeng; Zhifeng Bao; Guoliang Li; Tok Wang Ling
For keyword search on XML data, traditionally, a list of query results in the form of subtrees will be returned to users. However, we find that it is still not sufficient to meet users' information needs because: (1) the search intention of a certain keyword query varies from person to person; (2) the query results may have sibling or containment relationships among them (in the context of the whole XML database), which could be important for users to digest the query results and should be shown to users. Therefore, we try to equip the traditional XML keyword search engine with our new exploration model XMAP, providing users with an interactive yet novel way to explore the results with a better user experience.
Inside the world's playlist BIBAFull-Text 2501-2504
  Wouter Weerkamp; Manos Tsagkias; Maarten de Rijke
We describe Streamwatchr, a real-time system for analyzing the music listening behavior of people around the world. Streamwatchr collects music-related tweets, extracts artists and songs, and visualizes the results in three ways: (i) currently trending songs and artists, (ii) newly discovered songs, and (iii) popularity statistics per country and world-wide for both songs and artists.
Detecting and exploring clusters in attributed graphs: a plugin for the gephi platform BIBAFull-Text 2505-2508
  Brigitte Boden; Roman Haag; Thomas Seidl
Clustering graph data has gained much attention in recent years, as data represented as graphs is ubiquitous in today's applications. For many applications, besides the mere graph data also further information about the vertices of a graph is available, which can be represented as attribute vectors. Recently, combined clustering approaches were introduced, which consider graph information and attribute vectors simultaneously for clustering. The visualization of clustering results can help users to get a better understanding of the results. In this paper, we introduce the GC-Viz system, which is implemented as a plugin for the Gephi platform. GC-Viz allows the user to test the combined clustering methods GAMer and DB-CSC on their data and to visualize and explore the clustering results. Furthermore, GC-Viz enables the user to visually compare the results of different clustering algorithms on the same dataset.
Cloud Armor: a platform for credibility-based trust management of cloud services BIBAFull-Text 2509-2512
  Talal H. Noor; Quan Z. Sheng; Anne H. H. Ngu; Abdullah Alfazi; Jeriel Law
Trust management of cloud services has emerged as an important research issue in recent years, and it poses significant challenges because of the highly dynamic, distributed, and non-transparent nature of cloud services. This paper describes Cloud Armor, a platform for credibility-based trust management of cloud services. The platform provides a crawler for automatic cloud services discovery, an adaptive and robust credibility model for measuring the credibility of feedback, and a trust-based recommender to recommend the most trustworthy cloud services to users. This paper presents the motivation, system design, implementation, and a demonstration of the Cloud Armor platform.
Human computing games for knowledge acquisition BIBAFull-Text 2513-2516
  Sarath Kumar Kondreddi; Peter Triantafillou; Gerhard Weikum
Automatic information extraction techniques for knowledge acquisition are known to produce noise, incomplete or incorrect facts from textual sources. Human computing offers a natural alternative to expand and complement the output of automated information extraction methods, thereby enabling us to build high-quality knowledge bases. However, relying solely on human inputs for extraction can be prohibitively expensive in practice. We demonstrate human computing games for knowledge acquisition that employ human computing to overcome the limitations in automated fact acquisition methods. We provide a combined approach that tightly integrates automated extraction techniques with human computing for effective gathering of facts. The methods we provide gather facts in the form of relationships between entities. The games we demonstrate are specifically designed to capture hard-to-extract relations between entities in narrative text -- a task that automated systems find challenging.
A tool for assisting provenance search in social media BIBAFull-Text 2517-2520
  Suhas Ranganath; Pritam Gundecha; Huan Liu
In recent years, social media sites are witnessing an information explosion. Determining the reliability of such a large amount of information is a major area of research. Information provenance (aka, sources or origin) provides a way to measure the reliability of information in social networks. The main challenge in seeking provenance is the availability of suitable data consisting of sufficient unique propagation paths. Knowledge of the actual propagation paths for a piece of information will be a valuable asset in provenance search. This paper presents a tool for capturing the propagation network of a given tweet or URL (Uniform Resource Locator) in the Twitter network. Researchers can use this tool to collect information propagation data, design effective strategies for determining the provenance, and gain information about the tweet such as impact, growth rate and users influencing the spread. Two case studies are presented to demonstrate the effectiveness of the system for seeking provenance information.
SPHINX: rich insights into evidence-hypotheses relationships via parameter space-based exploration BIBAFull-Text 2521-2524
  Abhishek Mukherji; Jason Whitehouse; Christopher R. Botaish; Elke A. Rundensteiner; Matthew O. Ward
We demonstrate our SPHINX system that not only derives but also visualizes evidence-hypotheses relationships on a parameter space of belief and plausibility. SPHINX helps the analyst interactively explore the contribution of different pieces of evidence towards the hypotheses. The key technical contributions of SPHINX include both computational and visual dimensions. The computational contributions cover (a.) flexible computational model selection; and (b.) real-time incremental strength computations. The visual contributions include (a.) sense-making over parameter space; (b.) filtering and abstraction options; (c.) novel visual displays such as evidence glyph and skyline views. Using two real datasets, we will demonstrate that the SPHINX system provides the analysts with rich insights into evidence-hypothesis relationships, facilitating the discovery and decision making process.
Search excavator: the knowledge discovery tool BIBAFull-Text 2525-2528
  Dmitri Danilov; Eero Vainikko
We present a knowledge discovery tool Search Excavator (SE) developed for detecting similar words in web documents ranked by overall usage frequency in American English. The SE prototype application is a web browser add-on developed to assist users in acquiring new knowledge in unknown domains and to help in posing more specific search queries. The SE is designed to discover similar but generally infrequent words with surrounding texts in browser web documents and then suggest found words as possible query keywords. This technique allows users to discover unknown data intersections and use less ambiguous queries to target the required documents. The SE concept is motivated by a number of ideas. The similar infrequent words in the texts of the relevant documents can include field specific terms and facts that can be unknown to the user. Suggesting such keywords can decrease the overall search time encouraging early learning by directing users to the new unknown relevant terms and facts in a search session with an ambiguous query. Finally, we present four demonstration scenarios from our small-scale qualitative user study of the SE tool.
ESTHETE: a news browsing system to visualize the context and evolution of news stories BIBAFull-Text 2529-2532
  Rahul Goyal; Ravee Malla; Amitabha Bagchi; Sameep Mehta; Maya Ramanath
Providing the history and context(s) of a news article that emerges in the middle of an evolving news story -- sometimes multiple news stories -- is a complex task. The complexity of the task is compounded by the fact that different users are interested in different contexts of the article, and it is impossible to guess what a particular user is most interested in. In this paper, we introduce ESTHETE, a system that provides rich context(s) (through what we call personalized flexible context extraction), by preprocessing and storing articles in a structured representation (directed graphs) that makes it easy for the user to explore different contexts. The advantage of this approach is that the incremental computational expense in incorporating new articles as they are published is minimal. Our system is available at: http://konfrap.com/esthete.
WordSeer: a knowledge synthesis environment for textual data BIBAFull-Text 2533-2536
  Aditi Muralidharan; Marti A. Hearst; Christopher Fan
We describe WordSeer, a tool whose goal is to help scholars and analysts discover patterns and formulate and test hypotheses about the contents of text collections, midway between what humanities scholars call a traditional "close read" and the new "distant read" or "culturomics" approach. To this end, WordSeer allows for highly flexible "slicing and dicing" (hence "sliding") across a text collection. The tool allows users to view text from different angles by selecting subsets of data, viewing those as visualizations, moving laterally to view other subsets of data, slicing into another view, expanding the viewed data by relaxing constraints, and so on. We illustrate the text sliding capabilities of the tool with examples from a case study in the field of humanities and social sciences -- an analysis of how U.S. perceptions of China and Japan changed over the last 30 years.

Panel discussion

Channeling the deluge: research challenges for big data and information systems BIBAFull-Text 2537-2538
  Paul Bennett; Lee Giles; Alon Halevy; Jiawei Han; Marti Hearst; Jure Leskovec
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from multiple disciplines, including database systems, data mining, information retrieval, human-computer interaction, and knowledge or information management.
   As a group of experienced researchers in academia and industry, we will present at this panel our visions on what should be the challenging research issues in this promising research frontier and hope to attract heated discussions and debates from the audience. We expect panelists with diverse backgrounds to raise different challenging research problems and exchange their views with each other and with the audience. A heated discussion may help young researchers understand the need for research in both industry and academia, invest their efforts in more important research issues, and make an impact on the development of new principles, methodologies, and technologies.

Co-located workshop summaries

AKBC 2013: third workshop on automated knowledge base construction BIBAFull-Text 2539-2540
  Fabian M. Suchanek; Sebastian Riedel; Sameer Singh; Partha P. Talukdar
The AKBC 2013 workshop aims to be a venue of excellence and vision in the area of knowledge base construction. This year's workshop will feature keynotes by ten leading researchers in the field, including from Google, Microsoft, Stanford, and CMU. The submissions focus on visionary ideas instead of on experimental evaluation. Nineteen accepted papers will be presented as posters, with nine exceptional papers also highlighted as spotlight talks. Thereby, the workshop aims to provide a vivid forum of discussion about the field of automated knowledge base construction.
DOLAP 2013 workshop summary BIBAFull-Text 2541-2542
  Ladjel Bellatreche; Alfredo Cuzzocrea; Il-Yeol Song
The ACM DOLAP workshop presents research on data warehousing and On-Line Analytical Processing (OLAP). The DOLAP 2013 program has three interesting sessions on Design and Exploitation of Social Data Warehouses, ETL and modeling and new trends, as well as a keynote talk on OLAP query processing and a panel on OLAP and DataWarehousing Technology in Big Data era.
Sixth workshop on exploiting semantic annotations in information retrieval (ESAIR'13) BIBAFull-Text 2543-2544
  Paul N. Bennett; Evgeniy Gabrilovich; Jaap Kamps; Jussi Karlgren
There is an increasing amount of structure on the web as a result of modern web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and have only begun to understand how these valuable semantic cues can be put to fruitful use. ESAIR'13 focuses on two of the most challenging aspects to address in the coming years. First, there is a need to include the currently emerging knowledge resources (such as DBpedia, Freebase) as an underlying semantic model giving access to an unprecedented scope and detail of factual information. Second, there is a need to include annotations beyond the topical dimension (think of sentiment, reading level, prerequisite level, etc.) that contain vital cues for matching the specific needs and profile of the searcher at hand.
2013 international workshop on computational scientometrics: theory and applications BIBAFull-Text 2545-2546
  Cornelia Caragea; C. Lee Giles; Lior Rokach; Xiaozhong Liu
The field of Scientometrics is concerned with the analysis of science and scientific research. As science advances, scientists around the world continue to produce large numbers of research articles, which provide the technological basis for worldwide collection, sharing, and dissemination of scientific discoveries. Research ideas are generally developed based on high quality citations. Understanding how research ideas emerge, evolve, or disappear as a topic, what is a good measure of quality of published works, what are the most promising areas of research, how authors connect and influence each other, who are the experts in a field, what works are similar, and who funds a particular research topic are some of the major foci of the rapidly emerging field of Scientometrics. Digital libraries and other databases that store research articles have become a medium for answering such questions. Citation analysis is used to mine large publication graphs in order to extract patterns in the data (e.g., citations per article) that can help measure the quality of a journal. Scientometrics, on the other hand, is used to mine graphs that link together multiple types of entities: authors, publications, conference venues, journals, institutions, etc., in order to assess the quality of science and answer complex questions such as those listed above. Tools such as maps of science that are built from digital libraries allow different categories of users to satisfy various needs, e.g., help researchers to easily access research results, identify relevant funding opportunities, and find collaborators. Moreover, the recent developments in data mining, machine learning, natural language processing, and information retrieval make it possible to transform the way we analyze research publications, funded proposals, patents, etc., on a web-wide scale.
Workshop summary for the 2013 international workshop on mining unstructured big data using natural language processing BIBFull-Text 2547-2548
  Xiaozhong Liu; Miao Chen; Ying Ding; Min Song
CloudDB 2013: fifth international workshop on cloud data management BIBAFull-Text 2549-2550
  Feifei Li; Xiaofeng Meng; Fusheng Wang; Cong Yu
The fifth ACM international workshop on cloud data management is held in San Francisco, California, USA on October 28, 2013 and co-located with the ACM 22nd Conference on Information and Knowledge Management (CIKM). The main objective of the workshop is to address the challenges of large scale data management based on the cloud computing infrastructure. The workshop brings together researchers and practitioners from cloud computing, distributed storage, query processing, parallel algorithms, data mining, and system analysis; all attendees share common research interests in maximizing performance, reducing the cost of cloud data management, and enlarging the scale of their endeavors. We have constructed an exciting program of four refereed papers and an invited keynote talk that will give participants a full dose of emerging research.
DUBMOD13: international workshop on data-driven user behavioral modelling and mining from social media BIBAFull-Text 2551-2552
  Jalal Mahmud; Jeffrey Nichols; Michelle X. Zhou; James Caverlee; John O'Donovan
Massive amounts of data are being generated on social media sites, such as Twitter and Facebook. These data can be used to better understand people (e.g., personality traits, perceptions, and preferences) and predict their behavior. As a result, a deeper understanding of users and their behavior can benefit a wide range of intelligent applications, such as advertising, social recommender systems, and personalized knowledge management. These applications will also benefit individual users themselves and optimize their experience across a wide variety of domains, such as retail, healthcare, and education. Since mining and understanding user behavior from social media often requires interdisciplinary effort, including machine learning, text mining, human-computer interaction, and social science, our workshop aims to bring together researchers and practitioners from multiple fields to discuss the creation of deeper models of individual users by mining the content that they publish and the social networking behavior that they exhibit.
PLEAD 2013: politics, elections and data BIBAFull-Text 2553-2554
  Ingmar Weber; Ana-Maria Popescu; Marco Pennacchiotti
What is the role of the internet in politics in general and during campaigns in particular? And what is the role of large amounts of user data in all of this?
   In the 2008 and 2012 U.S. presidential campaigns the Democrats were far more successful than the Republicans in utilizing online media for mobilization, co-ordination and fundraising. Year over year, social media and the Internet play a fundamental role in political campaigns. However, technical research in this area is still limited and fragmented. The goal of this workshop is to bring together researchers working at the intersection of social network analysis, computational social science and political science, to share and discuss their ideas in a common forum; and to inspire further developments in this growing, fascinating field.
DTMBIO 2013: international workshop on data and text mining in biomedical informatics BIBAFull-Text 2555-2556
  Atul Butte; Doheon Lee; Hua Xu; Min Song
The organizers of the ACM Seventh International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 13) are pleased to announce that the seventh DTMBIO will be held in conjunction with CIKM, one of the largest data management conferences. The major interests of DTMBIO are the state-of-the-art applications of data and text mining to biomedical research problems. DTMBIO 13 will be a forum for discussing and exchanging informatics-related techniques and problems in the context of biomedical research.
CIKM 2013 workshop on living labs for information retrieval evaluation BIBAFull-Text 2557-2558
  Krisztian Balog; David Elsweiler; Evangelos Kanoulas; Liadh Kelly; Mark D. Smucker
In the past few years the information retrieval (IR) community has been exploring ways to move further away from the Cranfield style evaluation paradigm, and make evaluations more 'realistic' (more centered on real users, their needs and behaviours). As part of this drive, living labs which involve and integrate users in the research process have been proposed. The Living Labs for Information Retrieval Evaluation workshop (LL'13) brings together for the first time people interested in progressing the living labs for IR evaluation methodology.
The first workshop on user engagement optimization BIBAFull-Text 2559-2560
  Liangjie Hong; Shuang-Hong Yang
Online user engagement optimization is key to many Internet businesses. Several research areas are related to the concept of online user engagement optimization, including machine learning, data mining, information retrieval, recommender systems, online A/B (bucket) testing and psychology. In the past, research efforts in this direction were pursued in separate communities and conferences, yielding potentially disconnected and repeated results. In addition, researchers and practitioners are sometimes only exposed to a specific aspect of the topic, which might be incomplete and suboptimal to the whole picture. Here, we organize the first workshop on the topic of online user engagement optimization, explicitly targeting the topic as a whole and bringing researchers and practitioners together to foster the field. We invite two leading researchers from industry to give keynote talks about online machine learning and online experimentation. In addition, several invited talks from industry and academic researchers cover the topics of content personalization, online experimental platforms and recommender systems. Also, six novel submissions are included as short papers in the workshop so that new results are discussed and shared at the workshop.
PIKM 2013: the 6th ACM workshop for Ph.D. students in information and knowledge management BIBAFull-Text 2561-2562
  Fabian M. Suchanek; Anisoara Nica
The PIKM workshop gives Ph.D. students an opportunity to present their dissertation proposals at a global stage. Similarly to the CIKM, the PIKM workshop covers a wide range of topics in the areas of databases, information retrieval and knowledge management. Interdisciplinary work across these tracks is particularly encouraged.
Web-KR 2013: the 4th international workshop on web-scale knowledge representation, retrieval and reasoning BIBAFull-Text 2563-2564
  Yi Zeng; Spyros Kotoulas; Zhisheng Huang
As a continuous effort for organizing discussions and providing possible theories and techniques to deal with the barriers for knowledge processing at Web scale, the 2013 International Workshop on Web-scale Knowledge Representation, Retrieval and Reasoning (Web-KR 2013) was held in conjunction with the 2013 ACM International Conference on Information and Knowledge Management (CIKM 2013) on November 1st, 2013 at Burlingame, CA, United States. This is the 4th version of the Web-KR workshop. As in previous workshops under the same title, accepted papers of this workshop cover many important topics in the field. This year, the contributions focus on multi-faceted understanding of Web knowledge sources, Web entity linking, deep Web knowledge acquisition, and Web-scale stream reasoning. Many new approaches are proposed to deal with these problems in the context of large scale Web resources. This summary introduces the major contributions of accepted papers in the Web-KR 2013 workshop.
Data management & analytics for healthcare (DARE 2013) BIBAFull-Text 2565-2566
  Ullas Nambiar; Niranjan Thirumale
Reducing healthcare costs and improving quality of outcomes is a challenge even in developed economies. Much information remains in paper form and lacks common standards; sharing is uncommon and frequently hampered by the lack of foolproof de-identification for patient privacy. All of these issues impede opportunities for data mining and analysis that would enable better predictive and preventive medicine. These issues are compounded in emerging economies due to geopolitical constraints, transportation and geographic barriers, a much more limited clinical workforce, and infrastructural challenges to delivery. Thus, simple, high-impact deliverable interventions such as universal childhood immunization and maternal childcare are hampered by poor monitoring and reporting systems. This workshop is focused on identifying challenges to be overcome for effectively delivering efficient healthcare to the masses. Specifically, we will provide a forum to discuss research directions and share experience and insights from both academia and industry. The anticipated outcome of the workshop is an assessment of the state of the art in the area, as well as identification of critical next steps to pursue in this topic.