
Proceedings of the 2015 International Conference on the World Wide Web

Fullname: Proceedings of the 24th International Conference on World Wide Web
Editors: Aldo Gangemi; Stefano Leonardi; Alessandro Panconesi
Location: Florence, Italy
Dates: 2015-May-18 to 2015-May-22
Standard No: ISBN: 978-1-4503-3469-3; ACM DL: Table of Contents; hcibib: WWW15-1
  1. WWW 2015-05-18 Volume 1
    1. Technical Papers
    2. Technical Papers 2

WWW 2015-05-18 Volume 1

Technical Papers

Optimizing Display Advertising in Online Social Networks 1-11
  Zeinab Abbassi; Aditya Bhaskara; Vishal Misra
Advertising is a significant source of revenue for most online social networks. Conventional online advertising methods need to be customized for online social networks in order to address their distinct characteristics. Recent experimental studies have shown that providing social cues along with ads, e.g. information about friends liking the ad or clicking on an ad, leads to higher click rates. In other words, the probability of a user clicking an ad is a function of the set of friends that have clicked the ad. In this work, we propose formal probabilistic models to capture this phenomenon, and study the algorithmic problem that then arises. Our work is in the context of display advertising where a contract is signed to show an ad to a pre-determined number of users. The problem we study is the following: given a certain number of impressions, what is the optimal display strategy, i.e. the optimal order and the subset of users to show the ad to, so as to maximize the expected number of clicks? Unlike previous models of influence maximization, we show that this optimization problem is hard to approximate in general, and that it is related to finding dense subgraphs of a given size. In light of the hardness result, we propose several heuristic algorithms including a two-stage algorithm inspired by influence-and-exploit strategies in viral marketing. We evaluate the performance of these heuristics on real data sets, and observe that our two-stage heuristic significantly outperforms the natural baselines.
Frankenplace: Interactive Thematic Mapping for Ad Hoc Exploratory Search 12-22
  Benjamin Adams; Grant McKenzie; Mark Gahegan
Ad hoc keyword search engines built using modern information retrieval methods do a good job of handling fine-grained queries. However, they perform poorly at facilitating spatial and spatially-embedded thematic exploration of the results, despite the fact that many queries, e.g. "civil war," refer to different documents and topics in different places. This is not for lack of data: geographic information, such as place names, events, and coordinates, is common in unstructured document collections on the web. The associations between geographic and thematic contents in these documents can provide a rich groundwork to organize information for exploratory research. In this paper we describe the architecture of an interactive thematic map search engine, Frankenplace, designed to facilitate document exploration at the intersection of theme and place. The map interface enables a user to zoom the geographic context of their query in and out, and quickly explore thousands of search results in a meaningful way. By combining topic models with geographically contextualized search results, users can also discover related topics based on geographic context. Frankenplace utilizes a novel indexing method called geoboost for boosting terms associated with cells on a discrete global grid. The resulting index factors in the geographic scale of the place or feature mentioned in related text, the relative textual scope of the place reference, and the overall importance of the containing document in the document network. The system is currently indexed with over 5 million documents from the web, including the English Wikipedia and online travel blog entries. We demonstrate that Frankenplace can support four distinct types of exploratory search tasks while being adaptive to scale and location of interest.
Towards Reconciling SPARQL and Certain Answers 23-33
  Shqiponja Ahmetaj; Wolfgang Fischl; Reinhard Pichler; Mantas Šimkus; Sebastian Skritek
SPARQL entailment regimes are strongly influenced by the large body of work on ontology-based query answering, notably in the area of Description Logics (DLs). However, the semantics of query answering under SPARQL entailment regimes is defined in a more naive and much less expressive way than the certain answer semantics usually adopted in DLs. The goal of this work is to introduce an intuitive certain answer semantics also for SPARQL and to show the feasibility of this approach. For OWL 2 QL entailment, we present algorithms for the evaluation of an interesting fragment of SPARQL (the so-called well-designed SPARQL). Moreover, we show that the complexity of the most fundamental query analysis tasks (such as query containment and equivalence testing) is not negatively affected by the presence of OWL 2 QL entailment under the proposed semantics.
Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org 34-44
  Tim Althoff; Jure Leskovec
Online crowdfunding platforms like DonorsChoose.org and Kickstarter allow specific projects to get funded by targeted contributions from a large number of people. Critical for the success of crowdfunding communities is recruitment and continued engagement of donors. With donor attrition rates above 70%, a significant challenge for online crowdfunding platforms as well as traditional offline non-profit organizations is the problem of donor retention. We present a large-scale study of millions of donors and donations on DonorsChoose.org, a crowdfunding platform for education projects. Studying an online crowdfunding platform allows for an unprecedented detailed view of how people direct their donations. We explore various factors impacting donor retention which allows us to identify different groups of donors and quantify their propensity to return for subsequent donations. We find that donors are more likely to return if they had a positive interaction with the receiver of the donation. We also show that this includes appropriate and timely recognition of their support as well as detailed communication of their impact. Finally, we discuss how our findings could inform steps to improve donor retention in crowdfunding communities and non-profit organizations.
Budget-Constrained Item Cold-Start Handling in Collaborative Filtering Recommenders via Optimal Design 45-54
  Oren Anava; Shahar Golan; Nadav Golbandi; Zohar Karnin; Ronny Lempel; Oleg Rokhlenko; Oren Somekh
It is well known that collaborative filtering (CF) based recommender systems provide better modeling of users and items associated with considerable rating history. The lack of historical ratings results in the user and the item cold-start problems. The latter is the main focus of this work. Most of the current literature addresses this problem by integrating content-based recommendation techniques to model the new item. However, in many cases such content is not available, and the question that arises is whether this problem can be mitigated using CF techniques only. We formalize this problem as an optimization problem: given a new item, a pool of available users, and a budget constraint, select which users to assign with the task of rating the new item in order to minimize the prediction error of our model. We show that the objective function is monotone-supermodular, and propose efficient optimal design based algorithms that attain an approximation to its optimum. Our findings are verified by an empirical study using the Netflix dataset, where the proposed algorithms outperform several baselines for the problem at hand.
Improved Theoretical and Practical Guarantees for Chromatic Correlation Clustering 55-65
  Yael Anava; Noa Avigdor-Elgrabli; Iftah Gamzu
We study a natural generalization of the correlation clustering problem to graphs in which the pairwise relations between objects are categorical instead of binary. This problem was recently introduced by Bonchi et al. under the name of chromatic correlation clustering, and is motivated by many real-world applications in data mining and social networks, including community detection, link classification, and entity de-duplication. Our main contribution is a fast and easy-to-implement constant approximation framework for the problem, which builds on a novel reduction of the problem to that of correlation clustering. This result significantly progresses the current state of knowledge for the problem, improving on a previous result that only guaranteed linear approximation in the input size. We complement the above result by developing a linear programming-based algorithm that achieves an improved approximation ratio of 4. Although this algorithm cannot be considered to be practical, it further extends our theoretical understanding of chromatic correlation clustering. We also present a fast heuristic algorithm that is motivated by real-life scenarios in which there is a ground-truth clustering that is obscured by noisy observations. We test our algorithms on both synthetic and real datasets, such as social network data. Our experiments reinforce the theoretical findings by demonstrating that our algorithms generally outperform previous approaches, both in terms of solution cost and reconstruction of an underlying ground-truth clustering.
Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily 66-76
  Ashton Anderson; Daniel Huttenlocher; Jon Kleinberg; Jure Leskovec; Mitul Tiwari
Many of the world's most popular websites catalyze their growth through invitations from existing members. New members can then in turn issue invitations, and so on, creating cascades of member signups that can spread on a global scale. Although these diffusive invitation processes are critical to the popularity and growth of many websites, they have rarely been studied, and their properties remain elusive. For instance, it is not known how viral these cascade structures are, how cascades grow over time, or how diffusive growth affects the resulting distribution of member characteristics present on the site. In this paper, we study the diffusion of LinkedIn, an online professional network comprising over 332 million members, a large fraction of whom joined the site as part of a signup cascade. First we analyze the structural patterns of these signup cascades, and find them to be qualitatively different from previously studied information diffusion cascades. We also examine how signup cascades grow over time, and observe that diffusion via invitations on LinkedIn occurs over much longer timescales than are typically associated with other types of online diffusion. Finally, we connect the cascade structures with rich individual-level attribute data to investigate the interplay between the two. Using novel techniques to study the role of homophily in diffusion, we find striking differences between the local, edge-wise homophily and the global, cascade-level homophily we observe in our data, suggesting that signup cascades form surprisingly coherent groups of members.
Recommendation Subgraphs for Web Discovery 77-87
  Arda Antikacioglu; R. Ravi; Srinath Sridhar
Recommendations are central to the utility of many popular e-commerce websites. Such sites typically contain a set of recommendations on every product page that enables visitors and crawlers to easily navigate the website. These recommendations are essentially universally present on all e-commerce websites. Choosing an appropriate set of recommendations at each page is a critical task performed by dedicated backend software systems. We formalize the concept of recommendations used for discovery as a natural graph optimization problem on a bipartite graph and propose three methods for solving the problem in increasing order of sophistication: a local random sampling algorithm, a greedy algorithm and a more involved partitioning-based algorithm. We first theoretically analyze the performance of these three methods on random graph models and characterize when each method will yield a solution of sufficient quality and the parameter ranges when more sophistication is needed. We complement this by providing an empirical analysis of these algorithms on simulated and real-world production data from a retail website. Our results confirm that it is not always necessary to implement complicated algorithms in the real world, and demonstrate that very good practical results can be obtained by using simple heuristics that are backed by the confidence of concrete theoretical guarantees.
Is Sniping A Problem For Online Auction Markets? 88-96
  Matt Backus; Thomas Blake; Dimitriy V. Masterov; Steven Tadelis
A common complaint about online auctions for consumer goods is the presence of "snipers," who place bids in the final seconds of sequential ascending auctions with predetermined ending times. The literature conjectures that snipers are best-responding to the existence of "incremental" bidders that bid up to their valuation only as they are outbid. Snipers aim to catch these incremental bidders at a price below their reserve, with no time to respond. As a consequence, these incremental bidders may experience regret when they are outbid at the last moment at a price below their reservation value. We measure the effect of this experience on a new buyer's propensity to participate in future auctions. We show the effect to be causal using a carefully selected subset of auctions from eBay.com and an instrumental variables estimation strategy. Bidders respond to sniping quite strongly and are between 4 and 18 percent less likely to return to the platform.
Essential Web Pages Are Easy to Find 97-107
  Ricardo Baeza-Yates; Paolo Boldi; Flavio Chierichetti
In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries -- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ≈ 0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answer 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal -- as a first example, our preliminary experimental study confirms that our algorithm performs extremely well on other (social network based) MaxCover instances.
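The greedy strategy mentioned in this abstract can be illustrated with a minimal sketch of the textbook greedy heuristic for MaxCover. It is not the authors' implementation and does not include their optimality check; the function name `greedy_max_cover`, the toy page and query identifiers, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def greedy_max_cover(
    pages: Dict[str, Set[str]], k: int
) -> Tuple[List[str], Set[str]]:
    """Pick up to k pages, each time taking the page that covers
    the largest number of still-uncovered queries."""
    remaining = dict(pages)      # pages not yet chosen
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        best_page = None
        best_gain = 0
        # Sorted iteration makes tie-breaking deterministic (hypothetical choice).
        for page in sorted(remaining):
            gain = len(remaining[page] - covered)
            if gain > best_gain:
                best_page, best_gain = page, gain
        if best_page is None:    # no page adds new coverage
            break
        chosen.append(best_page)
        covered |= remaining.pop(best_page)
    return chosen, covered


# Hypothetical toy instance: each page maps to the queries it answers.
toy_pages = {
    "p1": {"q1", "q2", "q3"},
    "p2": {"q3", "q4"},
    "p3": {"q4", "q5"},
    "p4": {"q1"},
}
chosen, covered = greedy_max_cover(toy_pages, k=2)
print(chosen, sorted(covered))
```

On this toy instance the greedy choice happens to cover every query with two pages; in general the standard guarantee for this heuristic is only the 1-1/e factor quoted in the abstract.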
Design and Analysis of Benchmarking Experiments for Distributed Internet Services 108-118
  Eytan Bakshy; Eitan Frachtenberg
The successful development and deployment of large-scale Internet services depends critically on performance. Even small regressions in processing time can translate directly into significant energy and user experience costs. Despite the widespread use of distributed server infrastructure (e.g., in cloud computing and Web services), there is little research on how to benchmark such systems to obtain valid and precise inferences with minimal data collection costs. Correctly A/B testing distributed Internet services can be surprisingly difficult because interdependencies between user requests (e.g., for search results, social media streams, photos) and host servers violate assumptions required by standard statistical tests. We develop statistical models of distributed Internet service performance based on data from Perflab, a production system used at Facebook which vets thousands of changes to the company's codebase each day. We show how these models can be used to understand the tradeoffs between different benchmarking routines, and what factors must be taken into account when performing statistical tests. Using simulations and empirical data from Perflab, we validate our theoretical results, and provide easy-to-implement guidelines for designing and analyzing such benchmarks.
ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly 119-129
  Alex Beutel; Amr Ahmed; Alexander J. Smola
Matrix completion and approximation are popular tools to capture a user's preferences for recommendation and to approximate missing data. Instead of using low-rank factorization we take a drastically different approach, based on the simple insight that an additive model of co-clusterings allows one to approximate matrices efficiently. This allows us to build a concise model that, per bit of model learned, significantly beats all factorization approaches in matrix completion. Even more surprisingly, we find that summing over small co-clusterings is more effective in modeling matrices than classic co-clustering, which uses just one large partitioning of the matrix. Following Occam's razor principle, the fact that our model is more concise and yet just as accurate as more complex models suggests that it better captures the latent preferences and decision making processes present in the real world. We provide an iterative minimization algorithm, a collapsed Gibbs sampler, theoretical guarantees for matrix approximation, and excellent empirical evidence for the efficacy of our approach. We achieve state-of-the-art results for matrix completion on Netflix at a fraction of the model complexity.
Who, What, When, and Where: Multi-Dimensional Collaborative Recommendations Using Tensor Factorization on Sparse User-Generated Data 130-140
  Preeti Bhargava; Thomas Phan; Jiayu Zhou; Juhan Lee
Given the abundance of online information available to mobile users, particularly tourists and weekend travelers, recommender systems that effectively filter this information and suggest interesting participatory opportunities will become increasingly important. Previous work has explored recommending interesting locations; however, users would also benefit from recommendations for activities in which to participate at those locations along with suitable times and days. Thus, systems that provide collaborative recommendations involving multiple dimensions such as location, activities and time would enhance the overall experience of users. The relationship among these dimensions can be modeled by higher-order matrices called tensors which are then solved by tensor factorization. However, these tensors can be extremely sparse. In this paper, we present a system and an approach for performing multi-dimensional collaborative recommendations for Who (User), What (Activity), When (Time) and Where (Location), using tensor factorization on sparse user-generated data. We formulate an objective function which simultaneously factorizes coupled tensors and matrices constructed from heterogeneous data sources. We evaluate our system and approach on large-scale real-world data sets consisting of 588,000 Flickr photos collected from three major metro regions in the USA. We compare our approach with several state-of-the-art baselines and demonstrate that it outperforms all of them.
Secrets, Lies, and Account Recovery: Lessons from the Use of Personal Knowledge Questions at Google 141-150
  Joseph Bonneau; Elie Bursztein; Ilan Caron; Rob Jackson; Mike Williamson
We examine the first large real-world data set on personal knowledge questions' security and memorability from their deployment at Google. Our analysis confirms that secret questions generally offer a security level that is far lower than user-chosen passwords. It turns out to be even lower than proxies such as the real distribution of surnames in the population would indicate. Surprisingly, we found that a significant cause of this insecurity is that users often don't answer truthfully. A user survey we conducted revealed that a significant fraction of users (37%) who admitted to providing fake answers did so in an attempt to make them "harder to guess," although on aggregate this behavior had the opposite effect, as people "harden" their answers in the same, predictable way. On the usability side, we show that secret answers have surprisingly poor memorability despite the assumption that their reliability motivates their continued deployment. From millions of account recovery attempts we observed that a significant fraction of users (e.g., 40% of our English-speaking US users) were unable to recall their answers when needed. This is lower than the success rate of alternative recovery mechanisms such as SMS reset codes (over 80%). Comparing question strength and memorability reveals that the questions that are potentially the most secure (e.g., "What is your first phone number?") are also the ones with the worst memorability. We conclude that it appears next to impossible to find secret questions that are both secure and memorable. Secret questions continue to have some use when combined with other signals, but they should not be used alone and best practice should favor more reliable alternatives.
Supporting Ethical Web Research: A New Research Ethics Review 151-161
  Anne Bowser; Janice Y. Tsai
Research ethics is an important and timely topic. In academia, federally regulated Institutional Review Boards (IRBs) protect participants of human subjects research, and offer researchers a mechanism to assess the ethical implications of their work. Industry research labs are not subject to the same requirements, and may lack processes for research ethics review. We describe the creation of a new ethics framework and a research ethics submission system (RESS) within Microsoft Research (MSR). This RESS is customized to the needs of web researchers. We describe our iterative development process: we assess the current state of web research, developing a framework of methods based on a survey of 358 research papers; build and evaluate our system with 14 users to identify the benefits and pitfalls of full deployment; evaluate how our system matches existing federal regulations; and suggest next steps for supporting ethical web research.
Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing 162-172
  Aniket Chakrabarti; Srinivasan Parthasarathy
All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure-of-interest. When the number of points or dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index based approaches. BayesLSH is used for pruning the candidate space and computation of approximate similarity, whereas BayesLSHLite can only prune the candidates, but similarity needs to be computed exactly on the original data. Thus wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much on quality. However, the loss in quality is higher in the BayesLSH variant, where explicit data representation is not available, rather only a hash sketch is available and similarity has to be estimated approximately. In this work we revisit the LSH problem from a Frequentist setting and formulate sequential tests for composite hypotheses (similarity greater than or less than threshold) that can be leveraged by such LSH algorithms for adaptively pruning candidates aggressively. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed using a fixed-width sequential confidence interval generation technique. We compare these novel variants with the SPRT variant and BayesLSH/BayesLSHLite variants and show that they can provide tighter qualitative guarantees over BayesLSH/BayesLSHLite -- a state-of-the-art approach -- while being up to 2.1x faster than a traditional SPRT and 8.8x faster than AllPairs.
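As a rough illustration of the sequential probability ratio test named in this abstract, the sketch below runs Wald's classical SPRT on a stream of binary hash-agreement indicators. It is a simplification, not the paper's method: it tests two simple hypotheses (collision probability `p0` versus `p1`) rather than the composite hypotheses the authors formulate, and the function name, the labels "similar"/"dissimilar", and the default error rates are assumptions for illustration only.

```python
import math
from typing import Iterable, Tuple


def sprt_bernoulli(
    matches: Iterable[int],
    p0: float,
    p1: float,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> Tuple[str, int]:
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 (with p0 < p1).

    Each element of `matches` is 1 if two objects agree on a hash
    and 0 otherwise. Returns the decision and the number of hashes read.
    """
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts H0
    log_ratio = 0.0
    used = 0
    for match in matches:
        used += 1
        if match:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p0))
        if log_ratio >= upper:
            return "similar", used
        if log_ratio <= lower:
            return "dissimilar", used
    return "undecided", used   # stream ended before either boundary


# Hypothetical streams: a pair that always collides, and one that never does.
print(sprt_bernoulli([1] * 50, p0=0.5, p1=0.9))
print(sprt_bernoulli([0] * 50, p0=0.5, p1=0.9))
```

The point of the sequential form is that clearly similar or clearly dissimilar pairs stop after a handful of hash comparisons, which is the kind of adaptive candidate pruning the abstract describes.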
Opinion Spam Detection in Web Forum: A Real Case Study 173-183
  Yu-Ren Chen; Hsin-Hsi Chen
Opinion spamming refers to the illegal marketing practice which involves delivering commercially advantageous opinions as regular users. In this paper, we conduct a real case study based on a set of internal records of opinion spams leaked from a shady marketing campaign. We explore the characteristics of opinion spams and spammers in a web forum to obtain some insights, including the subtlety property of opinion spams, spam post ratio, spammer accounts, first post and replies, submission time of posts, activeness of threads, and collusion among spammers. Then we present features that could be potentially helpful in detecting spam opinions in threads. The results of spam detection on first posts show: (1) spam first posts put more focus on certain topics such as users' experiences with the promoted items, (2) spam first posts generally use more words and pictures to showcase the promoted items in an attempt to impress people, (3) spam first posts tend to be submitted during work time, and (4) the threads that spam first posts initiate are more active to be placed at striking positions. The spam detection on replies is more challenging. Besides a lower spam ratio and less content, replies do not even mention the promoted items. Their major intention is to keep the discussion in a thread alive to attract more attention to it. Submission time of replies, thread activeness, position of replies, and spamicity of first post are more useful than content-based features in spam detection on replies.
Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking 184-194
  Gong Cheng; Danyun Xu; Yuzhong Qu
Entity linking connects the Web of documents with knowledge bases. It is the task of linking an entity mention in text to its corresponding entity in a knowledge base. Whereas a large body of work has been devoted to automatically generating candidate entities, or ranking and choosing from them, manual efforts are still needed, e.g., for defining gold-standard links for evaluating automatic approaches, and for improving the quality of links in crowdsourcing approaches. However, structured descriptions of entities in knowledge bases are sometimes very long. To avoid overloading human users with too much information and help them more efficiently choose an entity from candidates, we aim to substitute entire entity descriptions with compact, equally effective structured summaries that are automatically generated. To achieve this, our approach analyzes entity descriptions in the knowledge base and the context of the entity mention from multiple perspectives, including characterizing and differentiating power, information overlap, and relevance to context. Extrinsic evaluation (where human users carry out entity linking tasks) and intrinsic evaluation (where human users rate summaries) demonstrate that summaries generated by our approach help human users carry out entity linking tasks more efficiently (22-23% faster), without significantly affecting the quality of links obtained; and our approach outperforms existing approaches to summarizing entity descriptions.
Semantic Tagging of Mathematical Expressions 195-204
  Pao-Yu Chien; Pu-Jen Cheng
Semantic tagging of mathematical expressions (STME) gives semantic meanings to tokens in mathematical expressions. In this work, we propose a novel STME approach that relies on neither the text accompanying expressions nor labelled training data. Instead, our method only requires a mathematical grammar set. We point out that, besides the grammar of mathematics, the special property of variables and user habits of writing expressions help us understand the implicit intents of the user. We build a system that considers both restrictions from the grammar and variable properties, and then apply an unsupervised method to our probabilistic model to learn the user habits. To evaluate our system, we build large-scale training and test datasets automatically from a public math forum. The results demonstrate the significant improvement of our method, compared to the maximum-frequency baseline. We also create statistics to reveal the properties of mathematical language.
Collaborative Ranking with a Push at the Top 205-215
  Konstantina Christakopoulou; Arindam Banerjee
The goal of collaborative filtering is to get accurate recommendations at the top of the list for a set of users. From such a perspective, collaborative ranking based formulations with suitable ranking loss functions are natural. While recent literature has explored the idea based on objective functions such as NDCG or Average Precision, such objectives are difficult to optimize directly. In this paper, building on recent advances from the learning to rank literature, we introduce a novel family of collaborative ranking algorithms which focus on accuracy at the top of the list for each user while learning the ranking functions collaboratively. We consider three specific formulations, based on collaborative p-norm push, infinite push, and reverse-height push, and propose efficient optimization methods for learning these models. Experimental results illustrate the value of collaborative ranking, and show that the proposed methods are competitive, usually better than existing popular approaches to personalized recommendation.
Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications 216-226
  Christopher Michael De Vries; Lance De Vine; Shlomo Geva; Richi Nayak
The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine grained clustering has not been previously demonstrated. Previous approaches clustered a sample that limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show an improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and unfeasible.
Network-based Origin Confusion Attacks against HTTPS Virtual Hosting BIBAFull-Text 227-237
  Antoine Delignat-Lavaud; Karthikeyan Bhargavan
We investigate current deployment practices for virtual hosting, a widely used method for serving multiple HTTP and HTTPS origins from the same server, in popular content delivery networks, cloud-hosting infrastructures, and web servers. Our study uncovers a new class of HTTPS origin confusion attacks: when two virtual hosts use the same TLS certificate, or share a TLS session cache or ticket encryption key, a network attacker may cause a page from one of them to be loaded under the other's origin in a client browser. These attacks appear when HTTPS servers are configured to allow virtual host fallback from a client-requested, secure origin to some other unexpected, less-secure origin. We present evidence that such vulnerable virtual host configurations are widespread, even on the most popular and security-scrutinized websites, thus allowing a network adversary to hijack pages, or steal secure cookies and single sign-on tokens. To prevent our virtual host confusion attacks and recover the isolation guarantees that are commonly assumed in shared hosting environments, we propose fixes to web server software and advocate conservative configuration guidelines for the composition of HTTP with TLS.
The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk BIBAFull-Text 238-247
  Djellel Eddine Difallah; Michele Catasta; Gianluca Demartini; Panagiotis G. Ipeirotis; Philippe Cudré-Mauroux
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc. In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform). (B) We leverage the main findings of our five year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Hierarchical Neural Language Models for Joint Representation of Streaming Documents and their Content BIBAFull-Text 248-255
  Nemanja Djuric; Hao Wu; Vladan Radosavljevic; Mihajlo Grbovic; Narayan Bhamidipati
We consider the problem of learning distributed representations for documents in data streams. The documents are represented as low-dimensional vectors and are jointly learned with distributed vector representations of word tokens using a hierarchical framework with two embedded neural language models. In particular, we exploit the context of documents in streams and use one of the language models to model the document sequences, and the other to model word sequences within them. The models learn continuous vector representations for both word tokens and documents such that semantically similar documents and words are close in a common vector space. We discuss extensions to our model, which can be applied to personalized recommendation and social relationship mining by adding further user layers to the hierarchy, thus learning user-specific vectors to represent individual preferences. We validated the learned representations on a public movie rating data set from MovieLens, as well as on a large-scale Yahoo News data comprising three months of user activity logs collected on Yahoo servers. The results indicate that the proposed model can learn useful representations of both documents and word tokens, outperforming the current state-of-the-art by a large margin.
Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments BIBAFull-Text 256-266
  Alexey Drutsa; Gleb Gusev; Pavel Serdyukov
Modern Internet companies improve their services by means of data-driven decisions that are based on online controlled experiments (also known as A/B tests). To run more online controlled experiments and to get statistically significant results faster are the emerging needs for these companies. The main way to achieve these goals is to improve the sensitivity of A/B experiments. We propose a novel approach to improve the sensitivity of user engagement metrics (that are widely used in A/B tests) by utilizing prediction of the future behavior of an individual user. This problem of prediction of the exact value of a user engagement metric is also novel and is studied in our work. We demonstrate the effectiveness of our sensitivity improvement approach on several real online experiments run at Yandex. Especially, we show how it can be used to detect the treatment effect of an A/B test faster with the same level of statistical significance.
Enriching Structured Knowledge with Open Information BIBAFull-Text 267-277
  Arnab Dutta; Christian Meilicke; Heiner Stuckenschmidt
We propose an approach for semantifying web extracted facts. In particular, we map subject and object terms of these facts to instances; and relational phrases to object properties defined in a target knowledge base. By doing this we resolve the ambiguity inherent in the web extracted facts, while simultaneously enriching the target knowledge base with a significant number of new assertions. In this paper, we focus on the mapping of the relational phrases in the context of the overall workflow. Furthermore, in an open extraction setting identical semantic relationships can be represented by different surface forms, making it necessary to group these surface forms together. To solve this problem we propose the use of Markov clustering. In this work we present a complete, ontology independent, generalized workflow which we evaluate on facts extracted by Nell and Reverb. Our target knowledge base is DBpedia. Our evaluation shows promising results in terms of producing highly precise facts. Moreover, the results indicate that the clustering of relational phrases pays off in terms of an improved instance and property mapping.
A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems BIBAFull-Text 278-288
  Ali Mamdouh Elkahky; Yang Song; Xiaodong He
Recent online services rely heavily on automatic personalization to recommend relevant content to a large number of users. This requires systems to scale promptly to accommodate the stream of new users visiting the online services for the first time. In this work, we propose a content-based recommendation system to address both the recommendation quality and the system scalability. We propose to use a rich feature set to represent users, according to their web browsing history and search queries. We use a Deep Learning approach to map users and items to a latent space where the similarity between users and their preferred items is maximized. We extend the model to jointly learn from features of items from different domains and user features by introducing a multi-view Deep Learning model. We show how to make this rich-feature based user representation scalable by reducing the dimension of the inputs and the amount of training data. The rich user feature representation allows the model to learn relevant user behavior patterns and give useful recommendations for users who do not have any interaction with the service, given that they have adequate search and browsing history. The combination of different domains into a single model for learning helps improve the recommendation quality across all the domains, as well as having a more compact and a semantically richer user latent feature vector. We experiment with our approach on three real-world recommendation systems acquired from different sources of Microsoft products: Windows Apps recommendation, News recommendation, and Movie/TV recommendation. Results indicate that our approach is significantly better than the state-of-the-art algorithms (up to 49% enhancement on existing users and 115% enhancement on new users). In addition, experiments on a publicly open data set also indicate the superiority of our method in comparison with transitional generative topic models, for modeling cross-domain recommender systems. 
Scalability analysis shows that our multi-view DNN model can easily scale to encompass millions of users and billions of item entries. Experimental results also confirm that combining features from all domains produces much better performance than building separate models for each domain.
Cookies That Give You Away: The Surveillance Implications of Web Tracking BIBAFull-Text 289-299
  Steven Englehardt; Dillon Reisman; Christian Eubank; Peter Zimmerman; Jonathan Mayer; Arvind Narayanan; Edward W. Felten
We study the ability of a passive eavesdropper to leverage "third-party" HTTP tracking cookies for mass surveillance. If two web pages embed the same tracker which tags the browser with a unique cookie, then the adversary can link visits to those pages from the same user (i.e., browser instance) even if the user's IP address varies. Further, many popular websites leak a logged-in user's identity to an eavesdropper in unencrypted traffic. To evaluate the effectiveness of our attack, we introduce a methodology that combines web measurement and network measurement. Using OpenWPM, our web privacy measurement platform, we simulate users browsing the web and find that the adversary can reconstruct 62-73% of a typical user's browsing history. We then analyze the effect of the physical location of the wiretap as well as legal restrictions such as the NSA's "one-end foreign" rule. Using measurement units in various locations -- Asia, Europe, and the United States -- we show that foreign users are highly vulnerable to the NSA's dragnet surveillance due to the concentration of third-party trackers in the U.S. Finally, we find that some browser-based privacy tools mitigate the attack while others are largely ineffective.
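The linking step described in this abstract (two page visits are attributed to the same browser when they carry the same tracker cookie) can be illustrated with a small union-find sketch. This is only an illustration under stated assumptions, not the authors' OpenWPM-based methodology: the function name `link_visits`, the `(page, cookie_ids)` input shape, and the example hostnames and cookie values are all hypothetical.

```python
from collections import defaultdict

def link_visits(observations):
    """Cluster observed page visits that share at least one tracker cookie.

    observations: list of (page, cookie_ids) pairs, where cookie_ids is an
    iterable of opaque identifier strings seen in the unencrypted request.
    Returns a list of page clusters, largest first.
    """
    parent = list(range(len(observations)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # cookie id -> index of the first visit carrying it
    for idx, (_, cookies) in enumerate(observations):
        for cookie in cookies:
            if cookie in first_seen:
                # Same cookie seen before: merge the two visits' clusters.
                parent[find(idx)] = find(first_seen[cookie])
            else:
                first_seen[cookie] = idx

    clusters = defaultdict(list)
    for idx, (page, _) in enumerate(observations):
        clusters[find(idx)].append(page)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical observations: the IP address plays no role in the linking.
observations = [
    ("news.example", ["trackerA=111"]),
    ("shop.example", ["trackerA=111", "trackerB=222"]),
    ("blog.example", ["trackerB=222"]),
    ("other.example", ["trackerA=999"]),
]
print(link_visits(observations))
```

Note that the third visit is linked to the first only transitively, through the second visit, which carries both cookies.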
Efficient Densest Subgraph Computation in Evolving Graphs BIBAFull-Text 300-310
  Alessandro Epasto; Silvio Lattanzi; Mauro Sozio
Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of a current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a dense subgraph every time the graph changes requires Omega(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
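For context on the "naive" baseline mentioned in this abstract, the following sketch shows the classical static greedy peeling procedure, which repeatedly removes a minimum-degree node and keeps the densest intermediate subgraph (density measured as edges divided by nodes). It is known to give a 2-approximation in the static setting. It is not the dynamic algorithm of the paper, and the function name and the simple quadratic-time implementation are assumptions made for illustration.

```python
def greedy_densest_subgraph(edges):
    """Static greedy peeling: return (node set, density) of the best subgraph found.

    edges: iterable of (u, v) pairs of an undirected simple graph.
    Density is |E| / |V| of the induced subgraph.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    num_edges = sum(len(neighbors) for neighbors in adj.values()) // 2
    remaining = set(adj)
    best_nodes, best_density = set(), 0.0

    while remaining:
        density = num_edges / len(remaining)
        if density > best_density:
            best_nodes, best_density = set(remaining), density
        # Peel off a node of minimum current degree (linear scan for simplicity).
        u = min(remaining, key=lambda node: len(adj[node]))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        adj[u] = set()
        remaining.remove(u)

    return best_nodes, best_density

# A 4-clique with one pendant node attached: the clique is the densest part.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
nodes, density = greedy_densest_subgraph(edges)
print(sorted(nodes), density)
```

Rerunning this procedure after every edge update is what makes the baseline expensive, since each run touches the whole graph.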
A Practical Framework for Privacy-Preserving Data Analytics BIBAFull-Text 311-321
  Liyue Fan; Hongxia Jin
The availability of an increasing amount of user generated data is transformative to our society. We enjoy the benefits of analyzing big data for public interest, such as disease outbreak detection and traffic control, as well as for commercial interests, such as smart grid and product recommendation. However, the large collection of user generated data contains unique patterns and can be used to re-identify individuals, which has been exemplified by the AOL search log release incident. In this paper, we propose a practical framework for data analytics, while providing differential privacy guarantees to individual data contributors. Our framework generates differentially private aggregates which can be used to perform data mining and recommendation tasks. To alleviate the high perturbation errors introduced by the differential privacy mechanism, we present two methods with different sampling techniques to draw a subset of individual data for analysis. Empirical studies with real-world data sets show that our solutions enable accurate data analytics on a small fraction of the input data, reducing user privacy risk and data storage requirement without compromising the analysis results.
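As a rough illustration of what a differentially private aggregate looks like, the sketch below adds Laplace noise to per-category counts. The abstract does not name the noise mechanism or describe the two sampling methods, so the use of the standard Laplace mechanism, the function names, the example counts, and the assumption that each user contributes to at most one count (sensitivity 1) are all assumptions here, not the paper's framework.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_counts(counts, epsilon, sensitivity=1.0, rng=random):
    """Return noisy counts under the Laplace mechanism.

    Assumes each individual changes at most one count by at most
    `sensitivity`, so noise of scale sensitivity / epsilon is added per count.
    """
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}

# Hypothetical aggregate: number of users per reported symptom.
rng = random.Random(0)
print(private_counts({"flu": 120, "cold": 45}, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives stronger privacy but larger noise, which is the perturbation error the abstract refers to.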
Compressed Indexes for String Searching in Labeled Graphs BIBAFull-Text 322-332
  Paolo Ferragina; Francesco Piccinno; Rossano Venturini
Storing and searching large labeled graphs is indeed becoming a key issue in the design of space/time efficient online platforms indexing modern social networks or knowledge graphs. But, as far as we know, all these results are limited to design compressed graph indexes which support basic access operations onto the link structure of the input graph, such as: given a node u, return the adjacency list of u. This paper takes inspiration from the Facebook Unicorn's platform and proposes some compressed-indexing schemes for large graphs whose nodes are labeled with strings of variable length -- i.e., node's attributes such as user's (nick-)name -- that support sophisticated search operations which involve both the linked structure of the graph and the string content of its nodes.
   An extensive experimental evaluation over real social networks will show the time and space efficiency of the proposed indexing schemes and their query processing algorithms.
Improving Paid Microtasks through Gamification and Adaptive Furtherance Incentives BIBAFull-Text 333-343
  Oluwaseyi Feyisetan; Elena Simperl; Max Van Kleek; Nigel Shadbolt
Crowdsourcing via paid microtasks has been successfully applied in a plethora of domains and tasks. Previous efforts for making such crowdsourcing more effective have considered aspects as diverse as task and workflow design, spam detection, quality control, and pricing models. Our work expands upon such efforts by examining the potential of adding gamification to microtask interfaces as a means of improving both worker engagement and effectiveness. We run a series of experiments in image labeling, one of the most common use cases for microtask crowdsourcing, and analyse worker behavior in terms of number of images completed, quality of annotations compared against a gold standard, and response to financial and game-specific rewards. Each experiment studies these parameters in two settings: one based on a state-of-the-art, non-gamified task on CrowdFlower and another one using an alternative interface incorporating several game elements. Our findings show that gamification leads to better accuracy and lower costs than conventional approaches that use only monetary incentives. In addition, it seems to make paid microtask work more rewarding and engaging, especially when sociality features are introduced. Following these initial insights, we define a predictive model for estimating the most appropriate incentives for individual workers, based on their previous contributions. This allows us to build a personalised game experience, with gains seen on the volume and quality of work completed.
Tagging Personal Photos with Transfer Deep Learning BIBAFull-Text 344-354
  Jianlong Fu; Tao Mei; Kuiyuan Yang; Hanqing Lu; Yong Rui
The advent of mobile devices and media cloud services has led to the unprecedented growing of personal photo collections. One of the fundamental problems in managing the increasing number of photos is automatic image tagging. Existing research has predominantly focused on tagging general Web images with a well-labelled image database, e.g., ImageNet. However, they can only achieve limited success on personal photos due to the domain gaps between personal photos and Web images. These gaps originate from the differences in semantic distribution and visual appearance. To deal with these challenges, in this paper, we present a novel transfer deep learning approach to tag personal photos. Specifically, to solve the semantic distribution gap, we have designed an ontology consisting of a hierarchical vocabulary tailored for personal photos. This ontology is mined from 10,000 active users in Flickr with 20 million photos and 2.7 million unique tags. To deal with the visual appearance gap, we discover the intermediate image representations and ontology priors by deep learning with bottom-up and top-down transfers across two domains, where Web images are the source domain and personal photos are the target. Moreover, we present two modes (single and batch-modes) in tagging and find that the batch-mode is highly effective to tag photo collections. We conducted personal photo tagging on 7,000 real personal photos and personal photo search on the MIT-Adobe FiveK photo dataset. The proposed tagging approach is able to achieve a performance gain of 12.8% and 4.5% in terms of NDCG@5, against the state-of-the-art hand-crafted feature-based and deep learning-based methods, respectively.
MobInsight: On Improving The Performance of Mobile Apps in Cellular Networks BIBAFull-Text 355-365
  Vijay Gabale; Dilip Krishnaswamy
It is well-known that the performance of Web-browsing as well as mobile applications (or apps) suffers on today's cellular networks. In this work, we perform a systematic measurement study of more than 50 popular apps and 2 cellular networks, and discover that while cellular networks have predictable latency, it is the path between exit points of cellular networks (e.g., GGSN) and cloud-servers that degrades app performance. High latency and unpredictability over this path affect browsing and activity completion times of apps, worsening the performance by several magnitudes. Furthermore, we find that as the number of apps on mobile devices increases, cellular networks in turn suffer due to the large number of active connections, primarily used for push notifications, experiencing heavy signaling overhead in the network. Towards accelerating the performance of apps and improving their operational efficiency, we envision an easy to deploy operator-managed platform, and study two architectural optimizations that sit at vantage points inside cellular networks: virtual app-server (vApp) and network-assisted, virtual push-notification server (vPNS). vApps improve apps' browsing experience while vPNSs take the burden of carrying periodic messages off cellular networks. Using trace-driven simulations, we find that vApps can improve activity completion times by more than 3-fold, whereas vPNS can reduce the signaling load by a factor of 6 in cellular networks and reduce energy consumption by a factor of 2 on mobile devices.
Rethinking Security of Web-Based System Applications BIBAFull-Text 366-376
  Martin Georgiev; Suman Jana; Vitaly Shmatikov
Many modern desktop and mobile platforms, including Ubuntu, Google Chrome, Windows, and Firefox OS, support so called Web-based system applications that run outside the Web browser and enjoy direct access to native objects such as files, camera, and geolocation. We show that the access-control models of these platforms are (a) incompatible and (b) prone to unintended delegation of native-access rights: when applications request native access for their own code, they unintentionally enable it for untrusted third-party code, too. This enables malicious ads and other third-party content to steal users' OAuth authentication credentials, access camera on their devices, etc.
   We then design, implement, and evaluate PowerGate, a new access-control mechanism for Web-based system applications. It solves two key problems plaguing all existing platforms: security and consistency. First, unlike the existing platforms, PowerGate correctly protects native objects from unauthorized access. Second, PowerGate provides uniform access-control semantics across all platforms and is 100% backward compatible. PowerGate enables application developers to write well-defined native-object access policies with explicit principals such as "application's own local code" and "third-party Web code," is easy to configure, and incurs negligible performance overhead.
Cardinal Contests BIBAFull-Text 377-387
  Arpita Ghosh; Patrick Hummel
Contests are widely used as a means for effort elicitation in settings ranging from government R&D contests to online crowdsourcing contests on platforms such as Kaggle, Innocentive, or TopCoder. Such rank-order mechanisms -- where agents' rewards depend only on the relative ranking of their submissions' qualities -- are natural mechanisms for incentivizing effort when it is easier to obtain ordinal, rather than cardinal, information about agents' outputs, or where absolute measures of quality are unverifiable. An increasing number of online contests, however, rank entries according to some numerical evaluation of their absolute quality -- for instance, the performance of an algorithm on a test dataset, or the performance of an intervention in a randomized trial. Can the contest designer incentivize higher effort by making the rewards in an ordinal rank-order mechanism contingent on such cardinal information? We model and analyze cardinal contests, where a principal running a rank-order tournament has access to an absolute measure of the qualities of agents' submissions in addition to their relative rankings, and ask how modifying the rank-order tournament to incorporate cardinal information can improve incentives for effort. Our main result is that a simple threshold mechanism -- a mechanism that awards the prize for a rank if and only if the absolute quality of the agent at that rank exceeds a certain threshold -- is optimal amongst all mixed cardinal-ordinal mechanisms where the fraction of the jth prize awarded to the jth-ranked agent is any arbitrary non-decreasing function of her submission's quality. Further, the optimal threshold mechanism uses exactly the same threshold for each rank. We study what contest parameters determine the extent of the benefit from incorporating such cardinal information into an ordinal rank-order contest, and investigate the extent of improvement in equilibrium effort via numerical simulations.
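The threshold mechanism described in this abstract can be sketched in a few lines: agents are ranked by quality, and the prize for a rank is paid only if the quality of the agent at that rank exceeds a single threshold shared by all ranks. The function name, the tie-breaking rule (input order), and the example numbers are assumptions for illustration; choosing the optimal threshold is the subject of the paper and is not attempted here.

```python
def threshold_contest(qualities, prizes, threshold):
    """Pay the j-th prize to the j-th ranked agent only if her quality exceeds threshold.

    qualities: list of absolute quality scores, one per agent.
    prizes: list of prizes in rank order (first entry is the top prize).
    Returns a list of payouts aligned with `qualities`.
    """
    # Rank agents by quality, best first; ties keep input order (an assumption).
    ranked = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    payouts = [0.0] * len(qualities)
    for rank, agent in enumerate(ranked[:len(prizes)]):
        if qualities[agent] > threshold:
            payouts[agent] = prizes[rank]
    return payouts

# The third-ranked agent (quality 0.4) misses the threshold and gets nothing.
print(threshold_contest([0.9, 0.4, 0.7], [100.0, 50.0, 10.0], threshold=0.5))
```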
Accessible On-Line Floor Plans BIBAFull-Text 388-398
  Cagatay Goncu; Anuradha Madugalla; Simone Marinai; Kim Marriott
Better access to on-line information graphics is a pressing need for people who are blind or have severe vision impairment. We present a new model for accessible presentation of on-line information graphics and demonstrate its use for presenting floor plans. While floor plans are increasingly provided on-line, people who are blind are at best provided with only a high-level textual description. This makes it difficult for them to understand the spatial arrangement of the objects on the floor plan. Our new approach provides users with significantly better access to such plans. The users can automatically generate an accessible version of a floor plan from an on-line floor plan image quickly and independently by using a web service. This generates a simplified graphic showing the rooms, walls, doors and windows in the original floor plan as well as a textual overview. The accessible floor plan is presented on an iPad using audio feedback. As the users touch graphic elements on the screen, the element they are touching is described by speech and non-speech audio in order to help them navigate the graphic.
Network A/B Testing: From Sampling to Estimation BIBAFull-Text 399-409
  Huan Gui; Ya Xu; Anmol Bhasin; Jiawei Han
A/B testing, also known as bucket testing, split testing, or controlled experiment, is a standard way to evaluate user engagement or satisfaction from a new service, feature, or product. It is widely used in online websites, including social network sites such as Facebook, LinkedIn, and Twitter to make data-driven decisions. The goal of A/B testing is to estimate the treatment effect of a new change, which becomes intricate when users are interacting, i.e., the treatment effect of a user may spill over to other users via underlying social connections. When conducting these online controlled experiments, it is a common practice to make the Stable Unit Treatment Value Assumption (SUTVA) that each individual's response is affected by their own treatment only. Though this assumption simplifies the estimation of treatment effect, it does not hold when network interference is present, and may even lead to wrong conclusions.
   In this paper, we study the problem of network A/B testing in real networks, which have substantially different characteristics from the simulated random networks studied in previous works. We first examine the existence of network effect in a recent online experiment conducted at LinkedIn; secondly, we propose an efficient and effective estimator for Average Treatment Effect (ATE) considering the interference between users in real online experiments; finally, we apply our method in both simulations and a real world online experiment. The simulation results show that our estimator achieves better performance with respect to both bias and variance reduction. The real world online experiment not only demonstrates that large-scale network A/B test is feasible but also further validates many of our observations in the simulation studies.
User Session Identification Based on Strong Regularities in Inter-activity Time BIBAFull-Text 410-418
  Aaron Halfaker; Oliver Keyes; Daniel Kluver; Jacob Thebault-Spieker; Tien Nguyen; Kenneth Shores; Anuradha Uduwage; Morten Warncke-Wang
Session identification is a common strategy used to develop metrics for web analytics and perform behavioral analyses of user-facing systems. Past work has argued that session identification strategies based on an inactivity threshold are inherently arbitrary or has advocated that thresholds be set at about 30 minutes. In this work, we demonstrate a strong regularity in the temporal rhythms of user initiated events across several different domains of online activity (incl. video gaming, search, page views and volunteer contributions). We describe a methodology for identifying clusters of user activity and argue that the regularity with which these activity clusters appear implies a good rule-of-thumb inactivity threshold of about 1 hour. We conclude with implications that these temporal rhythms may have for system design based on our observations and theories of goal-directed human activity.
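Applying an inactivity threshold such as the 1-hour rule of thumb from this abstract is straightforward: a new session starts whenever the gap since the previous event exceeds the threshold. The sketch below is a minimal illustration; the function name `split_sessions` and the example timestamps are assumptions, and it does not reproduce the paper's methodology for deriving the threshold.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, threshold=timedelta(hours=1)):
    """Group one user's event timestamps into sessions.

    A new session begins whenever the time since the previous event
    is longer than `threshold`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= threshold:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # gap too long: start a new session
    return sessions

# Hypothetical events for one user on one day.
events = [
    datetime(2015, 5, 18, 9, 0),
    datetime(2015, 5, 18, 9, 20),
    datetime(2015, 5, 18, 11, 0),
    datetime(2015, 5, 18, 11, 45),
]
for session in split_sessions(events):
    print(len(session), "events starting at", session[0])
```

Passing `threshold=timedelta(minutes=30)` instead gives the 30-minute convention that the abstract attributes to past work.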
Incentivizing High Quality Crowdwork BIBAFull-Text 419-429
  Chien-Ju Ho; Aleksandrs Slivkins; Siddharth Suri; Jennifer Wortman Vaughan
We study the causal effects of financial incentives on the quality of crowdwork. We focus on performance-based payments (PBPs), bonus payments awarded to workers for producing high quality work. We design and run randomized behavioral experiments on the popular crowdsourcing platform Amazon Mechanical Turk with the goal of understanding when, where, and why PBPs help, identifying properties of the payment, payment structure, and the task itself that make them most effective. We provide examples of tasks for which PBPs do improve quality. For such tasks, the effectiveness of PBPs is not too sensitive to the threshold for quality required to receive the bonus, while the magnitude of the bonus must be large enough to make the reward salient. We also present examples of tasks for which PBPs do not improve quality. Our results suggest that for PBPs to improve quality, the task must be effort-responsive: the task must allow workers to produce higher quality work by exerting more effort. We also give a simple method to determine if a task is effort-responsive a priori. Furthermore, our experiments suggest that all payments on Mechanical Turk are, to some degree, implicitly performance-based in that workers believe their work may be rejected if their performance is sufficiently poor. Finally, we propose a new model of worker behavior that extends the standard principal-agent model from economics to include a worker's subjective beliefs about his likelihood of being paid, and show that the predictions of this model are in line with our experimental findings. This model may be useful as a foundation for theoretical studies of incentives in crowdsourcing markets.
Skolemising Blank Nodes while Preserving Isomorphism BIBAFull-Text 430-440
  Aidan Hogan
In this paper, we propose and evaluate a scheme to produce canonical labels for blank nodes in RDF graphs. These labels can be used as the basis for a Skolemisation scheme that gets rid of the blank nodes in an RDF graph by mapping them to globally canonical IRIs. Assuming no hash collisions, the scheme guarantees that two Skolemised graphs will be equal if and only if the two input graphs are isomorphic. Although the proposed scheme is exponential in the worst case, we claim that such cases are unlikely to be encountered in practice. To support these claims, we present the results of applying our Skolemisation scheme over a diverse collection of 43.5 million real-world RDF graphs (BTC-2014); we also provide results for some nasty synthetic cases.
Scalable Methods for Adaptively Seeding a Social Network BIBAFull-Text 441-451
  Thibaut Horel; Yaron Singer
In recent years, social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users. In many applications, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. An alternative approach one can consider is an adaptive method which selects users in a manner which targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. In particular, we develop algorithms for linear influence models with provable approximation guarantees that can be gracefully parallelized. To show the effectiveness of our methods we collected data from various verticals social network users follow. For each vertical, we collected data on the users who responded to a certain post as well as their neighbors, and applied our methods on this data. Our experiments show that adaptive seeding is scalable, and importantly, that it obtains dramatic improvements over standard approaches of information dissemination.
User Review Sites as a Resource for Large-Scale Sociolinguistic Studies BIBAFull-Text 452-461
  Dirk Hovy; Anders Johannsen; Anders Søgaard
Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by exploring a large new data source, international review websites with user profiles. They provide more text data than manually collected studies, and more meta-data than most available social media text. We describe the data and present various pilot studies, illustrating the usefulness of this resource for sociolinguistic studies. Our approach can help generate new research hypotheses based on data-driven findings across several countries and languages.
When Does Improved Targeting Increase Revenue? BIBAFull-Text 462-472
  Patrick Hummel; R. Preston McAfee
In second price auctions with symmetric bidders, we find that improved targeting via enhanced information disclosure decreases revenue when there are two bidders and increases revenue if there are at least four bidders. With asymmetries, improved targeting increases revenue if the most frequent winner wins less than 30.4% of the time, but can decrease revenue otherwise. We derive analogous results for position auctions. Finally, we show that revenue can vary non-monotonically with the number of bidders who are able to take advantage of improved targeting.
Social Status and Badge Design BIBAFull-Text 473-483
  Nicole Immorlica; Greg Stoddard; Vasilis Syrgkanis
Many websites encourage user participation via the use of virtual rewards like badges. While badges typically have no explicit value, they act as symbols of social status within a community. In this paper, we study how to design virtual incentive mechanisms that maximize total contributions to a website when users are motivated by social status. We consider a game-theoretic model where users exert costly effort to make contributions and, in return, are awarded with badges. The value of a badge is determined endogenously by the number of users who earn an equal or higher badge; as more users earn a particular badge, the value of that badge diminishes for all users. We show that among all possible mechanisms for assigning status-driven rewards, the optimal mechanism is a leaderboard with a cutoff: users that contribute less than a certain threshold receive nothing while the remaining are ranked by contribution. We next study the necessary features of approximately optimal mechanisms and find that approximate optimality is influenced by the convexity of status valuations. When status valuations are concave, any approximately optimal mechanism must contain a coarse status partition, i.e. a partition of users into status classes whose size will grow as the population grows. Conversely when status valuations are convex, we prove that fine partitioning, that is a partition of users into status classes whose size stays constant as the population grows, is necessary for approximate optimality.
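As an illustration of the "leaderboard with a cutoff" mechanism described in this abstract, here is a minimal sketch in Python. It is not the authors' code: the function name `leaderboard_with_cutoff`, the dictionary input, and the tie-handling rule (tied users share a rank) are assumptions made for this example.

```python
def leaderboard_with_cutoff(contributions, threshold):
    """Map each user to a rank, or to None if below the cutoff.

    contributions: dict of user -> numeric contribution (hypothetical input format).
    threshold: users contributing less than this receive nothing (None).
    Rank 1 is the top contributor; tied users share the same rank (an assumption).
    """
    # Only users at or above the threshold appear on the leaderboard.
    qualified = [c for c in contributions.values() if c >= threshold]
    ranks = {}
    for user, c in contributions.items():
        if c >= threshold:
            # Rank = 1 + number of qualified users who contributed strictly more.
            ranks[user] = 1 + sum(1 for other in qualified if other > c)
        else:
            ranks[user] = None
    return ranks


example = leaderboard_with_cutoff({"a": 5, "b": 1, "c": 9, "d": 5}, threshold=3)
```

In this example, user "b" falls below the cutoff and receives nothing, while the remaining users are ranked by contribution.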
Mapping Temporal Horizons: Analysis of Collective Future and Past related Attention in Twitter BIBAFull-Text 484-494
  Adam Jatowt; Émilien Antoine; Yukiko Kawai; Toyokazu Akiyama
Microblogging platforms such as Twitter have recently received much attention as great sources for live web sensing, real-time event detection and opinion analysis. Previous works usually assumed that tweets mainly describe "what's happening now". However, a large portion of tweets contains time expressions that refer to time frames within the past or the future. Such messages often reflect expectations or memories of social media users. In this work we investigate how microblogging users collectively refer to time. In particular, we analyze a half-year-long portion of Japanese tweets and a four-month-long collection of US tweets, and we quantify the collective temporal attention of users as well as other related temporal characteristics. This kind of knowledge is helpful in the context of growing interest in the detection and prediction of important events within social media. The exploratory analysis we perform is possible thanks to the development of a visual analytics framework for robust overview and easy detection of various regularities in the past and future-oriented thinking of Twitter users. We believe that the visualizations we provide and the findings we outline can also be valuable for sociologists and computer scientists to test and refine their models about time in natural language.
Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts BIBAFull-Text 495-505
  Madhav Jha; C. Seshadhri; Ali Pinar
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs on truly massive graphs employs clusters and massive parallelization.
   We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
Automatic Online Evaluation of Intelligent Assistants BIBAFull-Text 506-516
  Jiepu Jiang; Ahmed Hassan Awadallah; Rosie Jones; Umut Ozertem; Imed Zitouni; Ranjitha Gurunath Kulkarni; Omar Zia Khan
Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, it is challenging to evaluate them due to the varied and evolving number of tasks supported, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answers, it is expensive to evaluate each task individually. This paper is the first attempt to solve this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme of categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction, and those utilizing acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show our approach can accurately identify satisfactory and unsatisfactory sessions.
Incorporating Social Context and Domain Knowledge for Entity Recognition BIBAFull-Text 517-526
  Jie Tang; Zhanpeng Fang; Jimeng Sun
Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedical domains. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most of the documents are created in a social context by common authors via social interactions, such as reply and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances?
   In this paper, we propose the SOCINST model to formalize the problem into a probabilistic model. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may connect with each other, SOCINST can automatically construct a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model is also able to incorporate social relationships between users to help build social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution.
   We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. In ICDM'12 Contest, the proposed model clearly outperforms (+21.4%; p < 1e-5 with t-test) all the top contestants. In Weibo and I2B2, our results also show that the recognition accuracy of SOCINST is up to 5.3-26.6% better than those of several alternative methods.
Querying Web-Scale Information Networks Through Bounding Matching Scores BIBAFull-Text 527-537
  Jiahui Jin; Samamon Khemmarat; Lixin Gao; Junzhou Luo
Web-scale information networks containing billions of entities are common nowadays. Querying these networks can be modeled as a subgraph matching problem. Since information networks are incomplete and noisy in nature, it is important to discover answers that match exactly as well as answers that are similar to queries. Existing graph matching algorithms usually use graph indices to improve the efficiency of query processing. For web-scale information networks, it may not be feasible to build the graph indices due to the amount of work and the memory/storage required. In this paper, we propose an efficient algorithm for finding the best k answers for a given query without precomputing graph indices. The quality of an answer is measured by a matching score that is computed online. To speed up query processing, we propose a novel technique for bounding the matching scores during the computation. By using bounds, we can efficiently prune the answers that have low quality without having to evaluate all possible answers. The bounding technique can be implemented in a distributed environment, allowing our approach to efficiently answer the queries on web-scale information networks. We demonstrate the effectiveness and the efficiency of our approach through a series of experiments on real-world information networks. The results show that our bounding technique can reduce the running time by up to two orders of magnitude compared to an approach that does not use bounds.
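The general idea of pruning with score bounds can be illustrated with a generic top-k search. The sketch below is not the algorithm from the paper: the names `top_k_with_bounds`, `upper_bound`, and `exact_score` are hypothetical, and it assumes each candidate has a cheap upper bound that is never smaller than its exact (expensive) score.

```python
import heapq


def top_k_with_bounds(candidates, upper_bound, exact_score, k):
    """Return the k best candidates as (score, candidate) pairs, best first.

    Assumption: upper_bound(c) >= exact_score(c) for every candidate c.
    """
    if k <= 0:
        return []
    # Visit candidates in decreasing order of their upper bound, so the scan
    # can stop as soon as no remaining candidate can enter the top k.
    ordered = sorted(candidates, key=upper_bound, reverse=True)
    heap = []  # min-heap of the best (score, index, candidate) entries so far
    for idx, cand in enumerate(ordered):
        if len(heap) == k and upper_bound(cand) <= heap[0][0]:
            break  # every later candidate has an even smaller bound
        score = exact_score(cand)
        entry = (score, idx, cand)  # idx breaks ties without comparing candidates
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(s, c) for s, _, c in sorted(heap, reverse=True)]


# Toy data: (name, cheap upper bound, exact score).
items = [("a", 0.9, 0.5), ("b", 0.8, 0.8), ("c", 0.4, 0.3), ("d", 0.2, 0.1)]
evaluated = []


def exact(item):
    evaluated.append(item[0])  # record which candidates were fully scored
    return item[2]


best = top_k_with_bounds(items, upper_bound=lambda item: item[1], exact_score=exact, k=2)
```

In this toy run only "a" and "b" are scored exactly; "c" and "d" are pruned because their bounds cannot beat the current second-best score.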
LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition BIBAFull-Text 538-548
  YoungHoon Jung; Karl Stratos; Luca P. Carloni
Personal mobile devices offer a growing variety of personalized services that considerably enrich the user experience. This is made possible by increased access to personal information, which to a large extent is extracted from user email messages and archives. There are, however, two main issues. First, currently these services can be offered only by large web-service companies that can also deploy email services. Second, keeping a large amount of structured personal information on the cloud raises privacy concerns. To address these problems, we propose LN-Annote, a new method to extract personal information from the email that is locally available on mobile devices (without remote access to the cloud). LN-Annote enables third-party service providers to build a question-answering system on top of the local personal information without having to own the user data. In addition, LN-Annote mitigates the privacy concerns by keeping the structured personal information directly on the personal device. Our method is based on a named-entity recognizer trained in two separate steps: first using a common dataset on the cloud and then using a personal dataset in the mobile device at hand. Our contributions also include the optimization of the implementation of LN-Annote: in particular, we implemented an OpenCL version of the custom-training algorithm to leverage the Graphic Processing Unit (GPU) available on the mobile device. We present an extensive set of experimental results: besides proving the feasibility of our approach, they demonstrate its efficiency in terms of the named-entity extraction performance as well as the execution speed and the energy consumption on mobile devices.
Describing and Understanding Neighborhood Characteristics through Online Social Media BIBAFull-Text 549-559
  Mohamed Kafsi; Henriette Cramer; Bart Thomee; David A. Shamma
Geotagged data can be used to describe regions in the world and discover local themes. However, not all data produced within a region is necessarily specifically descriptive of that area. To surface the content that is characteristic for a region, we present the geographical hierarchy model (GHM), a probabilistic model based on the assumption that data observed in a region is a random mixture of content that pertains to different levels of a hierarchy. We apply the GHM to a dataset of 8 million Flickr photos in order to discriminate between content (i.e. tags) that specifically characterizes a region (e.g. neighborhood) and content that characterizes surrounding areas or more general themes. Knowledge of the discriminative and non-discriminative terms used throughout the hierarchy enables us to quantify the uniqueness of a given region and to compare similar but distant regions. Our evaluation demonstrates that our model improves upon traditional Naive Bayes classification by 47% and hierarchical TF-IDF by 27%. We further highlight the differences and commonalities with human reasoning about what is locally characteristic for a neighborhood, distilled from ten interviews and a survey that covered themes such as time, events, and prior regional knowledge.
Active Learning for Multi-relational Data Construction BIBAFull-Text 560-569
  Hiroshi Kajino; Akihiro Kishimoto; Adi Botea; Elizabeth Daly; Spyros Kotoulas
Knowledge on the Web relies heavily on multi-relational representations, such as RDF and Schema.org. Automatically extracting knowledge from documents and linking existing databases are common approaches to construct multi-relational data. Complementary to such approaches, there is still a strong demand for manually encoding human expert knowledge. For example, human annotation is necessary for constructing a common-sense knowledge base, which stores facts implicitly shared in a community, because such knowledge rarely appears in documents. As human annotation is both tedious and costly, an important research challenge is how to best use limited human resources while maximizing the quality of the resulting dataset. In this paper, we formalize the problem of dataset construction as active learning problems and present the Active Multi-relational Data Construction (AMDC) method. AMDC repeatedly interleaves multi-relational learning and expert input acquisition, allowing us to acquire helpful labels for data construction. Experiments on real datasets demonstrate that our solution increases the number of positive triples by a factor of 2.28 to 17.0, and that the predictive performance of the multi-relational model in AMDC is the highest or comparable to the best throughout the data construction process.
The Social World of Content Abusers in Community Question Answering BIBAFull-Text 570-580
  Imrul Kayes; Nicolas Kourtellis; Daniele Quercia; Adriana Iamnitchi; Francesco Bonchi
Community-based question answering platforms can be rich sources of information on a variety of specialized topics, from finance to cooking. The usefulness of such platforms depends heavily on user contributions (questions and answers), but also on respecting the community rules. As a crowd-sourced service, such platforms rely on their users for monitoring and flagging content that violates community rules. Common wisdom is to eliminate the users who receive many flags. Our analysis of a year of traces from a mature Q&A site shows that the number of flags does not tell the full story: on one hand, users with many flags may still contribute positively to the community. On the other hand, users who never get flagged are found to violate community rules and get their accounts suspended. This analysis, however, also shows that abusive users are betrayed by their network properties: we find strong evidence of homophilous behavior and use this finding to detect abusive users who go under the community radar. Based on our empirical observations, we build a classifier that is able to detect abusive users with an accuracy as high as 83%.
The Lifecycles of Apps in a Social Ecosystem BIBAFull-Text 581-591
  Isabel Kloumann; Lada Adamic; Jon Kleinberg; Shaomei Wu
Apps are emerging as an important form of on-line content, and they combine aspects of Web usage in interesting ways -- they exhibit a rich temporal structure of user adoption and long-term engagement, and they exist in a broader social ecosystem that helps drive these patterns of adoption and engagement. It has been difficult, however, to study apps in their natural setting since this requires a simultaneous analysis of a large set of popular apps and the underlying social network they inhabit. In this work we address this challenge through an analysis of the collection of apps on Facebook Login, developing a novel framework for analyzing both temporal and social properties. At the temporal level, we develop a retention model that represents a user's tendency to return to an app using a very small parameter set. At the social level, we organize the space of apps along two fundamental axes -- popularity and sociality -- and we show how a user's probability of adopting an app depends both on properties of the local network structure and on the match between the user's attributes, his or her friends' attributes, and the dominant attributes within the app's user population. We also develop models that show the importance of different feature sets with strong performance in predicting app success.
Getting More for Less: Optimized Crowdsourcing with Dynamic Tasks and Goals BIBAFull-Text 592-602
  Ari Kobren; Chun How Tan; Panagiotis Ipeirotis; Evgeniy Gabrilovich
In crowdsourcing systems, the interests of contributing participants and system stakeholders are often not fully aligned. Participants seek to learn, be entertained, and perform easy tasks, which offer them instant gratification; system stakeholders want users to complete more difficult tasks, which bring higher value to the crowdsourced application. We directly address this problem by presenting techniques that optimize the crowdsourcing process by jointly maximizing the user longevity in the system and the true value that the system derives from user participation.
   We first present models that predict the "survival probability" of a user at any given moment, that is, the probability that a user will proceed to the next task offered by the system. We then leverage this survival model to dynamically decide what task to assign and what motivating goals to present to the user. This allows us to jointly optimize for the short term (getting difficult tasks done) and for the long term (keeping users engaged for longer periods of time).
   We show that dynamically assigning tasks significantly increases the value of a crowdsourcing system. In an extensive empirical evaluation, we observed that our task allocation strategy increases the amount of information collected by up to 117.8%. We also explore the utility of motivating users with goals. We demonstrate that setting specific, static goals can be highly detrimental to the long-term user participation, as the completion of a goal (e.g., earning a badge) is also a common drop-off point for many users. We show that setting the goals dynamically, in conjunction with judicious allocation of tasks, increases the amount of information collected by the crowdsourcing system by up to 249%, compared to the existing baselines that use fixed objectives.
Evolution of Conversations in the Age of Email Overload BIBAFull-Text 603-613
  Farshad Kooti; Luca Maria Aiello; Mihajlo Grbovic; Kristina Lerman; Amin Mantrach
Email is a ubiquitous communications tool in the workplace and plays an important role in social interactions. Previous studies of email were largely based on surveys and limited to relatively small populations of email users within organizations. In this paper, we report results of a large-scale study of more than 2 million users exchanging 16 billion emails over several months. We quantitatively characterize the replying behavior in conversations within pairs of users. In particular, we study the time it takes the user to reply to a received message and the length of the reply sent. We consider a variety of factors that affect the reply time and length, such as the stage of the conversation, user demographics, and use of portable devices. In addition, we study how increasing load affects emailing behavior. We find that as users receive more email messages in a day, they reply to a smaller fraction of them, using shorter replies. However, their responsiveness remains intact, and they may even reply to emails faster. Finally, we predict the time to reply, length of reply, and whether the reply ends a conversation. We demonstrate considerable improvement over the baseline in all three prediction tasks, showing the significant role that the factors we uncover play in determining replying behavior. We rank these factors based on their predictive power. Our findings have important implications for understanding human behavior and designing better email management applications for tasks like ranking unread emails.
Events and Controversies: Influences of a Shocking News Event on Information Seeking BIBAFull-Text 614-624
  Danai Koutra; Paul N. Bennett; Eric Horvitz
It has been suggested that online search and retrieval contributes to the intellectual isolation of users within their preexisting ideologies, where people's prior views are strengthened and alternative viewpoints are infrequently encountered. This so-called "filter bubble" phenomenon has been called out as especially detrimental when it comes to dialog among people on controversial, emotionally charged topics, such as the labeling of genetically modified food, the right to bear arms, the death penalty, and online privacy. We seek to identify and study information-seeking behavior and access to alternative versus reinforcing viewpoints following shocking, emotional, and large-scale news events. We choose for a case study to analyze search and browsing on gun control/rights, a strongly polarizing topic for both citizens and leaders of the United States. We study the period of time preceding and following a mass shooting to understand how its occurrence, follow-on discussions, and debate may have been linked to changes in the patterns of searching and browsing. We employ information-theoretic measures to quantify the diversity of Web domains of interest to users and understand the browsing patterns of users. We use these measures to characterize the influence of news events on these web search and browsing patterns.
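The abstract does not say which information-theoretic measures are used. As one common example of such a measure, the sketch below computes the Shannon entropy of the Web domains a user visited; the function name `domain_entropy` and the choice of entropy in bits are assumptions for illustration, not the authors' definition.

```python
import math
from collections import Counter


def domain_entropy(visited_domains):
    """Shannon entropy (in bits) of the empirical distribution over domains.

    0.0 means all visits went to a single domain; higher values mean the
    visits are spread more evenly over more domains.
    """
    counts = Counter(visited_domains)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no visits: treat diversity as zero (a convention chosen here)
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


single_source = domain_entropy(["a.example"] * 4)
four_sources = domain_entropy(["a.example", "b.example", "c.example", "d.example"])
```

Under this measure, a user who reads a single domain has zero diversity, while a user whose visits are spread evenly over four domains has two bits of diversity.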
Statistically Significant Detection of Linguistic Change BIBAFull-Text 625-635
  Vivek Kulkarni; Rami Al-Rfou; Bryan Perozzi; Steven Skiena
We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.
   We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
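The final step described above, building a distance-based time series for one word, can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the per-period word vectors have already been trained and aligned into one coordinate system, uses cosine distance from the first period as the displacement measure, and the names `cosine_distance` and `displacement_series` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def displacement_series(vectors_by_period):
    """Distance of a word's vector in each period from its first-period vector.

    vectors_by_period: list of aligned, non-zero vectors, one per time period.
    """
    base = vectors_by_period[0]
    return [cosine_distance(base, v) for v in vectors_by_period]


# Toy 2-D vectors for one word over three periods.
series = displacement_series([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

A change point detection method would then be run on a series like this one to decide whether a jump (here, in the third period) is statistically significant.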
Replacing the Irreplaceable: Fast Algorithms for Team Member Recommendation BIBAFull-Text 636-646
  Liangyue Li; Hanghang Tong; Nan Cao; Kate Ehrlich; Yu-Ru Lin; Norbou Buchler
In this paper, we study the problem of TEAM MEMBER REPLACEMENT -- given a team of people embedded in a social network working on the same task, find a good candidate to best replace a team member who becomes unavailable to perform the task for some reason (e.g., conflicts of interest or resource capacity). Prior studies in teamwork have suggested that a good team member replacement should bring synergy to the team in terms of having both skill matching and structure matching. However, existing techniques either do not cover both aspects or consider the two aspects independently. In this work, we propose a novel problem formulation using the concept of graph kernels that takes into account the interaction of both skill and structure matching requirements. To tackle the computational challenges, we propose a family of fast algorithms by (a) designing effective pruning strategies, and (b) exploring the smoothness between the existing and the new team structures. We conduct extensive experimental evaluations and user studies on real world datasets to demonstrate the effectiveness and efficiency. Our algorithms (a) perform significantly better than the alternative choices in terms of both precision and recall and (b) scale sub-linearly.
Robust Group Linkage BIBAFull-Text 647-657
  Pei Li; Xin Luna Dong; Songtao Guo; Andrea Maurino; Divesh Srivastava
We study the problem of group linkage: linking records that refer to multiple entities in the same group. Applications for group linkage include finding businesses in the same chain, finding social network users from the same organization, and so on. Group linkage faces new challenges compared to traditional entity resolution. First, although different members in the same group can share some similar global values of an attribute, they represent different entities so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, we need to be able to distinguish local values from erroneous values.
   We present a robust two-stage algorithm: the first stage identifies pivots -- maximal sets of records that are very likely to belong to the same group, while being robust to possible erroneous values; the second stage collects strong evidence from the pivots and leverages it for merging more records into the same group, while being tolerant to differences in local values of an attribute. Experimental results show the high effectiveness and efficiency of our algorithm on various real-world data sets.
Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach BIBAFull-Text 658-668
  Yixuan Li; Kun He; David Bindel; John E. Hopcroft
Large graphs arise in a number of contexts and understanding their structure and extracting information from them is an important research area. Early algorithms on mining communities have focused on the global structure, and often run in time functional to the size of the entire graph. Nowadays, as we often explore networks with billions of vertices and find communities of size hundreds, it is crucial to shift our attention from macroscopic structure to microscopic structure when dealing with large networks. A growing body of work has been adopting local expansion methods in order to identify the community from a few exemplary seed members.
   In this paper, we propose a novel approach for finding overlapping communities called LEMON (Local Expansion via Minimum One Norm). Different from PageRank-like diffusion methods, LEMON finds the community by seeking a sparse vector in the span of the local spectra such that the seeds are in its support. We show that LEMON can achieve the highest detection accuracy among state-of-the-art proposals. The running time depends on the size of the community rather than that of the entire graph. The algorithm is easy to implement, and is highly parallelizable.
   Moreover, given that networks are not all similar in nature, a comprehensive analysis on how the local expansion approach is suited for uncovering communities in different networks is still lacking. We thoroughly evaluate our approach using both synthetic and real-world datasets across different domains, and analyze the empirical variations when applying our method to inherently different networks in practice. In addition, the heuristics on how the quality and quantity of the seed set would affect the performance are provided.
Scalable Parallel EM Algorithms for Latent Dirichlet Allocation in Multi-Core Systems BIBAFull-Text 669-679
  Xiaosheng Liu; Jia Zeng; Xi Yang; Jianfeng Yan; Qiang Yang
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling tool for content analysis such as web mining. To handle web-scale content analysis on just a single PC, we propose multi-core parallel expectation-maximization (PEM) algorithms to infer and estimate LDA parameters in shared memory systems. By avoiding memory access conflicts, reducing the locking time among multiple threads and residual-based dynamic scheduling, we show that PEM algorithms are more scalable and accurate than the current state-of-the-art parallel LDA algorithms on a commodity PC. This parallel LDA toolbox is made publicly available as open source software at mloss.org.
Grading the Graders: Motivating Peer Graders in a MOOC BIBAFull-Text 680-690
  Yanxin Lu; Joe Warren; Christopher Jermaine; Swarat Chaudhuri; Scott Rixner
In this paper, we detail our efforts at creating and running a controlled study designed to examine how students in a MOOC might be motivated to do a better job during peer grading. This study involves more than one thousand students of a popular MOOC. We ask two specific questions: (1) When a student knows that his or her own peer grading efforts are being examined by peers, does this knowledge alone tend to motivate the student to do a better job when grading assignments? And (2) when a student not only knows that his or her own peer grading efforts are being examined by peers, but he or she is also given a number of other peer grading efforts to evaluate (so the peer graders see how other peer graders evaluate assignments), do both of these together tend to motivate the student to do a better job when grading assignments? We find strong statistical evidence that "grading the graders" does in fact tend to increase the quality of peer grading.
Measurement and Analysis of Mobile Web Cache Performance BIBAFull-Text 691-701
  Yun Ma; Xuanzhe Liu; Shuhui Zhang; Ruirui Xiang; Yunxin Liu; Tao Xie
The Web browser is a killer app on mobile devices such as smartphones. However, the user experience of mobile Web browsing is undesirable because of slow resource loading. To improve the performance of Web resource loading, caching has been adopted as a key mechanism. However, the existing passive measurement studies cannot comprehensively characterize the performance of mobile Web caching. For example, most of these studies mainly focus on client-side implementations but not server-side configurations, suffer from biased user behaviors, and fail to study "miscached" resources. To address these issues, in this paper, we present a proactive approach for a comprehensive measurement study on mobile Web cache performance. The key idea of our approach is to proactively crawl resources from hundreds of websites periodically with a fine-grained time interval. Thus, we are able to uncover the resource update history and cache configurations at the server side, and analyze the cache performance in various time granularities. Based on our collected data, we build a new cache analysis model and study the upper bound on the percentage of resources that could potentially be cached and how effectively caching works in practice. We report detailed analysis results of different websites and various types of Web resources, and identify the problems caused by unsatisfactory cache performance. In particular, we identify two major problems -- Redundant Transfer and Miscached Resource, which lead to unsatisfactory cache performance. We investigate three main root causes: Same Content, Heuristic Expiration, and Conservative Expiration Time, and discuss what mobile Web developers can do to mitigate those problems.

Technical Papers 2

SCULPT: A Schema Language for Tabular Data on the Web BIBAFull-Text 702-720
  Wim Martens; Frank Neven; Stijn Vansummeren
Inspired by the recent working effort towards a recommendation by the World Wide Web Consortium (W3C) for tabular data and metadata on the Web, we present in this paper a concept for a schema language for tabular web data called SCULPT. The language consists of rules constraining and defining the structure of regions in the table. These regions are defined through the novel formalism of region selection expressions. We present a formal model for SCULPT and obtain a linear time combined complexity evaluation algorithm. In addition, we consider weak and strong streaming evaluation for SCULPT and present a SCULPT fragment for each of these streaming variants. Finally, we discuss several extensions of SCULPT including alternative semantics, types, complex content, and explore region selection expressions as a basis for a transformation language.
The Web as a Jungle: Non-Linear Dynamical Systems for Co-evolving Online Activities BIBAFull-Text 721-731
  Yasuko Matsubara; Yasushi Sakurai; Christos Faloutsos
Given a large collection of co-evolving online activities, such as searches for the keywords "Xbox", "PlayStation" and "Wii", how can we find patterns and rules? Are these keywords related? If so, are they competing against each other? Can we forecast the volume of user activity for the coming month? We conjecture that online activities compete for user attention in the same way that species in an ecosystem compete for food. We present ECOWEB (i.e., Ecosystem on the Web), which is an intuitive model designed as a non-linear dynamical system for mining large-scale co-evolving online activities. Our second contribution is a novel, parameter-free, and scalable fitting algorithm, ECOWEB-FIT, that estimates the parameters of ECOWEB. Extensive experiments on real data show that ECOWEB is effective, in that it can capture long-range dynamics and meaningful patterns such as seasonalities, and practical, in that it can provide accurate long-range forecasts. ECOWEB consistently outperforms existing methods in terms of both accuracy and execution speed.
Spanning Edge Centrality: Large-scale Computation and Applications BIBAFull-Text 732-742
  Charalampos Mavroforakis; Richard Garcia-Lebron; Ioannis Koutis; Evimaria Terzi
The spanning centrality of an edge e in an undirected graph G is the fraction of the spanning trees of G that contain e. Despite its appealing definition and apparent value in certain applications in computational biology, spanning centrality has not so far received wider attention as a measure of edge centrality. We may partially attribute this to the perceived complexity of computing it, which appears to be prohibitive for very large networks. Contrary to this intuition, spanning centrality can in fact be approximated arbitrarily well by very efficient near-linear time algorithms due to Spielman and Srivastava, combined with progress in linear system solvers. In this article we bring theory into practice, with careful and optimized implementations that allow the fast computation of spanning centrality in very large graphs with millions of nodes. With this computational tool at our disposal, we demonstrate experimentally that spanning centrality is in fact a useful tool for the analysis of large networks. Specifically, we show that, relative to common centrality measures, spanning centrality is more effective in identifying edges whose removal causes a higher disruption in an information propagation procedure, while being very resilient to noise, in terms of both the edge scores and the resulting edge ranking.
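The definition in the abstract above can be checked by hand on tiny graphs. The following is a minimal sketch, not the near-linear approximation the paper describes: it counts spanning trees exactly with Kirchhoff's matrix-tree theorem and uses the identity that the trees containing e are all trees of G minus the trees of G with e removed. The function names and the integer node labelling (0 to n-1) are assumptions made for illustration, and the approach is only practical for very small graphs.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Count spanning trees of an undirected graph on nodes 0..n-1.

    Uses the matrix-tree theorem: the count equals the determinant of the
    graph Laplacian with one row and the matching column removed.
    """
    if n == 1:
        return 1
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    # Drop row 0 and column 0, then take the determinant by exact elimination.
    m = [row[1:] for row in lap[1:]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular matrix: the graph is disconnected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


def spanning_centrality(n, edges):
    """Return {edge: fraction of spanning trees containing that edge}."""
    total = count_spanning_trees(n, edges)
    if total == 0:
        raise ValueError("graph must be connected")
    result = {}
    for i, edge in enumerate(edges):
        without = count_spanning_trees(n, edges[:i] + edges[i + 1:])
        result[edge] = Fraction(total - without, total)
    return result


# Triangle 0-1-2 with a pendant edge 2-3 (a bridge).
example = spanning_centrality(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
for edge, score in example.items():
    print(edge, score)
```

In this example the bridge (2, 3) lies in every spanning tree, so its score is 1, while each triangle edge is in two of the three trees. The scores of all edges sum to n - 1, since every spanning tree has exactly n - 1 edges.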
No Escape From Reality: Security and Privacy of Augmented Reality Browsers BIBAFull-Text 743-753
  Richard McPherson; Suman Jana; Vitaly Shmatikov
Augmented reality (AR) browsers are an emerging category of mobile applications that add interactive virtual objects to the user's view of the physical world. This paper gives the first system-level evaluation of their security and privacy properties.
   We start by analyzing the functional requirements that AR browsers must support in order to present AR content. We then investigate the security architecture of Junaio, Layar, and Wikitude browsers, which are running today on over 30 million mobile devices, and identify new categories of security and privacy vulnerabilities unique to AR browsers. Finally, we provide the first engineering guidelines for securely implementing AR functionality.
Discovering Meta-Paths in Large Heterogeneous Information Networks BIBAFull-Text 754-764
  Changping Meng; Reynold Cheng; Silviu Maniu; Pierre Senellart; Wangda Zhang
The Heterogeneous Information Network (HIN) is a graph data model in which nodes and edges are annotated with class and relationship labels. Large and complex datasets, such as Yago or DBLP, can be modeled as HINs. Recent work has studied how to make use of these rich information sources. In particular, meta-paths, which represent sequences of node classes and edge types between two nodes in a HIN, have been proposed for such tasks as information retrieval, decision making, and product recommendation. Current methods assume meta-paths are found by domain experts. However, in a large and complex HIN, retrieving meta-paths manually can be tedious and difficult. We thus study how to discover meta-paths automatically. Specifically, users are asked to provide example pairs of nodes that exhibit high proximity. We then investigate how to generate meta-paths that can best explain the relationship between these node pairs. Since this problem is computationally intractable, we propose a greedy algorithm to select the most relevant meta-paths. We also present a data structure to enable efficient execution of this algorithm. We further incorporate hierarchical relationships among node classes in our solutions. Extensive experiments on real-world HINs show that our approach captures important meta-paths in an efficient and scalable manner.
From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search BIBAFull-Text 765-775
  Iris Miliaraki; Roi Blanco; Mounia Lalmas
Consider a user who submits a search query "Shakira" having a specific search goal in mind (such as her age) but at the same time willing to explore information for other entities related to her, such as comparable singers. In previous work, a system called Spark was developed to provide such a search experience. Given a query submitted to the Yahoo search engine, Spark provides related entity suggestions for the query, exploiting, among other sources, public knowledge bases from the Semantic Web. We refer to this search scenario as explorative entity search. The effectiveness and efficiency of the approach have been demonstrated in previous work. The way users interact with these related entity suggestions and whether this interaction can be predicted have however not been studied. In this paper, we perform a large-scale analysis into how users interact with the entity results returned by Spark. We characterize the users, queries and sessions that appear to promote an explorative behavior. Based on this analysis, we develop a set of query and user-based features that reflect the click behavior of users and explore their effectiveness in the context of a prediction task.
Children Seen But Not Heard: When Parents Compromise Children's Online Privacy BIBAFull-Text 776-786
  Tehila Minkus; Kelvin Liu; Keith W. Ross
Children's online privacy has garnered much attention in media, legislation, and industry. Adults are concerned that children may not adequately protect themselves online. However, relatively little discussion has focused on the privacy breaches that may occur to children at the hands of others, namely, their parents and relatives. When adults post information online, they may reveal personal information about their children to other people, online services, data brokers, or surveillant authorities. This information can be gathered in an automated fashion and then linked with other online and offline sources, creating detailed profiles which can be continually enhanced throughout the children's lives.
   In this paper, we conduct a study to see how widespread these behaviors are among adults on Facebook and Instagram. We use a number of methods. Firstly, we automate a process to examine 2,383 adult users on Facebook for evidence of children in their public photo albums. Using the associated comments in combination with publicly available voter registration records, we are able to infer children's names, faces, birth dates, and addresses. Secondly, in order to understand what additional information is available to Facebook and the users' friends, we survey 357 adult Facebook users about their behaviors and attitudes with regard to posting their children's information online. Thirdly, we analyze 1,089 users on Instagram to infer facts about their children.
   Finally, we make recommendations for privacy-conscious parents and suggest an interface change through which Facebook can nudge parents towards better stewardship of their children's privacy.
TrueView: Harnessing the Power of Multiple Review Sites BIBAFull-Text 787-797
  Amanda J. Minnich; Nikan Chavoshi; Abdullah Mueen; Shuang Luan; Michalis Faloutsos
Online reviews on products and services can be very useful for customers, but they need to be protected from manipulation. So far, most studies have focused on analyzing online reviews from a single hosting site. How could one leverage information from multiple review hosting sites? This is the key question in our work. In response, we develop a systematic methodology to merge, compare, and evaluate reviews from multiple hosting sites. We focus on hotel reviews and use more than 15 million reviews from more than 3.5 million users spanning three prominent travel sites. Our work consists of three thrusts: (a) we develop novel features capable of identifying cross-site discrepancies effectively, (b) we conduct arguably the first extensive study of cross-site variations using real data, and develop a hotel identity-matching method with 93% accuracy, (c) we introduce the TrueView score, as a proof of concept that cross-site analysis can better inform the end user. Our results show that: (1) we detect 7 times more suspicious hotels by using multiple sites compared to using the three sites in isolation, and (2) we find that 20% of all hotels appearing in all three sites seem to have a low trustworthiness score. Our work is an early effort that explores the advantages and the challenges in using multiple reviewing sites towards more informed decision making.
QUOTUS: The Structure of Political Media Coverage as Revealed by Quoting Patterns BIBAFull-Text 798-808
  Vlad Niculae; Caroline Suen; Justine Zhang; Cristian Danescu-Niculescu-Mizil; Jure Leskovec
Given the extremely large pool of events and stories available, media outlets need to focus on a subset of issues and aspects to convey to their audience. Outlets are often accused of exhibiting a systematic bias in this selection process, with different outlets portraying different versions of reality. However, in the absence of objective measures and empirical evidence, the direction and extent of systematicity remain widely disputed. In this paper we propose a framework based on quoting patterns for quantifying and characterizing the degree to which media outlets exhibit systematic bias. We apply this framework to a massive dataset of news articles spanning the six years of Obama's presidency and all of his speeches, and reveal that a systematic pattern does indeed emerge from the outlets' quoting behavior. Moreover, we show that this pattern can be successfully exploited in an unsupervised prediction setting, to determine which new quotes an outlet will select to broadcast. By encoding bias patterns in a low-rank space we provide an analysis of the structure of political media coverage. This reveals a latent media bias space that aligns surprisingly well with political ideology and outlet type. A linguistic analysis exposes striking differences across these latent dimensions, showing how the different types of media outlets portray different realities even when reporting on the same events. For example, outlets mapped to the mainstream conservative side of the latent space focus on quotes that portray a presidential persona disproportionately characterized by negativity.
Energy and Performance of Smartphone Radio Bundling in Outdoor Environments BIBAFull-Text 809-819
  Ana Nika; Yibo Zhu; Ning Ding; Abhilash Jindal; Y. Charlie Hu; Xia Zhou; Ben Y. Zhao; Haitao Zheng
Most of today's mobile devices come equipped with both cellular LTE and WiFi wireless radios, making radio bundling (simultaneous data transfers over multiple interfaces) both appealing and practical. Despite recent studies documenting the benefits of radio bundling with MPTCP, many fundamental questions remain about potential gains from radio bundling, or the relationship between performance and energy consumption in these scenarios. In this study, we seek to answer these questions using extensive measurements to empirically characterize both energy and performance for radio bundling approaches. In doing so, we quantify potential gains of bundling using MPTCP versus an ideal protocol. We study the links between traffic partitioning and bundling performance, and use a novel componentized energy model to quantify the energy consumed by CPUs (and radios) during traffic management. Our results show that MPTCP achieves only a fraction of the total performance gain possible, and that its energy-agnostic design leads to considerable power consumption by the CPU. We conclude that not only is there room for improved bundling performance, but an energy-aware bundling protocol is also likely to achieve a much better tradeoff between performance and power consumption.
PriVaricator: Deceiving Fingerprinters with Little White Lies BIBAFull-Text 820-830
  Nick Nikiforakis; Wouter Joosen; Benjamin Livshits
Researchers have shown that, in recent years, unwanted web tracking is on the rise, with browser-based fingerprinting being adopted by more and more websites as a viable alternative to third-party cookies. In this paper we propose PriVaricator, a solution to the problem of browser-based fingerprinting. A key insight is that when it comes to web tracking, the real problem with fingerprinting is not uniqueness of a fingerprint, it is linkability, i.e., the ability to connect the same fingerprint across multiple visits. Thus, making fingerprints non-deterministic also makes them hard to link across browsing sessions. In PriVaricator we use the power of randomization to "break" linkability by exploring a space of parameterized randomization policies. We evaluate our techniques in terms of being able to prevent fingerprinting and not breaking existing (benign) sites. The best of our randomization policies renders all the fingerprinters we tested ineffective, while causing minimal damage on a set of 1000 Alexa sites on which we tested, with no noticeable performance overhead.
Diagnoses, Decisions, and Outcomes: Web Search as Decision Support for Cancer BIBAFull-Text 831-841
  Michael J. Paul; Ryen W. White; Eric Horvitz
People diagnosed with a serious illness often turn to the Web for their rising information needs, especially when decisions are required. We analyze the search and browsing behavior of searchers who show a surge of interest in prostate cancer. Prostate cancer is the most common serious cancer in men and is a leading cause of cancer-related death. Diagnoses of prostate cancer typically involve reflection and decision making about treatment based on assessments of preferences and outcomes. We annotated timelines of treatment-related queries from nearly 300 searchers with tags indicating different phases of treatment, including decision making, preparation, and recovery. Using this corpus, we present a variety of analyses toward the goal of understanding search and decision making about treatments. We characterize search queries and the content of accessed pages for different treatment phases, model search behavior during the decision-making phase, and create an aggregate alignment of treatment timelines illustrated with a variety of visualizations. The experiments provide insights about how people who are engaged in intensive searches about prostate cancer over an extended period of time pursue and access information from the Web.
PocketTrend: Timely Identification and Delivery of Trending Search Content to Mobile Users BIBAFull-Text 842-852
  Gennady Pekhimenko; Dimitrios Lymberopoulos; Oriana Riva; Karin Strauss; Doug Burger
Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics are formed and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Based on our findings, we design and evaluate PocketTrend, a system that automatically detects trending topics in real time, identifies the search content associated with the topics, and then intelligently pushes this content to users in a timely manner. In that way, PocketTrend enables a client-side search engine that can instantly answer user queries related to trending events, while at the same time reducing the impact of these trends on the datacenter workload. Our results, using real mobile search logs, show that in the presence of a trending event, up to 13-17% of the overall search traffic can be eliminated from the datacenter, with as many as 19% of all users benefiting from PocketTrend.
Overcoming Relational Learning Biases to Accurately Predict Preferences in Large Scale Networks BIBAFull-Text 853-863
  Joseph J. Pfeiffer III; Jennifer Neville; Paul N. Bennett
Many individuals on social networking sites provide traits about themselves, such as interests or demographics. Social networking sites can use this information to provide better content to match their users' interests, such as recommending scheduled events or various relevant products. These tasks require accurate probability estimates to determine the correct answer to return. Relational machine learning (RML) is an excellent framework for these problems as it jointly models the user labels given their attributes and the relational structure. Further, semi-supervised learning methods could enable RML methods to exploit the large amount of unlabeled data in networks.
   However, existing RML approaches have limitations that prevent their application in large scale domains. First, semi-supervised methods for RML do not fully utilize all the unlabeled instances in the network. Second, the collective inference procedures necessary to jointly infer the missing labels are generally viewed as too expensive to apply in large scale domains. In this work, we address each of these limitations. We analyze the effect of full semi-supervised RML and find that collective inference methods can introduce considerable bias into predictions. We correct this by implementing a maximum entropy constraint on the inference step, forcing the predictions to have the same distribution as the observed labels. Next, we outline a massively scalable variational inference algorithm for large scale relational network domains. We extend this inference algorithm to incorporate the maximum entropy constraint, proving that it only requires a constant amount of overhead while remaining massively parallel. We demonstrate our method's improvement over a variety of baselines on seven real world datasets, including large scale networks with over five million edges.
Deriving an Emergent Relational Schema from RDF Data BIBAFull-Text 864-874
  Minh-Duc Pham; Linnea Passing; Orri Erling; Peter Boncz
We motivate and describe techniques for detecting an "emergent" relational schema from RDF data. We show that on a wide variety of datasets, the found structure explains well over 90% of the RDF triples. Further, we also describe technical solutions to the semantic challenge of giving these emergent tables, columns and relationships between tables short names that humans find logical. Our techniques can be exploited in many ways, e.g., to improve the efficiency of SPARQL systems, or to use existing SQL-based applications on top of any RDF dataset using an RDBMS.
The Digital Life of Walkable Streets BIBAFull-Text 875-884
  Daniele Quercia; Luca Maria Aiello; Rossano Schifanella; Adam Davies
Walkability has many health, environmental, and economic benefits. That is why web and mobile services have been offering ways of computing walkability scores of individual street segments. Those scores are generally computed from survey data and manual counting (even of trees). However, that is costly in time, effort, and money. To partly automate the computation of those scores, we explore the possibility of using the social media data of Flickr and Foursquare to automatically identify safe and walkable streets. We find that unsafe streets tend to be photographed during the day, while walkable streets are tagged with walkability-related keywords. These results open up practical opportunities (for, e.g., room booking services, urban route recommenders, and real-estate sites) and have theoretical implications for researchers who might resort to the use of social media data to tackle previously unanswered questions in the area of walkability.
Beyond Models: Forecasting Complex Network Processes Directly from Data BIBAFull-Text 885-895
  Bruno Ribeiro; Minh X. Hoang; Ambuj K. Singh
Complex network phenomena -- such as information cascades in online social networks -- are hard to fully observe, model, and forecast. In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data. Extrapolating this trend into the future, eventually we would renounce models altogether. But is it possible to forecast the evolution of a complex stochastic process directly from the data without a model? In this work we show that model-free forecasting is possible. We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence using two general axioms and historical data. To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes. Our simulations using simple and complex evolving processes and tests performed on a large real-world dataset show promising results.
Weakly Supervised Extraction of Computer Security Events from Twitter BIBAFull-Text 896-905
  Alan Ritter; Evan Wright; William Casey; Tom Mitchell
Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced or when a Distributed Denial of Service (DDoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach, in which extractors for new categories of events are easy to define and train, by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data is available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. Our approach greatly outperforms heuristic negatives, used in most previous work, in experiments on real-world data. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DDoS attacks, data breaches and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
Groupsourcing: Team Competition Designs for Crowdsourcing BIBAFull-Text 906-915
  Markus Rokicki; Sergej Zerr; Stefan Siersdorfer
Many data processing tasks such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models require human input, and, on a large scale, can only be accurately solved using crowd-based online work. Recent work shows that frameworks where crowd workers compete against each other can drastically reduce crowdsourcing costs, and outperform conventional reward schemes where the payment of online workers is proportional to the number of accomplished tasks ("pay-per-task"). In this paper, we investigate how team mechanisms can be leveraged to further improve the cost efficiency of crowdsourcing competitions. To this end, we introduce strategies for team-based crowdsourcing, ranging from team formation processes where workers are randomly assigned to competing teams, through strategies involving self-organization where workers actively participate in team building, to combinations of team and individual competitions. Our large-scale experimental evaluation with more than 1,100 participants and overall 5,400 hours of work spent by crowd workers demonstrates that our team-based crowdsourcing mechanisms are well accepted by online workers and lead to substantial performance boosts.
Authentication Melee: A Usability Analysis of Seven Web Authentication Systems BIBAFull-Text 916-926
  Scott Ruoti; Brent Roberts; Kent Seamons
Passwords continue to dominate the authentication landscape in spite of numerous proposals to replace them. Even though usability is a key factor in replacing passwords, very few alternatives have been subjected to formal usability studies, and even fewer have been analyzed using a standard metric. We report the results of four within-subjects usability studies for seven web authentication systems. These systems span federated, smartphone, paper token, and email-based approaches. Our results indicate that participants prefer single sign-on systems. We report several insightful findings based on participants' qualitative responses: (1) transparency increases usability but also leads to confusion and a lack of trust, (2) participants prefer single sign-on but wish to augment it with site-specific low-entropy passwords, and (3) participants are intrigued by biometrics and phone-based authentication. We utilize the System Usability Scale (SUS) as a standard metric for empirical analysis and find that it produces reliable, replicable results. SUS proves to be an accurate measure of baseline usability. We recommend that new authentication systems be formally evaluated for usability using SUS, and that they meet a minimum acceptable SUS score before receiving serious consideration.
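For readers unfamiliar with the SUS metric named in the abstract above, the following is a minimal sketch of the standard SUS scoring rule (ten items answered on a 1 to 5 scale; odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The function name and input format are assumptions for illustration; the abstract does not state the threshold the authors consider acceptable, so none is encoded here.

```python
def sus_score(responses):
    """Compute a System Usability Scale score in the range 0..100.

    `responses` holds the ten questionnaire answers in order, each an
    integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for index, answer in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - answer
    return total * 2.5


print(sus_score([3] * 10))  # neutral answers throughout
```

A participant who answers 3 to every item lands at the midpoint of the scale, which makes the neutral case a convenient sanity check.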
Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions BIBAFull-Text 927-937
  Ahmet Erdem Sariyuce; C. Seshadhri; Ali Pinar; Umit V. Catalyurek
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
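The abstract above says the nucleus decomposition generalizes the classic k-core decomposition. As background, here is a minimal sketch of that classic notion only, using the usual peeling procedure (repeatedly remove a minimum-degree vertex); it is not the nucleus decomposition algorithm from the paper. The adjacency-dictionary input and the function name are assumptions for illustration, and the quadratic-time minimum search is chosen for brevity rather than speed.

```python
def core_numbers(adj):
    """Return {vertex: core number} for an undirected graph.

    `adj` maps each vertex to the set of its neighbours. The core number of
    a vertex is the largest k such that the vertex belongs to a subgraph in
    which every vertex has degree at least k.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Triangle a-b-c with a pendant vertex d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(core_numbers(graph))
```

Here the triangle vertices form the 2-core and the pendant vertex only belongs to the 1-core, which already shows the containment hierarchy that the abstract describes for nuclei.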
Bringing CUPID Indoor Positioning System to Practice BIBAFull-Text 938-948
  Souvik Sen; Dongho Kim; Stephane Laroche; Kyu-Han Kim; Jeongkeun Lee
WiFi-based indoor positioning has recently gained more attention due to the advent of the IEEE 802.11v standard, requirements by the FCC for E911 calls, and increased interest in location-based services. While there exist several indoor localization techniques, we find that these techniques trade off accuracy, scalability, pervasiveness, or cost -- all of which are important requirements for a truly deployable positioning solution. Wireless signal-strength based approaches suffer from location errors, whereas time-of-flight (ToF) based solutions provide good accuracy but are not scalable. Recent solutions address these issues by augmenting WiFi with either smartphone sensing or mobile crowdsourcing. However, they require tight coupling between WiFi infrastructure and a client device, or they can determine the client's location only if it is mobile. In this paper, we present CUPID2.0, which improves our previously proposed CUPID indoor positioning system to overcome these limitations. We achieve this by addressing the fundamental limitations in ToF-based localization and combining ToF with signal strength to address scalability. Experiments from 6 cities using 40 different mobile devices, comprising more than 2.5 million location fixes, demonstrate feasibility. CUPID2.0 is currently in production, and we expect CUPID2.0 to ignite the wide adoption of WLAN-based positioning systems and their services.
Early Detection of Spam Mobile Apps BIBAFull-Text 949-959
  Suranga Seneviratne; Aruna Seneviratne; Mohamed Ali Kaafar; Anirban Mahanti; Prasant Mohapatra
Increased popularity of smartphones has attracted a large number of developers to various smartphone platforms. As a result, app markets are also populated with spam apps, which reduce the users' quality of experience and increase the workload of app market operators. Apps can be "spammy" in multiple ways including not having a specific functionality, unrelated app description or unrelated keywords and publishing similar apps several times and across diverse categories. Market operators maintain anti-spam policies and apps are removed through continuous human intervention. Through a systematic crawl of a popular app market and by identifying a set of removed apps, we propose a method to detect spam apps solely using app metadata available at the time of publication. We first propose a methodology to manually label a sample of removed apps, according to a set of checkpoint heuristics that reveal the reasons behind removal. This analysis suggests that approximately 35% of the apps being removed are very likely to be spam apps. We then map the identified heuristics to several quantifiable features and show how distinguishing these features are for spam apps. Finally, we build an Adaptive Boost classifier for early identification of spam apps using only the metadata of the apps. Our classifier achieves an accuracy over 95% with precision varying between 85%-95% and recall varying between 38%-98%. By applying the classifier on a set of apps present at the app market during our crawl, we estimate that at least 2.7% of them are spam apps.
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance BIBAFull-Text 960-970
  Masumi Shirakawa; Takahiro Hara; Shojiro Nishio
This paper first reveals the relationship between Inverse Document Frequency (IDF), a global term weighting scheme, and information distance, a universal metric defined by Kolmogorov complexity. We concretely give a theoretical explanation that the IDF of a term is equal to the distance between the term and the empty string in the space of information distance in which the Kolmogorov complexity is approximated using Web documents and the Shannon-Fano coding. Based on our findings, we propose N-gram IDF, a theoretical extension of IDF for handling words and phrases of any length. By comparing weights among N-grams of any N, N-gram IDF enables us to determine dominant N-grams among overlapping ones and extract key terms of any length from texts without using any NLP techniques. To efficiently compute the weight for all possible N-grams, we adopt two string processing techniques, i.e., maximal substring extraction using enhanced suffix array and document listing using wavelet tree. We conducted experiments on key term extraction and Web search query segmentation, and found that N-gram IDF was competitive with state-of-the-art methods that were designed for each application using additional resources and efforts. The results exemplified the potential of N-gram IDF.
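As a point of reference for the abstract above, the following minimal sketch computes the classic IDF, log2(N / df), for a token sequence of any length. It is not the paper's N-gram IDF weighting, which is derived from information distance; the function names and the toy corpus are assumptions made for illustration only.

```python
import math


def contains_ngram(tokens, ngram):
    """Return True if the token list contains the n-gram as a contiguous run."""
    n = len(ngram)
    return any(tuple(tokens[i:i + n]) == tuple(ngram)
               for i in range(len(tokens) - n + 1))


def idf(ngram, documents):
    """Classic IDF: log2(N / df), where df counts documents containing the n-gram.

    `documents` is a list of token lists (a hypothetical toy corpus).
    """
    df = sum(1 for tokens in documents if contains_ngram(tokens, ngram))
    if df == 0:
        raise ValueError("n-gram does not occur in any document")
    return math.log2(len(documents) / df)


# Hypothetical four-document corpus, tokenized on whitespace.
docs = [d.split() for d in
        ["new york city", "new york times", "new car", "old york"]]
phrase_weight = idf(("new", "york"), docs)   # phrase occurs in 2 of 4 documents
word_weight = idf(("new",), docs)            # word occurs in 3 of 4 documents
```

Under plain IDF a phrase can never be more frequent than its component words, so it always receives at least as high a weight; comparing weights across different N is the issue the abstract says N-gram IDF is designed to handle.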
Query Suggestion and Data Fusion in Contextual Disambiguation BIBAFull-Text 971-980
  Milad Shokouhi; Marc Sloan; Paul N. Bennett; Kevyn Collins-Thompson; Siranush Sarkizova
Queries issued to a search engine are often under-specified or ambiguous. The user's search context or background may provide information that disambiguates their information need in order to automatically predict and issue a more effective query. The disambiguation can take place at different stages of the retrieval process. For instance, contextual query suggestions may be computed and recommended to users on the result page when appropriate, an approach that does not require modifying the original query's results. Alternatively, the search engine can attempt to provide efficient access to new relevant documents by injecting these documents directly into search results based on the user's context.
   In this paper, we explore these complementary approaches and how they might be combined. We first develop a general framework for mining context-sensitive query reformulations for query suggestion. We evaluate our context-sensitive suggestions against a state-of-the-art baseline using a click-based metric. The resulting query suggestions generated by our approach outperform the baseline by 13% overall and by 16% on an ambiguous query subset.
   While the query suggestions generated by our approach have higher quality than the existing baselines, we demonstrate that using them naively for injecting new documents into search results can lead to inferior rankings. To remedy this issue, we develop a classifier that decides when to inject new search results using features based on suggestion quality and user context. We show that our context-sensitive result fusion approach (Corfu) improves retrieval quality for ambiguous queries by up to 2.92%. Our approaches can efficiently scale to massive search logs, enabling a data-driven strategy that benefits from observing how users issue and reformulate queries in different contexts.
Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment BIBAFull-Text 981-991
  Anshumali Shrivastava; Ping Li
Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minhash towards smaller sets, making the final "collision probability" monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available high-dimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.
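The bias towards smaller sets mentioned above can be seen directly on exact set resemblance (the quantity that minwise hashing estimates), without any hashing. The sketch below is an illustration under stated assumptions: the padding transformation (filling every indexed set up to a common size M with dummy elements while leaving the query unpadded) is one asymmetric transformation consistent with the abstract's description, and the names `resemblance` and `pad` are hypothetical.

```python
def resemblance(a, b):
    """Set resemblance (Jaccard similarity): |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def pad(s, target_size):
    """Pad a set with dummy elements up to target_size (assumed transformation).

    Dummy elements are tuples, so they cannot collide with the integer items
    used in this example.
    """
    s = set(s)
    return s | {("pad", i) for i in range(target_size - len(s))}


query = {1, 2, 3, 4}
small = {1, 2}                       # overlap 2 with the query
large = {1, 2, 3, 5, 6, 7, 8, 9}     # overlap 3 with the query
M = 8                                # size of the largest indexed set

# Plain resemblance prefers the small set despite its smaller overlap.
plain = (resemblance(small, query), resemblance(large, query))

# After padding the indexed sets only, the ranking follows the overlap.
padded = (resemblance(pad(small, M), query), resemblance(pad(large, M), query))
```

With all indexed sets padded to the same size, the resemblance to a fixed query depends only on the overlap, which is the monotonicity property the abstract refers to.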
Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning BIBAFull-Text 992-1002
  Edwin D. Simpson; Matteo Venanzi; Steven Reece; Pushmeet Kohli; John Guiver; Stephen J. Roberts; Nicholas R. Jennings
Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, there is a wide variation in the language used by different authors in different contexts on the web. This diversity in language makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment, or topic, of free-text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation using two challenging crowdsourcing datasets shows that by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can provide up to 25% improved classifications when only a small portion (less than 4%) of documents has been labelled. Compared to the six state-of-the-art methods, we reduce by up to 67% the number of crowd responses required to achieve comparable accuracy. Our method was a joint winner of the CrowdFlower -- CrowdScale 2013 Shared Task challenge at the conference on Human Computation and Crowdsourcing (HCOMP 2013).
HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web BIBAFull-Text 1003-1013
  Philipp Singer; Denis Helic; Andreas Hotho; Markus Strohmaier
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful for, e.g., improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaptation of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.
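The idea of comparing hypotheses through Dirichlet priors on a Markov chain can be sketched with the standard Dirichlet-multinomial marginal likelihood, applied to each row of the transition count matrix. This is a minimal illustration, not the HypTrails implementation: the prior pseudo-counts below are picked by hand rather than elicited with the trial roulette method, and the function name and toy counts are assumptions.

```python
from math import lgamma


def log_evidence(counts, prior):
    """Log marginal likelihood of first-order Markov chain transition counts.

    counts[i][j]: observed transitions from state i to state j.
    prior[i][j]:  Dirichlet pseudo-counts (> 0) encoding a hypothesis.
    Each row is an independent Dirichlet-multinomial term.
    """
    total = 0.0
    for n_row, a_row in zip(counts, prior):
        total += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(n_row))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_row, n_row))
    return total


# Toy trails over two states that always alternate (hypothetical data).
counts = [[0, 9],
          [9, 0]]

# Two hand-picked hypotheses expressed as pseudo-counts.
h_alternate = [[1, 10], [10, 1]]   # belief: users switch state
h_stay = [[10, 1], [1, 10]]        # belief: users stay in the same state

# Log Bayes factor; a positive value favors the "alternate" hypothesis.
log_bayes_factor = log_evidence(counts, h_alternate) - log_evidence(counts, h_stay)
```

Because the data enter both evidences identically, the difference is driven entirely by how well each prior anticipates the observed transitions, which is the sensitivity to the prior that the abstract describes.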
Exploiting Collective Hidden Structures in Webpage Titles for Open Domain Entity Extraction BIBAFull-Text 1014-1024
  Wei Song; Shiqi Zhao; Chao Zhang; Hua Wu; Haifeng Wang; Lizhen Liu; Hanshi Wang
We present a novel method for open domain named entity extraction by exploiting the collective hidden structures in webpage titles. Our method uncovers the hidden textual structures shared by sets of webpage titles based on generalized URL patterns and a multiple sequence alignment technique. The highlights of our method include: 1) The boundaries of entities can be identified automatically in a collective way without any manually designed pattern, seed or class name. 2) The connections between entities are also discovered naturally based on the hidden structures, which makes it easy to incorporate distant or weak supervision. The experiments show that our method can harvest open domain entities at large scale with high precision. A large proportion of the extracted entities are long-tailed and complex and cover diverse topics. Given the extracted entities and their connections, we further show the effectiveness of our method in a weakly supervised setting. Our method can produce better domain specific entities in both precision and recall compared with the state-of-the-art approaches.
ROCKER: A Refinement Operator for Key Discovery BIBAFull-Text 1025-1033
  Tommaso Soru; Edgard Marx; Axel-Cyrille Ngonga Ngomo
The Linked Data principles provide a decentralized approach for publishing structured data in the RDF format on the Web. In contrast to structured data published in relational databases where a key is often provided explicitly, finding a set of properties that allows identifying a resource uniquely is a non-trivial task. Still, finding keys is of central importance for manifold applications such as resource deduplication, link discovery, logical data compression and data integration. In this paper, we address this research gap by specifying a refinement operator, dubbed ROCKER, which we prove to be finite, proper and non-redundant. We combine the theoretical characteristics of this operator with two monotonicities of keys to obtain a time-efficient approach for detecting keys, i.e., sets of properties that describe resources uniquely. We then utilize a hash index to compute the discriminability score efficiently. Therewith, we ensure that our approach can scale to very large knowledge bases. Results show that ROCKER yields more accurate results, has a comparable runtime, and consumes less memory w.r.t. existing state-of-the-art techniques.
Random Walk TripleRush: Asynchronous Graph Querying and Sampling BIBAFull-Text 1034-1044
  Philip Stutz; Bibek Paudel; Mihaela Verman; Abraham Bernstein
Most Semantic Web applications rely on querying graphs, typically by using SPARQL with a triple store. Increasingly, applications also analyze properties of the graph structure to compute statistical inferences. The current Semantic Web infrastructure, however, does not efficiently support such operations. This forces developers to extract the relevant data for external statistical post-processing. In this paper we propose to rethink query execution in a triple store as a highly parallelized asynchronous graph exploration on an active index data structure. This approach also allows integrating SPARQL querying with the sampling of graph properties.
   To evaluate this architecture we implemented Random Walk TripleRush, which is built on a distributed graph processing system. Our evaluations show that this architecture enables both competitive graph querying and the execution of various types of random walks with restarts that sample interesting graph properties. Thanks to the asynchronous architecture, first results are sometimes returned in a fraction of the full execution time. We also evaluate the scalability and show that the architecture supports fast query times on a dataset with more than a billion triples.
Open Domain Question Answering via Semantic Enrichment BIBAFull-Text 1045-1055
  Huan Sun; Hao Ma; Wen-tau Yih; Chen-Tse Tsai; Jingjing Liu; Ming-Wei Chang
Most recent question answering (QA) systems query large-scale knowledge bases (KBs) to answer a question, after parsing and transforming natural language questions to KBs-executable forms (e.g., logical forms). As a well-known fact, KBs are far from complete, so that information required to answer questions may not always exist in KBs. In this paper, we develop a new QA system that mines answers directly from the Web, and meanwhile employs KBs as a significant auxiliary to further boost the QA performance. Specifically, to the best of our knowledge, we make the first attempt to link answer candidates to entities in Freebase, during answer candidate generation. Several remarkable advantages follow: (1) Redundancy among answer candidates is automatically reduced. (2) The types of an answer candidate can be effortlessly determined by those of its corresponding entity in Freebase. (3) Capitalizing on the rich information about entities in Freebase, we can develop semantic features for each answer candidate after linking them to Freebase. Particularly, we construct answer-type related features with two novel probabilistic models, which directly evaluate the appropriateness of an answer candidate's types under a given question. Overall, such semantic features turn out to play significant roles in determining the true answers from the large answer candidate pool. The experimental results show that across two testing datasets, our QA system achieves an 18%~54% improvement under F_1 metric, compared with various existing QA systems.
All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement BIBAFull-Text 1056-1066
  Chenhao Tan; Lillian Lee
Although analyzing user behavior within individual communities is an active and rich research domain, people usually interact with multiple communities both on- and off-line. How do users act in such multi-community environments? Although there are a host of intriguing aspects to this question, it has received much less attention in the research community in comparison to the intra-community case. In this paper, we examine three aspects of multi-community engagement: the sequence of communities that users post to, the language that users employ in those communities, and the feedback that users receive, using longitudinal posting behavior on Reddit as our main data source, and DBLP for auxiliary experiments. We also demonstrate the effectiveness of features drawn from these aspects in predicting users' future level of activity. One might expect that a user's trajectory mimics the "settling-down" process in real life: an initial exploration of sub-communities before settling down into a few niches. However, we find that the users in our data continually post in new communities; moreover, as time goes on, they post increasingly evenly among a more diverse set of smaller communities. Interestingly, it seems that users that eventually leave the community are "destined" to do so from the very beginning, in the sense of showing significantly different "wandering" patterns very early on in their trajectories; this finding has potentially important design implications for community maintainers. Our multi-community perspective also allows us to investigate the "situation vs. personality" debate from language usage across different communities.
LINE: Large-scale Information Network Embedding BIBAFull-Text 1067-1077
  Jian Tang; Meng Qu; Mingzhe Wang; Ming Zhang; Jun Yan; Qiaozhu Mei
This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale for real world information networks which usually contain millions of nodes. In this paper, we propose a novel network embedding method called the "LINE," which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments demonstrate the effectiveness of the LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient: it is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of the LINE is available online: https://github.com/tangjianpku/LINE.
Leveraging Pattern Semantics for Extracting Entities in Enterprises BIBAFull-Text 1078-1088
  Fangbo Tao; Bo Zhao; Ariel Fuxman; Yang Li; Jiawei Han
Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging in enterprise domains for several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE ineffective, since using only high-precision/low-recall patterns like those systems would miss the majority of sparse enterprise entities, while using more low-precision patterns in a sparse setting introduces substantial noise. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), such that public signals from the web cannot be directly applied to entities. Moreover, many internal entities never appear on the web. Sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking an enterprise corpus and a limited set of seeds as input and generating a high-quality entity collection as output. We introduce the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.
Density-friendly Graph Decomposition BIBAFull-Text 1089-1099
  Nikolaj Tatti; Aristides Gionis
Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows analyzing a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph.
   Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study.
   We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G = (V, E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|^2|E|) but is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation; however, as demonstrated by our experimental evaluation, in practice k-cores have a different structure than locally-dense subgraphs, and as predicted by the theory, k-cores are not always well-aligned with graph density.
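For readers unfamiliar with the k-core decomposition that this abstract takes as its starting point, the following minimal sketch computes core numbers by repeatedly peeling a minimum-degree vertex. It illustrates the classic k-core procedure only, not the locally-dense decomposition proposed in the paper; it also uses a simple quadratic-time minimum search instead of the bucket structure needed for linear time. The function name and the toy graph are assumptions.

```python
def core_numbers(adj):
    """Core number of every vertex, via minimum-degree peeling.

    adj maps each vertex to the set of its neighbours (undirected graph,
    so the mapping must be symmetric).
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])      # core values never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Toy graph: a triangle a-b-c with a pendant vertex d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cores = core_numbers(graph)
```

The vertices with core number at least k induce the k-core, so the nested chain of cores can be read directly from the returned dictionary.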
Crowd Fraud Detection in Internet Advertising BIBAFull-Text 1100-1110
  Tian Tian; Jun Zhu; Fen Xia; Xin Zhuang; Tong Zhang
The rise of crowdsourcing brings new types of malpractices in Internet advertising. One can easily hire web workers through malicious crowdsourcing platforms to attack other advertisers. Such human generated crowd frauds are hard to detect by conventional fraud detection methods. In this paper, we carefully examine the characteristics of the group behaviors of crowd fraud and identify three persistent patterns, which are moderateness, synchronicity and dispersivity. Then we propose an effective crowd fraud detection method for search engine advertising based on these patterns, which consists of a constructing stage, a clustering stage and a filtering stage. At the constructing stage, we remove irrelevant data and reorganize the click logs into a surfer-advertiser inverted list; at the clustering stage, we define the sync-similarity between surfers' click histories and transform the coalition detection to a clustering problem, solved by a nonparametric algorithm; and finally we build a dispersity filter to remove false alarm clusters. The nonparametric nature of our method ensures that we can find an unbounded number of coalitions with nearly no human interaction. We also provide a parallel solution to make the method scalable to Web data and conduct extensive experiments. The empirical results demonstrate that our method is accurate and scalable.
Provably Fast Inference of Latent Features from Networks: with Applications to Learning Social Circles and Multilabel Classification BIBAFull-Text 1111-1121
  Charalampos Tsourakakis
A well known phenomenon in social networks is homophily, the tendency of agents to connect with similar agents. A derivative of this phenomenon is the emergence of communities. Another phenomenon observed in numerous networks is the existence of certain agents that belong simultaneously to multiple communities. An understanding of these phenomena constitutes a central research topic of network science.
   In this work we focus on a fundamental theoretical question related to the above phenomena with various applications: given an undirected graph G, can we infer efficiently the latent vertex features which explain the observed network structure under the assumption of a generative model that exhibits homophily? We propose a probabilistic generative model with the property that the probability of an edge among two vertices is a non-decreasing function of the common features they possess. This property is true for many real-world networks and surprisingly is ignored by many popular overlapping community detection methods as it was shown recently by the empirical work of Yang and Leskovec [44]. Our main theoretical contribution is the first provably rapidly mixing Markov chain for inferring latent features. On the experimental side, we verify the efficiency of our method in terms of run times, where we observe that it significantly outperforms state-of-the-art methods. Our method is more than 2,400 times faster than a state-of-the-art machine learning method [37] and typically provides non-trivial speedups compared to BigClam [43]. Furthermore, we verify on real-data with ground-truth available that our method learns efficiently high quality labelings. We use our method to learn social circles from Twitter ego-networks and perform multilabel classification.
The K-clique Densest Subgraph Problem BIBAFull-Text 1122-1132
  Charalampos Tsourakakis
Numerous graph mining applications rely on detecting subgraphs which are large near-cliques. Since formulations that are geared towards finding large near-cliques are hard and frequently inapproximable due to connections with the Maximum Clique problem, the poly-time solvable densest subgraph problem which maximizes the average degree over all possible subgraphs "lies at the core of large scale data mining" [10]. However, frequently the densest subgraph problem fails in detecting large near-cliques in networks.
   In this work, we introduce the k-clique densest subgraph problem, k ≥ 2. This generalizes the well studied densest subgraph problem which is obtained as a special case for k=2. For k=3 we obtain a novel formulation which we refer to as the triangle densest subgraph problem: given a graph G(V,E), find a subset of vertices S* such that τ(S*) = max_{S ⊆ V} t(S)/|S|, where t(S) is the number of triangles induced by the set S.
   On the theory side, we prove that for any constant k, there exists an exact polynomial time algorithm for the k-clique densest subgraph problem. Furthermore, we propose an efficient 1/k-approximation algorithm which generalizes the greedy peeling algorithm of Asahiro and Charikar [8,18] for k=2. Finally, we show how to efficiently implement this peeling framework on MapReduce for any k ≥ 3, generalizing the work of Bahmani, Kumar and Vassilvitskii for the case k=2 [10]. On the empirical side, our two main findings are that (i) the triangle densest subgraph is consistently closer to being a large near-clique compared to the densest subgraph and (ii) the peeling approximation algorithms for both k=2 and k=3 achieve on real-world networks approximation ratios closer to 1 rather than the pessimistic 1/k guarantee. An interesting consequence of our work is that triangle counting, a well-studied computational problem in the context of social network analysis, can be used to detect large near-cliques. Finally, we evaluate our proposed method on a popular graph mining application.
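The greedy peeling algorithm for the k=2 case mentioned above can be sketched as follows: repeatedly remove a minimum-degree vertex and remember the intermediate vertex set with the highest edge density |E(S)|/|S|. This is a minimal illustration of the classic edge-based variant only, not the k-clique generalization from the paper (which would peel by k-clique participation counts instead of degrees); the function name and the toy graph are assumptions.

```python
def greedy_densest_subgraph(adj):
    """Greedy peeling for the densest subgraph (edge density |E(S)| / |S|).

    adj maps each vertex to the set of its neighbours (undirected, symmetric).
    Returns (vertex set, density) of the best intermediate subgraph.
    """
    if not adj:
        return set(), 0.0
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    edges = sum(degree.values()) // 2
    best_density, best_size = -1.0, 0
    removal_order = []
    while remaining:
        density = edges / len(remaining)
        if density > best_density:
            best_density, best_size = density, len(remaining)
        # Peel a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        remaining.remove(v)
        removal_order.append(v)
        edges -= degree[v]
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    # The best set is what was left after the first (n - best_size) removals.
    removed = set(removal_order[:len(adj) - best_size])
    return set(adj) - removed, best_density


# Toy graph: a 4-clique on {1, 2, 3, 4} with a path 4-5-6 hanging off it.
graph = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4, 6},
    6: {5},
}
subgraph, density = greedy_densest_subgraph(graph)
```

On this toy graph the peeling removes the path vertices first and stops improving once only the clique is left.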
GERBIL: General Entity Annotator Benchmarking Framework BIBAFull-Text 1133-1143
  Ricardo Usbeck; Michael Röder; Axel-Cyrille Ngonga Ngomo; Ciro Baron; Andreas Both; Martin Brümmer; Diego Ceccarelli; Marco Cornolti; Didier Cherix; Bernd Eickmann; Paolo Ferragina; Christiane Lemke; Andrea Moro; Roberto Navigli; Francesco Piccinno; Giuseppe Rizzo; Harald Sack; René Speck; Raphaël Troncy; Jörg Waitelonis; Lars Wesemann
We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allows deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results.
An Optimization Framework for Weighting Implicit Relevance Labels for Personalized Web Search BIBAFull-Text 1144-1154
  Yury Ustinovskiy; Gleb Gusev; Pavel Serdyukov
Implicit feedback from users of a web search engine is an essential source providing consistent personal relevance labels from the actual population of users. However, previous studies on personalized search employ this source in a rather straightforward manner. Basically, documents that were clicked on get maximal gain, and the rest of the documents are assigned the zero gain. As we demonstrate in our paper, a ranking algorithm trained using these gains directly as the ground truth relevance labels leads to a suboptimal personalized ranking.
   In this paper we develop a framework for automatic reweighting of these labels. Our approach is based on more subtle aspects of user interaction with the result page. We propose an efficient methodology for deriving confidence levels for relevance labels that relies directly on the objective ranking measure. All our algorithms are evaluated on a large-scale query log provided by a major commercial search engine. The results of the experiments prove that the current state-of-the-art personalization approaches could be significantly improved by enriching relevance grades with weights extracted from post-impression user behavior.
A First Look at Tribal Web Traffic BIBAFull-Text 1155-1165
  Morgan Vigil; Matthew Rantanen; Elizabeth Belding
With broadband penetration rates of less than 10% per capita, Tribal areas in the U.S. represent some of the most underserved communities in terms of Internet access. Although numerous sources have identified this digital divide, there have been no empirical measurements of the performance and usage of services that do exist in these areas. In this paper, we present the characterization of the Tribal Digital Village (TDV) network, a multi-hop wireless network currently connecting 13 reservations in San Diego county. This work represents the first traffic analysis of broadband usage in Tribal lands. After identifying some of the unique purposes of broadband connectivity in indigenous communities, such as language revitalization and cultural development, we focus on the performance of popular applications that enable such activities, including YouTube and Instagram. Though only a fraction of the bandwidth capacity is actually used, 30% of YouTube uploads and 24% of Instagram uploads fail due to packet loss on the relay and access links that connect the reservations to the TDV backbone. Although failure rates are prohibitive to the contribution of locally generated media (particularly videos), our analysis of Instagram media interactions and engagement in the TDV network reveals a high locality of interest. Residents engage with locally created media 8.2 times more than media created by outside sources. Furthermore, locally created media circulates through the network two days longer than non-local media. The results of our analysis point to new directions for increasing content availability on reservations.
A Weighted Correlation Index for Rankings with Ties BIBAFull-Text 1166-1176
  Sebastiano Vigna
Understanding the correlation between two different scores for the same set of items is a common problem in graph analysis and information retrieval. The most commonly used statistic that quantifies this correlation is Kendall's tau; however, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular when comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in social networks and web graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an O(n log n) algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data on social networks and web graphs.
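   As a rough illustration of the idea in this abstract, the sketch below computes a Kendall-style correlation in which every pair of items carries a weight and tied pairs contribute nothing to the numerator or to the normalization. It is a naive O(n^2) loop, not the O(n log n) algorithm the abstract mentions, and the function names and the rank-based weight (`hyperbolic_weight`) are assumptions chosen for the example rather than the paper's definitions.

```python
from itertools import combinations
from math import sqrt


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def weighted_tau(r, s, weight):
    """Naive O(n^2) weighted Kendall-style correlation that tolerates ties.

    r, s   : score lists of equal length
    weight : function (i, j) -> non-negative weight of the pair (assumed)
    """
    num = norm_r = norm_s = 0.0
    for i, j in combinations(range(len(r)), 2):
        w = weight(i, j)
        sr = sign(r[i] - r[j])  # 0 when the pair is tied in r
        ss = sign(s[i] - s[j])  # 0 when the pair is tied in s
        num += sr * ss * w
        norm_r += sr * sr * w
        norm_s += ss * ss * w
    if norm_r == 0 or norm_s == 0:
        raise ValueError("correlation is undefined for a constant score vector")
    return num / sqrt(norm_r * norm_s)


def hyperbolic_weight(ranks):
    """Hypothetical weight: pairs involving top-ranked items count more.

    ranks[i] is the 0-based rank of item i (0 = most important).
    """
    def w(i, j):
        return 1.0 / (ranks[i] + 1) + 1.0 / (ranks[j] + 1)
    return w


def uniform_weight(i, j):
    """Every pair counts equally."""
    return 1.0
```

   With `uniform_weight` and no ties this reduces to the ordinary Kendall's tau; passing `hyperbolic_weight(ranks)` makes disagreements near the top of the ranking cost more than disagreements near the bottom.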
Gathering Additional Feedback on Search Results by Multi-Armed Bandits with Respect to Production Ranking BIBAFull-Text 1177-1187
  Aleksandr Vorobev; Damien Lefortier; Gleb Gusev; Pavel Serdyukov
Given a repeatedly issued query and a document with a not-yet-confirmed potential to satisfy the users' needs, a search system should place this document on a high position in order to gather user feedback and obtain a more confident estimate of the document utility. On the other hand, the main objective of the search system is to maximize expected user satisfaction over a rather long period, which requires showing more relevant documents on average. The state-of-the-art approaches to solving this exploration-exploitation dilemma rely on strongly simplified settings, making these approaches infeasible in practice. We improve the most flexible and pragmatic of them to handle some actual practical issues. The first is utilizing prior information about queries and documents; the second is combining bandit-based learning approaches with a default production ranking algorithm. We show experimentally that our framework makes it possible to significantly improve the ranking of a leading commercial search engine.
The E-Commerce Market for "Lemons": Identification and Analysis of Websites Selling Counterfeit Goods BIBAFull-Text 1188-1197
  John Wadleigh; Jake Drew; Tyler Moore
We investigate the practice of websites selling counterfeit goods. We inspect web search results for 225 queries across 25 brands. We devise a binary classifier that predicts whether a given website is selling counterfeits by examining automatically extracted features such as WHOIS information, pricing and website content. We then apply the classifier to results collected between January and August 2014. We find that, overall, 32% of search results point to websites selling fakes. For 'complicit' search terms, such as "replica Rolex", 39% of the search results point to fakes, compared to 20% for 'innocent' terms, such as "hermes buy online". Using a linear regression, we find that brands with a higher street price for fakes have a higher incidence of counterfeits in search results, but that brands that take active countermeasures such as filing DMCA requests experience a lower incidence of counterfeits in search results. Finally, we study how the incidence of counterfeits evolves over time, finding that the fraction of search results pointing to fakes remains remarkably stable.
Concept Expansion Using Web Tables BIBAFull-Text 1198-1208
  Chi Wang; Kaushik Chakrabarti; Yeye He; Kris Ganjam; Zhimin Chen; Philip A. Bernstein
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the "exclusive" tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.
User Latent Preference Model for Better Downside Management in Recommender Systems BIBAFull-Text 1209-1219
  Jian Wang; David Hardtke
Downside management is an important topic in the field of recommender systems. User satisfaction increases when good items are recommended, but satisfaction drops significantly when bad recommendations are pushed to users. For example, a parent would be disappointed if violent movies are recommended to their kids and may stop using the recommendation system entirely. A vegetarian would find steak-house recommendations useless. A CEO in a mid-sized company would feel offended by receiving intern-level job recommendations. Under circumstances where there is a penalty for a bad recommendation, a bad recommendation is worse than no recommendation at all. While most existing work focuses on upside management (recommending the best items to users), this paper emphasizes achieving better downside management (reducing the recommendation of irrelevant or offensive items to users). The approach we propose is general and can be applied to any scenario or domain where downside management is key to the system.
   To tackle the problem, we design a user latent preference model to predict the user preference in a specific dimension, say, the dietary restrictions of the user, the acceptable level of adult content in a movie, or the geographical preference of a job seeker. We propose to use multinomial regression as the core model and extend it with a hierarchical Bayesian framework to address the problem of data sparsity. After the user latent preference is predicted, we leverage it to filter out downside items. We validate the soundness of our approach by evaluating it with an anonymous job application dataset on LinkedIn. The effectiveness of the latent preference model was demonstrated in both offline experiments and online A/B tests. The user latent preference model helps to improve the VPI (views per impression) and API (applications per impression) significantly, which in turn yields higher user satisfaction.
The Role of Data Cap in Optimal Two-part Network Pricing BIBAFull-Text 1220-1230
  Xin Wang; Richard T. B. Ma; Yinlong Xu
Internet services are traditionally priced at flat rates; however, many Internet service providers (ISPs) have recently shifted towards two-part tariffs where a data cap is imposed to restrain data demand from heavy users and usage over the data cap is charged based on a per-unit fee. Although the two-part tariff could generally increase the revenue for ISPs and has been supported by the FCC chairman, the role of data cap and its revenue-optimal and welfare-optimal pricing structures are not well understood. In this paper, we study the impact of data cap on the optimal two-part pricing schemes for congestion-prone service markets, e.g., broadband or cloud services. We model users' demand and preferences over pricing and congestion alternatives and derive the market share and congestion of service providers under a market equilibrium. Based on the equilibrium model, we characterize the two-part structures of the revenue-optimal and welfare-optimal pricing schemes. Our results reveal that 1) the data cap provides a mechanism for ISPs to transition from flat-rate to pay-as-you-go type of schemes, 2) with growing data demand and network capacity, the revenue-optimal pricing moves towards usage-based schemes with diminishing data caps, and 3) the structure of the welfare-optimal tariff comprises lower fees and data cap than those of the revenue-optimal counterpart, suggesting that regulators might want to promote usage-based pricing but regulate the per-unit fees. Our results could help providers design revenue-optimal pricing schemes and guide regulatory authorities to legislate desirable regulations.
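   The two-part tariff described in this abstract can be written down in a few lines. The sketch below is only an illustration of the billing rule (a fixed fee plus a per-unit fee on usage above the cap); the function and parameter names are assumptions, and the equilibrium and optimality analysis in the paper is not reproduced here.

```python
def two_part_charge(usage, flat_fee, data_cap, unit_fee):
    """Charge under a two-part tariff.

    usage    : data consumed in the billing period
    flat_fee : fixed fee that covers usage up to the cap
    data_cap : amount of data included in the flat fee
    unit_fee : price per unit of data above the cap
    (all names are illustrative, not taken from the paper)
    """
    overage = max(0.0, usage - data_cap)
    return flat_fee + unit_fee * overage


# The cap interpolates between the two classic schemes:
# an unlimited cap gives flat-rate pricing ...
flat_rate = two_part_charge(120.0, 30.0, float("inf"), 0.5)
# ... while a zero cap with no fixed fee gives pay-as-you-go pricing.
pay_as_you_go = two_part_charge(120.0, 0.0, 0.0, 0.5)
```

   Varying `data_cap` between these two extremes is one way to read the abstract's first finding, that the cap gives ISPs a mechanism to move from flat-rate towards pay-as-you-go schemes.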
Tweeting Cameras for Event Detection BIBAFull-Text 1231-1241
  Yuhui Wang; Mohan S. Kankanhalli
We are living in a world of big sensor data. Due to the widespread prevalence of visual sensors (e.g. surveillance cameras) and social sensors (e.g. Twitter feeds), many events are implicitly captured in real-time by such heterogeneous "sensors". Combining these two complementary sensor streams can significantly improve the task of event detection and aid in comprehending evolving situations. However, the different characteristics of these social and sensor data make such information fusion for event detection a challenging problem. To tackle this problem, we propose an innovative multi-layer tweeting camera framework integrating both physical sensors and social sensors to detect various concepts of real-world events. In this framework, visual concept detectors are applied on camera video frames and these concepts can be construed as "camera tweets" posted regularly. These tweets are represented by a unified probabilistic spatio-temporal (PST) data structure which is then aggregated to a concept-based image (Cmage) as the common representation for visualization. To facilitate event analysis, we define a set of operators and analytic functions that can be applied on the PST data by the user to discover occurrences of events and to analyse evolving situations. We further leverage geo-located social media data by mining current topics discussed on Twitter to obtain the high-level semantic meaning of detected events in images. We quantitatively evaluate our framework with a large-scale dataset containing images from 150 New York real-time traffic CCTV cameras, university foodcourt camera feeds and Twitter data, which demonstrates the feasibility and effectiveness of the proposed framework. Results of combining camera tweets and social tweets are shown to be promising for detecting real-world events.
Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia BIBAFull-Text 1242-1252
  Robert West; Ashwin Paranjape; Jure Leskovec
Hyperlinks are an essential feature of the World Wide Web. They are especially important for online encyclopedias such as Wikipedia: an article can often only be understood in the context of related articles, and hyperlinks make it easy to explore this context. But important links are often missing, and several methods have been proposed to alleviate this problem by learning a linking model based on the structure of the existing links. Here we propose a novel approach to identifying missing links in Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia's navigability. We leverage data sets of navigation paths collected through a Wikipedia-based human-computation game in which users must find a short path from a start to a target article by only clicking links encountered along the way. We harness human navigational traces to identify a set of candidates for missing links and then rank these candidates. Experiments show that our procedure identifies missing links of high quality.
Semantic Annotation of Mobility Data using Social Media BIBAFull-Text 1253-1263
  Fei Wu; Zhenhui Li; Wang-Chien Lee; Hongjian Wang; Zhuojie Huang
Recent developments in sensors, GPS and smart phones have provided us with a large amount of mobility data. At the same time, large-scale crowd-generated social media data, such as geo-tagged tweets, provide rich semantic information about locations and events. Combining the mobility data and surrounding social media data enables us to semantically understand why a person travels to a location at a particular time (e.g., attending a local event or visiting a point of interest). Previous research on mobility data mining has mainly focused on mining patterns using only the mobility data. In this paper, we study the problem of using social media to annotate mobility data. As social media data is often noisy, the key research problem lies in using the right model to retrieve only the relevant words with respect to a mobility record. We propose a frequency-based method, a Gaussian mixture model, and kernel density estimation (KDE) to tackle this problem. We show that KDE is the most suitable model as it captures the locality of word distribution very well. We test our proposal using a real dataset collected from Twitter and demonstrate the effectiveness of our techniques via both interesting case studies and a comprehensive evaluation.
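   To make the KDE idea concrete, the following sketch scores candidate words by the kernel density of their geo-tagged occurrences at the location of a mobility record. It is a minimal illustration under several assumptions that do not come from the abstract: coordinates are treated as planar, the kernel is an isotropic 2-D Gaussian with a single hand-picked bandwidth, and the names `kde_density` and `rank_words` are hypothetical.

```python
from math import exp, pi


def kde_density(point, samples, bandwidth):
    """2-D Gaussian kernel density estimate at `point`.

    point     : (x, y) location of the mobility record
    samples   : list of (x, y) locations where a word was posted
    bandwidth : kernel width h, in the same units as the coordinates
    """
    if not samples:
        return 0.0
    x, y = point
    h2 = bandwidth * bandwidth
    total = 0.0
    for sx, sy in samples:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        total += exp(-d2 / (2.0 * h2))
    # Normalize by the sample count and by the Gaussian constant 2*pi*h^2.
    return total / (len(samples) * 2.0 * pi * h2)


def rank_words(point, word_locations, bandwidth):
    """Order words by how concentrated their occurrences are around `point`."""
    scored = [
        (kde_density(point, locations, bandwidth), word)
        for word, locations in word_locations.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored]
```

   A word whose posts cluster tightly around the record's location receives a high density there, whereas a word that is frequent everywhere is spread thin; this is one plausible reading of why a density model captures the locality of word distributions better than raw frequency.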
Automatic Web Content Extraction by Combination of Learning and Grouping BIBAFull-Text 1264-1274
  Shanchan Wu; Jerry Liu; Jian Fan
Web pages consist of not only actual content, but also other elements such as branding banners, navigational elements, advertisements, copyright notices, etc. This noisy content is typically not related to the main subjects of the webpages. Identifying the part of actual content, or clipping web pages, has many applications, such as high quality web printing, e-reading on mobile devices and data mining. Although there are many existing methods attempting to address this task, most of them either work only on certain types of Web pages, e.g. article pages, or require different models for different websites. We formulate the actual content identifying problem as a DOM tree node selection problem. We develop multiple features by utilizing the DOM tree node properties to train a machine learning model. Then candidate nodes are selected based on the learning model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technique to further filter out noisy data and pick missing data for the candidate nodes. We conduct extensive experiments on a real dataset and demonstrate that our solution produces high-quality output and outperforms several baseline methods.
Executing Provenance-Enabled Queries over Web Data BIBAFull-Text 1275-1285
  Marcin Wylot; Philippe Cudre-Mauroux; Paul Groth
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Understanding Malvertising Through Ad-Injecting Browser Extensions BIBAFull-Text 1286-1295
  Xinyu Xing; Wei Meng; Byoungyoung Lee; Udi Weinsberg; Anmol Sheth; Roberto Perdisci; Wenke Lee
Malvertising is a malicious activity that leverages advertising to distribute various forms of malware. Because advertising is the key revenue generator for numerous Internet companies, large ad networks, such as Google, Yahoo and Microsoft, invest a lot of effort to mitigate malicious ads from their ad networks. This drives adversaries to look for alternative methods to deploy malvertising. In this paper, we show that browser extensions that use ads as their monetization strategy often facilitate the deployment of malvertising. Moreover, while some extensions simply serve ads from ad networks that support malvertising, other extensions maliciously alter the content of visited webpages to force users into installing malware. To measure the extent of these behaviors we developed Expector, a system that automatically inspects and identifies browser extensions that inject ads, and then classifies these ads as malicious or benign based on their landing pages. Using Expector, we automatically inspected over 18,000 Chrome browser extensions. We found 292 extensions that inject ads, and detected 56 extensions that participate in malvertising using 16 different ad networks and with a total user base of 602,417.
E-commerce Reputation Manipulation: The Emergence of Reputation-Escalation-as-a-Service BIBAFull-Text 1296-1306
  Haitao Xu; Daiping Liu; Haining Wang; Angelos Stavrou
In online markets, a store's reputation is closely tied to its profitability. Sellers' desire to quickly achieve high reputation has fueled a profitable underground business, which operates as a specialized crowdsourcing marketplace and accumulates wealth by allowing online sellers to harness human laborers to conduct fake transactions for improving their stores' reputations. We term such an underground market a seller-reputation-escalation (SRE) market. In this paper, we investigate the impact of the SRE service on reputation escalation by performing in-depth measurements of the prevalence of the SRE service, the business model and market size of SRE markets, and the characteristics of sellers and offered laborers. To this end, we have infiltrated five SRE markets and studied their operations using daily data collection over a continuous period of two months. We identified more than 11,000 online sellers posting at least 219,165 fake-purchase tasks on the five SRE markets. These transactions earned at least $46,438 in revenue for the five SRE markets, and the total value of merchandise involved exceeded $3,452,530. Our study demonstrates that online sellers using SRE service can increase their stores' reputations at least 10 times faster than legitimate ones while only 2.2% of them were detected and penalized. Even worse, we found a newly launched service that can, within a single day, boost a seller's reputation to a degree that would take a legitimate seller at least a year to achieve. Finally, armed with our analysis of the operational characteristics of the underground economy, we offer some insights into potential mitigation strategies.
Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation BIBAFull-Text 1307-1317
  Da Yan; James Cheng; Yi Lu; Wilfred Ng
Massive graphs, such as online social networks and communication networks, have become common today. To efficiently analyze such large graphs, many distributed graph computing systems have been developed. These systems employ the "think like a vertex" programming paradigm, where a program proceeds in iterations and at each iteration, vertices exchange messages with each other. However, using Pregel's simple message passing mechanism, some vertices may send/receive significantly more messages than others due to either the high degree of these vertices or the logic of the algorithm used. This forms the communication bottleneck and leads to imbalanced workload among machines in the cluster. In this paper, we propose two effective message reduction techniques: (1) vertex mirroring with message combining, and (2) an additional request-respond API. These techniques not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex. We theoretically analyze the effectiveness of our techniques, and implement them on top of our open-source Pregel implementation called Pregel+. Our experiments on various large real graphs demonstrate that our message reduction techniques significantly improve the performance of distributed graph computation.
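   Message combining, one ingredient of the first technique named in this abstract, can be sketched as follows: before messages leave a machine, all messages addressed to the same target vertex are merged with an associative, commutative operator, so at most one message per target crosses the network. This is a generic sender-side combiner in the Pregel style, offered only as an illustration; it does not implement vertex mirroring or the request-respond API, and the name `combine_outgoing` is an assumption, not part of Pregel+.

```python
def combine_outgoing(messages, combiner):
    """Merge messages bound for the same target vertex.

    messages : iterable of (target_vertex_id, value) pairs
    combiner : associative, commutative function of two values
               (e.g. min for shortest paths, addition for PageRank)
    Returns a dict mapping each target to its single combined value.
    """
    combined = {}
    for target, value in messages:
        if target in combined:
            combined[target] = combiner(combined[target], value)
        else:
            combined[target] = value
    return combined


# Three outgoing messages collapse into two, one per distinct target.
outgoing = [(1, 5), (2, 3), (1, 2)]
shortest = combine_outgoing(outgoing, min)
summed = combine_outgoing(outgoing, lambda a, b: a + b)
```

   Combining only works when the receiving vertex needs the aggregate rather than the individual messages, which is why the operator must be associative and commutative.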
Tackling the Achilles Heel of Social Networks: Influence Propagation based Language Model Smoothing BIBAFull-Text 1318-1328
  Rui Yan; Ian E. H. Yen; Cheng-Te Li; Shiqi Zhao; Xiaohua Hu
Online social networks now enjoy worldwide prosperity, as they have revolutionized the way people discover, share, and distribute information. With millions of registered users and the proliferation of user-generated content, social networks have become "giants", seemingly able to support any research task. However, the giants do have their Achilles Heel: extreme data sparsity. Compared with the massive data over the whole collection, individual posted documents (e.g., a microblog of fewer than 140 characters) seem too sparse to make a difference under various research scenarios, even though they are in fact different. In this paper we propose to tackle the Achilles Heel of social networks by smoothing the language model via influence propagation. We formulate a socialized factor graph model, which utilizes both the textual correlations between document pairs and the socialized augmentation networks behind the documents, such as user relationships and social interactions. These factors are modeled as attributes and dependencies among documents and their corresponding users. An efficient algorithm is designed to learn the proposed factor graph model. Finally we propagate term counts to smooth documents based on the estimated influence. Experimental results on Twitter and Weibo datasets validate the effectiveness of the proposed model. By leveraging the smoothed language model with social factors, our approach obtains significant improvement over several alternative methods on both intrinsic and extrinsic evaluations measured in terms of perplexity, nDCG and MAP results.
A Game Theoretic Model for the Formation of Navigable Small-World Networks BIBAFull-Text 1329-1339
  Zhi Yang; Wei Chen
Kleinberg proposed a family of small-world networks to explain the navigability of large-scale real-world social networks. However, the underlying mechanism that drives real networks to be navigable is not yet well understood. In this paper, we present a game theoretic model for the formation of navigable small-world networks. We model the network formation as a game in which people seek both high reciprocity and long-distance relationships. We show that the navigable small-world network is a Nash Equilibrium of the game. Moreover, we prove that the navigable small-world equilibrium tolerates collusions of any size and arbitrary deviations of a large random set of nodes, while non-navigable equilibria do not tolerate small group collusions or random perturbations. Our empirical evaluation further demonstrates that the system always converges to the navigable network even when limited or no information about other players' strategies is available. Our theoretical and empirical analyses provide important new insight into the connection between distance, reciprocity and navigability in social networks.
A Scalable Asynchronous Distributed Algorithm for Topic Modeling BIBAFull-Text 1340-1350
  Hsiang-Fu Yu; Cho-Jui Hsieh; Hyokun Yun; S. V. N. Vishwanathan; Inderjit S. Dhillon
Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both of these problems. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al. (2014). We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
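   The O(log T) sampling and update bounds quoted in this abstract can be illustrated with a textbook Fenwick (binary indexed) tree over unnormalized topic weights. The sketch below is that standard structure, not the modified tree used by F+Nomad LDA, and the class and method names are assumptions made for the example.

```python
import random


class FenwickSampler:
    """Fenwick tree over non-negative weights with O(log T) update and sampling."""

    def __init__(self, weights):
        if not weights:
            raise ValueError("at least one weight is required")
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based internal indexing
        for index, w in enumerate(weights):
            self.add(index, w)

    def add(self, index, delta):
        """Add `delta` to the weight of item `index` (0-based) in O(log T)."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        """Sum of all weights, computed in O(log T)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, u):
        """Return the smallest 0-based index whose cumulative weight exceeds u."""
        pos = 0
        step = 1 << (self.n.bit_length() - 1)  # largest power of two <= n
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)  # guard against rounding when u == total

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        return self.find(rng.random() * self.total())
```

   In a topic-model setting, one would call `add(topic, +1)` or `add(topic, -1)` when a token changes topic and `sample()` to draw the next topic, so both operations stay logarithmic in the number of topics T.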
LightLDA: Big Topic Models on Modest Computer Clusters BIBAFull-Text 1351-1361
  Jinhui Yuan; Fei Gao; Qirong Ho; Wei Dai; Jinliang Wei; Xun Zheng; Eric Po Xing; Tie-Yan Liu; Wei-Ying Ma
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
A Novelty-Seeking based Dining Recommender System BIBAFull-Text 1362-1372
  Fuzheng Zhang; Kai Zheng; Nicholas Jing Yuan; Xing Xie; Enhong Chen; Xiaofang Zhou
The rapid growth of location-based services provides the potential to understand people's mobility patterns at an unprecedented level, which can also enable the food-service industry to accurately predict consumers' dining behavior. In this paper, by leveraging users' historical dining patterns, socio-demographic characteristics and restaurants' attributes, we aim at generating the top-K restaurants for a user's next dining occasion. Compared to previous studies in location prediction which mainly focus on regular mobility patterns, we present a novelty-seeking based dining recommender system, termed NDRS, in consideration of both exploration and exploitation. First, we apply a Conditional Random Field (CRF) with additional constraints to infer users' novelty-seeking statuses by considering both spatial-temporal-historical features and users' socio-demographic characteristics. On the one hand, when a user is predicted to be novelty-seeking, by incorporating the influence of restaurants' contextual factors such as price and service quality, we propose a context-aware collaborative filtering method to recommend restaurants she has never visited before. On the other hand, when a user is predicted to be not novelty-seeking, we present a Hidden Markov Model (HMM) that considers temporal regularity to recommend previously visited restaurants. To evaluate the performance of each component as well as the whole system, we conduct extensive experiments with a large dataset we have collected, covering the relevant dining-related check-ins, users' demographics, and restaurants' attributes. The results reveal that our system is effective for dining recommendation.
Daily-Aware Personalized Recommendation based on Feature-Level Time Series Analysis BIBAFull-Text 1373-1383
  Yongfeng Zhang; Min Zhang; Yi Zhang; Guokun Lai; Yiqun Liu; Honghui Zhang; Shaoping Ma
Frequently changing user preferences and/or item profiles make the dynamic modeling of users and items essential in personalized recommender systems. However, due to the insufficiency of per user/item records when splitting the already sparse data across the time dimension, previous methods have to restrict the drifting purchasing patterns to pre-assumed distributions, and were hardly able to model them more directly with, for example, time series analysis. Integrating content information helps to alleviate the problem in practical systems, but the domain-dependent content knowledge is expensive to obtain due to the large amount of manual effort required.
   In this paper, we make use of the large volume of textual reviews for the automatic extraction of domain knowledge, namely, the explicit features/aspects in a specific product domain. We thus reduce the product-level modeling of user preferences, which suffers from the lack of data, to feature-level modeling, which not only grants us the ability to predict user preferences through direct time series analysis, but also allows us to know the essence under the surface of product-level changes in purchasing patterns. Besides, the expanded feature space also helps to make cold-start recommendations for users with few purchasing records.
   Technically, we develop the Fourier-assisted Auto-Regressive Integrated Moving Average (FARIMA) process to handle the year-long seasonal period of purchasing data and achieve daily-aware preference predictions, and we leverage the conditional opportunity models for daily-aware personalized recommendation. Extensive experimental results on real-world cosmetic purchasing data from a major e-commerce website (JD.com) in China verified both the effectiveness and efficiency of our approach.
Automatic Detection of Information Leakage Vulnerabilities in Browser Extensions BIBAFull-Text 1384-1394
  Rui Zhao; Chuan Yue; Qing Yi
A large number of extensions exist in browser vendors' online stores for millions of users to download and use. Many of those extensions process sensitive information from user inputs and webpages; however, it remains a big question whether those extensions may accidentally leak such sensitive information out of the browsers without protection. In this paper, we present a framework, LvDetector, that combines static and dynamic program analysis techniques for automatic detection of information leakage vulnerabilities in legitimate browser extensions. Extension developers can use LvDetector to locate and fix the vulnerabilities in their code; browser vendors can use LvDetector to decide whether the corresponding extensions can be hosted in their online stores; advanced users can also use LvDetector to determine if certain extensions are safe to use. The design of LvDetector is not bound to specific browsers or JavaScript engines, and can adopt other program analysis techniques. We implemented LvDetector and evaluated it on 28 popular Firefox and Google Chrome extensions. LvDetector identified 18 previously unknown information leakage vulnerabilities in 13 extensions with an 87% accuracy rate. The evaluation results and the feedback to our responsible disclosure demonstrate that LvDetector is useful and effective.
Enquiring Minds: Early Detection of Rumors in Social Media from Enquiry Posts BIBAFull-Text 1395-1405
  Zhe Zhao; Paul Resnick; Qiaozhu Mei
Many previous techniques identify trending topics in social media, even topics that are not pre-defined. We present a technique to identify trending rumors, which we define as topics that include disputed factual claims. Putting aside any attempt to assess whether the rumors are true or false, it is valuable to identify trending rumors as early as possible. It is extremely difficult to accurately classify whether every individual post is or is not making a disputed factual claim. We are able to identify trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim.
   The key insight is that when there is a rumor, even though most posts do not raise questions about it, there may be a few that do. If we can find signature text phrases that are used by a few people to express skepticism about factual claims and are rarely used to express anything else, we can use those as detectors for rumor clusters. Indeed, we have found a few phrases that seem to be used exactly that way, including: "Is this true?", "Really?", and "What?". Relatively few posts related to any particular rumor use any of these enquiry phrases, but lots of rumor diffusion processes have some posts that do and have them quite early in the diffusion.
   We have developed a technique based on searching for the enquiry phrases, clustering similar posts together, and then collecting related posts that do not contain these simple phrases. We then rank the clusters by their likelihood of really containing a disputed factual claim. The detector, which searches for the very rare but very informative phrases, combined with clustering and a classifier on the clusters, yields surprisingly good performance. On a typical day of Twitter, about a third of the top 50 clusters were judged to be rumors, a high enough precision that human analysts might be willing to sift through them.
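   The first two steps described above (searching for enquiry phrases, then clustering similar matching posts) can be sketched as follows. This is a minimal illustration under stated assumptions: the regular expression is built only from the three phrases quoted in the abstract, and the greedy Jaccard clustering stands in for whatever clustering method the authors use. The names `ENQUIRY_PATTERN`, `has_enquiry`, and `cluster_signal_posts`, and the similarity threshold, are hypothetical.

```python
import re

# Built from the phrases quoted in the abstract; the exact pattern is an assumption.
ENQUIRY_PATTERN = re.compile(r"\b(is this true|really|what)\s*\?", re.IGNORECASE)

def has_enquiry(post):
    """True if the post contains one of the enquiry phrases."""
    return ENQUIRY_PATTERN.search(post) is not None

def tokens(post):
    """Lowercased word set used for the similarity comparison."""
    return set(re.findall(r"[a-z0-9']+", post.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_signal_posts(posts, threshold=0.4):
    """Greedily group enquiry posts whose tokens overlap with a cluster's first post."""
    clusters = []
    for post in posts:
        if not has_enquiry(post):
            continue  # only posts with an enquiry phrase seed clusters
        for cluster in clusters:
            if jaccard(tokens(post), tokens(cluster[0])) >= threshold:
                cluster.append(post)
                break
        else:
            clusters.append([post])
    return clusters

posts = [
    "Is this true? Bridge collapsed downtown",
    "bridge collapsed downtown, really?",
    "Nice weather today",
    "What? celebrity X moved to Mars",
]
clusters = cluster_signal_posts(posts)
```

   The later steps in the abstract (collecting related posts without enquiry phrases and ranking clusters with a classifier) are not covered by this sketch.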
Improving User Topic Interest Profiles by Behavior Factorization BIBAFull-Text 1406-1416
  Zhe Zhao; Zhiyuan Cheng; Lichan Hong; Ed H. Chi
Many recommenders aim to provide relevant recommendations to users by building personal topic interest profiles and then using these profiles to find interesting content for the user. In social media, recommender systems build user profiles by directly combining users' topic interest signals from a wide variety of consumption and publishing behaviors, such as social media posts they authored, commented on, +1'd or liked. Here we propose to separately model users' topical interests that come from these various behavioral signals in order to construct better user profiles. Intuitively, since publishing a post requires more effort, the topic interests coming from publishing signals should reflect a user's central interests more accurately than, say, a simple gesture such as a +1. By separating a single user's interest profile into several behavioral profiles, we obtain better and cleaner topic interest signals, and we enable topic prediction for different types of behavior, such as topics that the user might +1 or comment on but might never write a post about.
   To do this at large scale in Google+, we employed matrix factorization techniques to model each user's behaviors as a separate example entry in the input user-by-topic matrix. Using this technique, which we call "behavioral factorization", we built a topic recommender that predicts users' topical interests from their actions within Google+. We showed experimentally that we obtain better and cleaner signals than baseline methods, predict topic interests more accurately, and achieve better coverage.
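   One way to read "each user's behaviors as a separate example entry" is that every (user, behavior type) pair gets its own row in the input matrix, with topics as columns. The sketch below builds such a matrix from raw events; it covers only the input construction, not the factorization step. The function name `build_behavior_matrix`, the event tuple layout, and the use of raw counts as cell values are assumptions for illustration.

```python
from collections import defaultdict

def build_behavior_matrix(events, topics):
    """Build a (user, behavior)-by-topic count matrix.

    events: iterable of (user, behavior, topic) tuples (assumed layout).
    topics: ordered list of topic names defining the columns.
    Returns (row_keys, matrix), where row_keys[i] labels matrix[i].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, behavior, topic in events:
        counts[(user, behavior)][topic] += 1
    row_keys = sorted(counts)
    matrix = [[counts[key].get(topic, 0) for topic in topics] for key in row_keys]
    return row_keys, matrix

events = [
    ("alice", "post", "cooking"),
    ("alice", "plusone", "sports"),
    ("alice", "plusone", "sports"),
    ("bob", "comment", "cooking"),
]
row_keys, matrix = build_behavior_matrix(events, ["cooking", "sports"])
```

   With this layout, a standard matrix factorization would learn a separate latent profile for each behavior of the same user, which matches the abstract's goal of predicting topics per behavior type.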
Predicting Pinterest: Automating a Distributed Human Computation BIBAFull-Text 1417-1426
  Changtao Zhong; Dmytro Karamshuk; Nishanth Sastry
Every day, millions of users save content items for future use on sites like Pinterest, by "pinning" them onto carefully categorised personal pinboards, thereby creating personal taxonomies of the Web. This paper seeks to understand Pinterest as a distributed human computation that categorises images from around the Web. We show that despite being categorised onto personal pinboards by individual actions, there is generally a global agreement in implicitly assigning images into a coarse-grained global taxonomy of 32 categories, and furthermore, users tend to specialise in a handful of categories. By exploiting these characteristics, and augmenting with image-related features drawn from a state-of-the-art deep convolutional neural network, we develop a cascade of predictors that together automate a large fraction of Pinterest actions. Our end-to-end model is able to both predict whether a user will repin an image onto her own pinboard, and also which pinboard she might choose, with an accuracy of 0.69 (Accuracy@5 of 0.75).
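   The abstract does not define Accuracy@5. Under the common reading, a prediction counts as correct when the true label appears among the top five ranked candidates. A minimal sketch under that assumption, with a hypothetical function name and toy pinboard labels:

```python
def accuracy_at_k(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label is within the top-k ranked candidates."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Toy example: candidate pinboards ranked by a hypothetical predictor.
ranked = [
    ["food", "travel", "diy"],
    ["art", "fashion", "food"],
    ["travel", "diy", "art"],
    ["fashion", "art", "travel"],
]
truth = ["food", "food", "pets", "art"]
top1 = accuracy_at_k(ranked, truth, 1)
top3 = accuracy_at_k(ranked, truth, 3)
```

   Accuracy@1 reduces to ordinary accuracy, so Accuracy@k can never be lower than plain accuracy for k of 1 or more.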