HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 05-2060708091011-111-212-112-213-113-214-114-215-115-2

Proceedings of the 2013 International Conference on the World Wide Web

Fullname:Proceedings of the 22nd International Conference on World Wide Web
Editors:Daniel Schwabe; Virgílio Almeida; Hartmut Glaser; Ricardo Baeza-Yates; Sue Moon
Location:Rio de Janeiro, Brazil
Dates:2013-May-13 to 2013-May-17
Volume:1
Publisher:ACM
Standard No:ISBN: 978-1-4503-2035-1; ACM DL: Table of Contents; hcibib: WWW13-1
Papers:137
Pages:1591
Links:Conference Website
  1. WWW 2013-05-13 Volume 1
    1. Research papers

WWW 2013-05-13 Volume 1

Research papers

Real-time recommendation of diverse related articles BIBAFull-Text 1-12
  Sofiane Abbar; Sihem Amer-Yahia; Piotr Indyk; Sepideh Mahabadi
News articles typically drive a lot of traffic in the form of comments posted by users on a news site. Such user-generated content tends to carry additional information such as entities and sentiment. In general, when articles are recommended to users, only popularity (e.g., most shared and most commented), recency, and sometimes (manual) editors' picks (based on daily hot topics), are considered. We formalize a novel recommendation problem where the goal is to find the closest most diverse articles to the one the user is currently browsing. Our diversity measure incorporates entities and sentiment extracted from comments. Given the real-time nature of our recommendations, we explore the applicability of nearest neighbor algorithms to solve the problem. Our user study on real opinion articles from aljazeera.net and reuters.com validates the use of entities and sentiment extracted from articles and their comments to achieve news diversity when compared to content-based diversity. Finally, our performance experiments show the real-time feasibility of our solution.
Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages BIBAFull-Text 13-24
  Rahul Agrawal; Archit Gupta; Yashoteja Prabhu; Manik Varma
Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focussed on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP and ranking based techniques. In this paper, we eschew this paradigm, and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task with each query being represented by a separate label.
   We develop Multi-label Random Forests to tackle problems with millions of labels. Our proposed classifier has prediction costs that are logarithmic in the number of labels and can make predictions in a few milliseconds using 10 Gb of RAM. We demonstrate that it is possible to generate training data for our classifier automatically from click logs without any human annotation or intervention. We train our classifier on tens of millions of labels, features and training points in less than two days on a thousand node cluster. We develop a sparse semi-supervised multi-label learning formulation to deal with training set biases and noisy labels harvested automatically from the click logs. This formulation is used to infer a belief in the state of each label for each training ad and the random forest classifier is extended to train on these beliefs rather than the given labels. Experiments reveal significant gains over ranking and NLP based techniques on a large test set of 5 million ads using multiple metrics.
Hierarchical geographical modeling of user locations from social media posts BIBAFull-Text 25-36
  Amr Ahmed; Liangjie Hong; Alexander J. Smola
With the availability of cheap location sensors, geotagging of messages in online social networks is proliferating. For instance, Twitter, Facebook, Foursquare, and Google+ provide these services both explicitly by letting users choose their location or implicitly via a sensor. This paper presents an integrated generative model of location and message content. That is, we provide a model for combining distributions over locations, topics, and over user characteristics, both in terms of location and in terms of their content preferences. Unlike previous work which modeled data in a flat pre-defined representation, our model automatically infers both the hierarchical structure over content and over the size and position of geographical locations. This affords significantly higher accuracy -- location uncertainty is reduced by 40% relative to the best previous results [21] achieved on location estimation from Tweets.
   We achieve this goal by proposing a new statistical model, the nested Chinese Restaurant Franchise (nCRF), a hierarchical model of tree distributions. Much statistical structure is shared between users. That said, each user has his own distribution over interests and places. The use of the nCRF allows us to capture the following effects: (1) We provide a topic model for Tweets; (2) We obtain location specific topics; (3) We infer a latent distribution of locations; (4) We provide a joint hierarchical model of topics and locations; (5) We infer personalized preferences over topics and locations within the above model. In doing so, we are both able to obtain accurate estimates of the location of a user based on his tweets and to obtain a detailed estimate of a geographical language model.
Distributed large-scale natural graph factorization BIBAFull-Text 37-48
  Amr Ahmed; Nino Shervashidze; Shravan Narayanamurthy; Vanja Josifovski; Alexander J. Smola
Natural graphs, such as social networks, email graphs, or instant messaging patterns, have become pervasive through the internet. These graphs are massive, often containing hundreds of millions of nodes and billions of edges. While some theoretical models have been proposed to study such graphs, their analysis is still difficult due to the scale and nature of the data.
   We propose a framework for large-scale graph decomposition and inference. To resolve the scale, our framework is distributed so that the data are partitioned over a shared-nothing set of machines. We propose a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices rather than edges across partitions. Our decomposition is based on a streaming algorithm. It is network-aware as it adapts to the network topology of the underlying computational hardware. We use local copies of the variables and an efficient asynchronous communication protocol to synchronize the replicated values in order to perform most of the computation without having to incur the cost of network communication. On a graph of 200 million vertices and 10 billion edges, derived from an email communication network, our algorithm retains convergence properties while allowing for almost linear scalability in the number of computers.
A CRM system for social media: challenges and experiences BIBAFull-Text 49-58
  Jitendra Ajmera; Hyung-iL Ahn; Meena Nagarajan; Ashish Verma; Danish Contractor; Stephen Dill; Matthew Denesuk
The social Customer Relationship Management (CRM) landscape is attracting significant attention from customers and enterprises alike as a sustainable channel for tracking, managing and improving customer relations. Enterprises are taking a hard look at this open, unmediated platform because the community effect generated on this channel can have a telling effect on their brand image, potential market opportunity and customer loyalty. In this work we present our experiences in building a system that mines conversations on social platforms to identify and prioritize those posts and messages that are relevant to enterprises. The system presented in this work aims to empower an agent or a representative in an enterprise to monitor, track and respond to customer communication while also encouraging community participation.
Here's my cert, so trust me, maybe?: understanding TLS errors on the web BIBAFull-Text 59-70
  Devdatta Akhawe; Bernhard Amann; Matthias Vallentin; Robin Sommer
When browsers report TLS errors, they cannot distinguish between attacks and harmless server misconfigurations; hence they leave it to the user to decide whether continuing is safe. However, actual attacks remain rare. As a result, users quickly become used to "false positives" that deplete their attention span, making it unlikely that they will pay sufficient scrutiny when a real attack comes along. Consequently, browser vendors should aim to minimize the number of low-risk warnings they report. To guide that process, we perform a large-scale measurement study of common TLS warnings. Using a set of passive network monitors located at different sites, we identify the prevalence of warnings for a total population of about 300,000 users over a nine-month period. We identify low-risk scenarios that consume a large chunk of the user attention budget and make concrete recommendations to browser vendors that will help maintain user attention in high-risk situations. We study the impact on end users with a data set much larger in scale than the data sets used in previous TLS measurement studies. A key novelty of our approach involves the use of internal browser code instead of generic TLS libraries for analysis, providing more accurate and representative results.
Towards a robust modeling of temporal interest change patterns for behavioral targeting BIBAFull-Text 71-82
  Mohamed Aly; Sandeep Pandey; Vanja Josifovski; Kunal Punera
Modern web-scale behavioral targeting platforms leverage historical activity of billions of users to predict user interests and inclinations, and consequently future activities. Future activities of particular interest involve purchases or transactions, and are referred to as conversions. Unlike ad-clicks, conversions directly translate to advertiser's revenue, and thus provide a very concrete metric for return on advertising investment. A typical behavioral targeting system faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions (compared to clicks in a traditional setting). These challenges call for generation of effective and efficient user profiles. Most existing works use the historical intensity of a user's interest in various topics to model future interest. In this paper we explore how the change in user behavior can be used to predict future actions and show how it complements the traditional models of decaying interest and action recency to build a complete picture about the user interests and better predict conversions. Our evaluation over a real-world set of campaigns indicates that the combination of change of interest, decaying intensity, and action recency helps in: 1) scoring significant improvements in optimizing for conversions over traditional baselines, 2) substantially improving the targeting efficiency for campaigns with highly sparse conversions, and 3) highly reducing the overall history sizes used in targeting. Furthermore, our techniques have been deployed to production and scored a substantial improvement in targeting performance while imposing a negligible overhead in terms of overall platform running time.
The anatomy of LDNS clusters: findings and implications for web content delivery BIBAFull-Text 83-94
  Hussein A. Alzoubi; Michael Rabinovich; Oliver Spatscheck
We present a large-scale measurement of clusters of hosts sharing the same local DNS servers. We analyze properties of these "LDNS clusters" from the perspective of content delivery networks, which commonly use DNS for load distribution. We found that not only LDNS clusters differ widely in terms of their size and geographical compactness but that the largest clusters are actually extremely compact. This suggests potential benefits of a load distribution strategy with nuanced treatment of different LDNS clusters based on the combination of their size and compactness. We further observed interesting variations in LDNS setups including a wide use of "LDNS pools" (which as we explain in the paper are different from setups where end-hosts simply utilize multiple resolvers).
Steering user behavior with badges BIBAFull-Text 95-106
  Ashton Anderson; Daniel Huttenlocher; Jon Kleinberg; Jure Leskovec
An increasingly common feature of online communities and social media sites is a mechanism for rewarding user achievements based on a system of badges. Badges are given to users for particular contributions to a site, such as performing a certain number of actions of a given type. They have been employed in many domains, including news sites like the Huffington Post, educational sites like Khan Academy, and knowledge-creation sites like Wikipedia and Stack Overflow. At the most basic level, badges serve as a summary of a user's key accomplishments; however, experience with these sites also shows that users will put in non-trivial amounts of work to achieve particular badges, and as such, badges can act as powerful incentives. Thus far, however, the incentive structures created by badges have not been well understood, making it difficult to deploy badges with an eye toward the incentives they are likely to create.
   In this paper, we study how badges can influence and steer user behavior on a site -- leading both to increased participation and to changes in the mix of activities a user pursues on the site. We introduce a formal model for reasoning about user behavior in the presence of badges, and in particular for analyzing the ways in which badges can steer users to change their behavior. To evaluate the main predictions of our model, we study the use of badges and their effects on the widely used Stack Overflow question-answering site, and find evidence that their badges steer behavior in ways closely consistent with the predictions of our model. Finally, we investigate the problem of how to optimally place badges in order to induce particular user behaviors. Several robust design principles emerge from our framework that could potentially aid in the design of incentives for a broad range of sites.
Cascading tree sheets and recombinant HTML: better encapsulation and retargeting of web content BIBAFull-Text 107-118
  Edward O. Benson; David R. Karger
Cascading Style Sheets (CSS) took a valuable step towards separating web content from presentation. But HTML pages still contain large amounts of "design scaffolding" needed to hierarchically layer content for proper presentation. This paper presents Cascading Tree Sheets (CTS), a CSS-like language for separating this presentational HTML from real content. With CTS, authors can use standard CSS selectors to describe how to graft presentational scaffolding onto their pure-content HTML. This improved separation of content from presentation enables even naive authors to incorporate rich layouts (including interactive Javascript) into their own pages simply by linking to a tree sheet and adding some class names to their HTML.
CopyCatch: stopping group attacks by spotting lockstep behavior in social networks BIBAFull-Text 119-130
  Alex Beutel; Wanhong Xu; Venkatesan Guruswami; Christopher Palow; Christos Faloutsos
How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning ill-gotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. We offer the following contributions: (1) We give a novel problem formulation, with a simple concrete definition of suspicious behavior in terms of graph structure and edge constraints. (2) We offer two algorithms to find such suspicious lockstep behavior -- one provably-convergent iterative algorithm and one approximate, scalable MapReduce implementation. (3) We show that our method severely limits "greedy attacks" and analyze the bounds from the application of the Zarankiewicz problem to our setting. Finally, we demonstrate and discuss the effectiveness of CopyCatch at Facebook and on synthetic data, as well as potential extensions to anomaly detection problems in other domains. CopyCatch is actively in use at Facebook, searching for attacks on Facebook's social graph of over a billion users, many millions of Pages, and billions of Page Likes.
Inferring the demographics of search users: social data meets search queries BIBAFull-Text 131-140
  Bin Bi; Milad Shokouhi; Michal Kosinski; Thore Graepel
Knowing users' views and demographic traits offers a great potential for personalizing web search results or related services such as query suggestion and query completion. Such signals however are often only available for a small fraction of search users, namely those who log in with their social network account and allow its use for personalization of search results. In this paper, we offer a solution to this problem by showing how user demographic traits such as age and gender, and even political and religious views can be efficiently and accurately inferred based on their search query histories. This is accomplished in two steps; we first train predictive models based on the publically available myPersonality dataset containing users' Facebook Likes and their demographic information. We then match Facebook Likes with search queries using Open Directory Project categories. Finally, we apply the model trained on Facebook Likes to large-scale query logs of a commercial search engine while explicitly taking into account the difference between the traits distribution in both datasets. We find that the accuracy of classifying age and gender, expressed by the area under the ROC curve (AUC), are 77% and 84% respectively for predictions based on Facebook Likes, and only degrade to 74% and 80% when based on search queries. On a US state-by-state basis we find a Pearson correlation of 0.72 for political views between the predicted scores and Gallup data, and 0.54 for affiliation with Judaism between predicted scores and data from the US Religious Landscape Survey. We conclude that it is indeed feasible to infer important demographic data of users from their query history based on labelled Likes data and believe that this approach could provide valuable information for personalization and monetization even in the absence of demographic data.
Strategyproof mechanisms for competitive influence in networks BIBAFull-Text 141-150
  Allan Borodin; Mark Braverman; Brendan Lucier; Joel Oren
Motivated by applications to word-of-mouth advertising, we consider a game-theoretic scenario in which competing advertisers want to target initial adopters in a social network. Each advertiser wishes to maximize the resulting cascade of influence, modeled by a general network diffusion process. However, competition between products may adversely impact the rate of adoption for any given firm. The resulting framework gives rise to complex preferences that depend on the specifics of the stochastic diffusion model and the network topology.
   We study this model from the perspective of a central mechanism, such as a social networking platform, that can optimize seed placement as a service for the advertisers. We ask: given the reported demands of the competing firms, how should a mechanism choose seeds to maximize overall efficiency? Beyond the algorithmic problem, competition raises issues of strategic behaviour: rational agents should not be incentivized to underreport their budget demands.
   We show that when there are two players, the social welfare can be 2-approximated by a polynomial-time strategyproof mechanism. Our mechanism is defined recursively, randomizing the order in which advertisers are allocated seeds according to a particular greedy method. For three or more players, we demonstrate that under additional assumptions (satisfied by many existing models of influence spread) there exists a simpler strategyproof (e/e-1)-approximation mechanism; notably, this second mechanism is not necessarily strategyproof when there are only two players.
Reactive crowdsourcing BIBAFull-Text 153-164
  Alessandro Bozzon; Marco Brambilla; Stefano Ceri; Andrea Mauri
An essential aspect for building effective crowdsourcing computations is the ability of "controlling the crowd", i.e. of dynamically adapting the behaviour of the crowdsourcing systems as response to the quantity and quality of completed tasks or to the availability and reliability of performers. Most crowdsourcing systems only provide limited and predefined controls; in contrast, we present an approach to crowdsourcing which provides fine-level, powerful and flexible controls. We model each crowdsourcing application as composition of elementary task types and we progressively transform these high level specifications into the features of a reactive execution environment that supports task planning, assignment and completion as well as performer monitoring and exclusion. Controls are specified as active rules on top of data structures which are derived from the model of the application; rules can be added, dropped or modified, thus guaranteeing maximal flexibility with limited effort.
   We also report on our prototype platform that implements the proposed framework and we show the results of our experimentations with different rule sets, demonstrating how simple changes to the rules can substantially affect time, effort and quality involved in crowdsourcing activities.
On participation in group chats on Twitter BIBAFull-Text 165-176
  Ceren Budak; Rakesh Agrawal
The success of a group depends on continued participation of its members through time. We study the factors that affect continued user participation in the context of educational Twitter chats. To predict whether a user that attended her first session in a particular Twitter chat group will return to the group, we build 5F Model that captures five different factors: individual initiative, group characteristics, perceived receptivity, linguistic affinity and geographical proximity. Through statistical data analysis of thirty Twitter chats over a two year period as well as a survey study, our work provides many insights about group dynamics in Twitter chats. We show similarities between Twitter chats and traditional groups such as the importance of social inclusion and linguistic similarity while also identifying important distinctions such as the insignificance of geographical proximity. We also show that informational support is more important than emotional support in educational Twitter chats, but this does not reduce the sense of community as suggested in earlier studies.
The role of web hosting providers in detecting compromised websites BIBAFull-Text 177-188
  Davide Canali; Davide Balzarotti; Aurélien Francillon
Compromised websites are often used by attackers to deliver malicious content or to host phishing pages designed to steal private information from their victims. Unfortunately, most of the targeted websites are managed by users with little security background -- often unable to detect this kind of threats or to afford an external professional security service.
   In this paper we test the ability of web hosting providers to detect compromised websites and react to user complaints. We also test six specialized services that provide security monitoring of web pages for a small fee.
   During a period of 30 days, we hosted our own vulnerable websites on 22 shared hosting providers, including 12 of the most popular ones. We repeatedly ran five different attacks against each of them. Our tests included a bot-like infection, a drive-by download, the upload of malicious files, an SQL injection stealing credit card numbers, and a phishing kit for a famous American bank. In addition, we also generated traffic from seemingly valid victims of phishing and drive-by download sites. We show that most of these attacks could have been detected by free network or file analysis tools. After 25 days, if no malicious activity was detected, we started to file abuse complaints to the providers. This allowed us to study the reaction of the web hosting providers to both real and bogus complaints.
   The general picture we drew from our study is quite alarming. The vast majority of the providers, or "add-on" security monitoring services, are unable to detect the most simple signs of malicious activity on hosted websites.
Your browsing behavior for a big mac: economics of personal information online BIBAFull-Text 189-200
  Juan Pablo Carrascal; Christopher Riederer; Vijay Erramilli; Mauro Cherubini; Rodrigo de Oliveira
Most online service providers offer free services to users and in part, these services collect and monetize personally identifiable information (PII), primarily via targeted advertisements. Against this backdrop of economic exploitation of PII, it is vital to understand the value that users put to their own PII. Although studies have tried to discover how users value their privacy, little is known about how users value their PII while browsing, or the exploitation of their PII. Extracting valuations of PII from users is non-trivial -- surveys cannot be relied on as they do not gather information of the context where PII is being released, thus reducing validity of answers. In this work, we rely on refined Experience Sampling -- a data collection method that probes users to valuate their PII at the time and place where it was generated in order to minimize retrospective recall and hence increase measurement validity. For obtaining an honest valuation of PII, we use a reverse second price auction. We developed a web browser plugin and had 168 users -- living in Spain -- install and use this plugin for 2 weeks in order to extract valuations of PII in different contexts.
   We found that users value items of their online browsing history for about €7 (~10USD), and they give higher valuations to their offline PII, such as age and address (about 25€ or 36USD). When it comes to PII shared in specific online services, users value information pertaining to financial transactions and social network interactions more than activities like search and shopping. No significant distinction was found between valuations of different quantities of PII (e.g. one vs. 10 search keywords), but deviation was found between types of PII (e.g. photos vs. keywords). Finally, the users' preferred goods for exchanging their PII included money and improvements in service, followed by getting more free services and targeted advertisements.
Is this app safe for children?: a comparison study of maturity ratings on Android and iOS applications BIBAFull-Text 201-212
  Ying Chen; Heng Xu; Yilu Zhou; Sencun Zhu
There is a rising concern among parents who have experienced unreliable content maturity ratings for mobile applications (apps) that result in inappropriate risk exposure for their children and adolescents. In reality, there is no consistent maturity rating policy for mobile applications. The maturity ratings of Android apps are provided purely by developers' self-disclosure and are rarely verified. While Apple's iOS app ratings are considered to be more accurate, they can also be inconsistent with Apple's published policies. To address these issues, this research aims to systematically uncover the extent and severity of unreliable maturity ratings for mobile apps. Specifically, we develop mechanisms to verify the maturity ratings of mobile apps and investigate possible reasons behind the incorrect ratings. We believe that our findings have important implications for platform providers (e.g., Google or Apple) as well as for regulatory bodies and application developers.
Traveling the silk road: a measurement analysis of a large anonymous online marketplace BIBAFull-Text 213-224
  Nicolas Christin
We perform a comprehensive measurement analysis of Silk Road, an anonymous, international online marketplace that operates as a Tor hidden service and uses Bitcoin as its exchange currency. We gather and analyze data over eight months between the end of 2011 and 2012, including daily crawls of the marketplace for nearly six months in 2012. We obtain a detailed picture of the type of goods sold on Silk Road, and of the revenues made both by sellers and Silk Road operators. Through examining over 24,400 separate items sold on the site, we show that Silk Road is overwhelmingly used as a market for controlled substances and narcotics, and that most items sold are available for less than three weeks. The majority of sellers disappears within roughly three months of their arrival, but a core of 112 sellers has been present throughout our measurement interval. We evaluate the total revenue made by all sellers, from public listings, to slightly over USD 1.2 million per month; this corresponds to about USD 92,000 per month in commissions for the Silk Road operators. We further show that the marketplace has been operating steadily, with daily sales and number of sellers overall increasing over our measurement interval. We discuss economic and policy implications of our analysis and results, including ethical considerations for future research in this area.
Group chats on Twitter BIBAFull-Text 225-236
  James Cook; Krishnaram Kenthapadi; Nina Mishra
We report on a new kind of group conversation on Twitter that we call a group chat. These chats are periodic, synchronized group conversations focused on specific topics and they exist at a massive scale. The groups and the members of these groups are not explicitly known. Rather, members agree on a hashtag and a meeting time (e.g, 3pm Pacific Time every Wednesday) to discuss a subject of interest. Topics of these chats are numerous and varied. Some are support groups, for example, post-partum depression and mood disorder groups. Others are about a passionate interest: topics include skiing, photography, movies, wine and foodie communities. We develop a definition of a group that is inspired by how sociologists define groups and present an algorithm for discovering groups. We prove that our algorithms find all groups under certain assumptions. While these groups are of course known to the people who participate in the discussions, what we do not believe is known is the scale and variety of groups. We provide some insight into the nature of these groups based on over two years of tweets. Finally, we show that group chats are a growing phenomenon on Twitter and hope that reporting their existence propels their growth even further.
How to grow more pairs: suggesting review targets for comparison-friendly review ecosystems BIBAFull-Text 237-248
  James Cook; Alex Fabrikant; Avinatan Hassidim
We consider the algorithmic challenges behind a novel interface that simplifies consumer research of online reviews by surfacing relevant comparable review bundles: reviews for two or more of the items being researched, all generated in similar enough circumstances to provide for easy comparison. This can be reviews by the same reviewer, or by the same demographic category of reviewer, or reviews focusing on the same aspect of the items. But such an interface will work only if the review ecosystem often has comparable review bundles for common research tasks.
   Here, we develop and evaluate practical algorithms for suggesting additional review targets to reviewers to maximize comparable pair coverage, the fraction of co-researched pairs of items that have both been reviewed by the same reviewer (or more generally are comparable in one of several ways). We show the exact problem and many subcases to be intractable, and give a greedy online, linear-time 2-approximation for a very general setting, and an offline 1.583-approximation for a narrower setting. We evaluate the algorithms on the Google+ Local reviews dataset, yielding more than 10x gain in pair coverage from six months of simulated replacement of existing reviews by suggested reviews. Even allowing for 90% of reviewers ignoring the suggestions, the pair coverage grows more than 2x in the simulation. To explore other parts of the parameter space, we also evaluate the algorithms on synthetic models.
A framework for benchmarking entity-annotation systems BIBAFull-Text 249-260
  Marco Cornolti; Paolo Ferragina; Massimiliano Ciaramita
In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily-extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions upon their efficiency and effectiveness. We also draw conclusions between academic versus commercial annotators.
A framework for learning web wrappers from the crowd BIBAFull-Text 261-272
  Valter Crescenzi; Paolo Merialdo; Disheng Qiu
The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowd sourcing platforms are making the manual annotation process more affordable. However, the tasks demanded to these platforms should be extremely simple, to be performed by non-expert people, and their number should be minimized, to contain the costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. We show that the costs of producing the training data are strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism, assuming it is able to express the wrapper. Conversely, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries to the crowd. We report the results of experiments on real web sources to confirm the effectiveness and the feasibility of the approach.
Lightweight server support for browser-based CSRF protection BIBAFull-Text 273-284
  Alexei Czeskis; Alexander Moshchuk; Tadayoshi Kohno; Helen J. Wang
Cross-Site Request Forgery (CSRF) attacks are one of the top threats on the web today. These attacks exploit ambient authority in browsers (eg cookies, HTTP authentication state), turning them into confused deputies and causing undesired side effects on vulnerable web sites. Existing defenses against CSRFs fall short in their coverage and/or ease of deployment. In this paper, we present a browser/server solution, Allowed Referrer Lists (ARLs), that addresses the root cause of CSRFs and removes ambient authority for participating web sites that want to be resilient to CSRF attacks. Our solution is easy for web sites to adopt and does not affect any functionality on non-participating sites. We have implemented our design in Firefox and have evaluated it with real-world sites. We found that ARLs successfully block CSRF attacks, are simpler to implement than existing defenses, and do not significantly impact browser performance.
Aggregating crowdsourced binary ratings BIBAFull-Text 285-294
  Nilesh Dalvi; Anirban Dasgupta; Ravi Kumar; Vibhor Rastogi
In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed, reliability that determines the user's error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user -- question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art.
Optimal hashing schemes for entity matching BIBAFull-Text 295-306
  Nilesh Dalvi; Vibhor Rastogi; Anirban Dasgupta; Anish Das Sarma; Tamas Sarlos
In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.
   In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies.
No country for old members: user lifecycle and linguistic change in online communities BIBAFull-Text 307-318
  Cristian Danescu-Niculescu-Mizil; Robert West; Dan Jurafsky; Jure Leskovec; Christopher Potts
Vibrant online communities are in constant flux. As members join and depart, the interactional norms evolve, stimulating further changes to the membership and its social dynamics. Linguistic change -- in the sense of innovation that becomes accepted as the norm -- is essential to this dynamic process: it both facilitates individual expression and fosters the emergence of a collective identity.
   We propose a framework for tracking linguistic change as it happens and for understanding how specific users react to these evolving norms. By applying this framework to two large online communities we show that users follow a determined two-stage lifecycle with respect to their susceptibility to linguistic change: a linguistically innovative learning phase in which users adopt the language of the community followed by a conservative phase in which users stop changing and the evolving community norms pass them by.
   Building on this observation, we show how this framework can be used to detect, early in a user's career, how long she will stay active in the community. Thus, this work has practical significance for those who design and maintain online communities. It also yields new theoretical insights into the evolution of linguistic norms and the complex interplay between community-level and individual-level linguistic change.
Crowdsourced judgement elicitation with endogenous proficiency BIBAFull-Text 319-330
  Anirban Dasgupta; Arpita Ghosh
Crowdsourcing is now widely used to replace judgement or evaluation by an expert authority with an aggregate evaluation from a number of non-experts, in applications ranging from rating and categorizing online content all the way to evaluation of student assignments in massively open online courses (MOOCs) via peer grading. A key issue in these settings, where direct monitoring of both effort and accuracy is infeasible, is incentivizing agents in the 'crowd' to put in effort to make good evaluations, as well as to truthfully report their evaluations. We study the design of mechanisms for crowdsourced judgement elicitation when workers strategically choose both their reports and the effort they put into their evaluations. This leads to a new family of information elicitation problems with unobservable ground truth, where an agent's proficiency -- the probability with which she correctly evaluates the underlying ground truth -- is endogenously determined by her strategic choice of how much effort to put into the task.
   Our main contribution is a simple, new, mechanism for binary information elicitation for multiple tasks when agents have endogenous proficiencies, with the following properties: (i) Exerting maximum effort followed by truthful reporting of observations is a Nash equilibrium. (ii) This is the equilibrium with maximum payoff to all agents, even when agents have different maximum proficiencies, can use mixed strategies, and can choose a different strategy for each of their tasks. Our information elicitation mechanism requires only minimal bounds on the priors, asks agents to only report their own evaluations, and does not require any conditions on a diverging number of agent reports per task to achieve its incentive properties. The main idea behind our mechanism is to use the presence of multiple tasks and ratings to estimate a reporting statistic to identify and penalize low-effort agreement -- the mechanism rewards agents for agreeing with another 'reference' report on the same task, but also penalizes for blind agreement by subtracting out this statistic term, designed so that agents obtain rewards only when they put in effort into their observations.
Timespent based models for predicting user retention BIBAFull-Text 331-342
  Kushal S. Dave; Vishal Vaingankar; Sumanth Kolar; Vasudeva Varma
Content discovery is fast becoming the preferred tool for user engagement on the web. Discovery allows users to get educated and entertained about their topics of interest. StumbleUpon is the largest personalized content discovery engine on the Web, delivering more than 1 billion personalized recommendations per month. As a recommendation system one of the primary metrics we track is whether the user returns (retention) to use the product after their initial experience (session) with StumbleUpon.
   In this paper, we attempt to address the problem of predicting user retention based on the user's previous sessions. The paper first explores the different user and content features that are helpful in predicting user retention. This involved mapping the user and the user's recommendations (stumbles) in a descriptive feature space such as the time-spent by user, number of stumbles, and content features of the recommendations. To model the diversity in user behaviour, we also generated normalized features that account for the user's speed of stumbling. Using these features, we built a decision tree classifier to predict retention. We find that a model that uses both the user and content features achieves higher prediction accuracy than a model that uses the two features separately. Further, we used information theoretical analysis to find a subset of recommendations that are most indicative of user retention. A classifier trained on this subset of recommendations achieves the highest prediction accuracy. This indicates that not every recommendation seen by the user is predictive of whether the user will be retained; instead, a subset of most informative recommendations is more useful in predicting retention.
Attributing authorship of revisioned content BIBAFull-Text 343-354
  Luca de Alfaro; Michael Shavlovsky
A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia.
   Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia.
ClausIE: clause-based open information extraction BIBAFull-Text 355-366
  Luciano Del Corro; Rainer Gemulla
We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text. ClausIE fundamentally differs from previous approaches in that it separates the detection of "useful" pieces of information expressed in a sentence from their representation in terms of extractions. In more detail, ClausIE exploits linguistic knowledge about the grammar of the English language to first detect clauses in an input sentence and to subsequently identify the type of each clause according to the grammatical function of its constituents. Based on this information, ClausIE is able to generate high-precision extractions; the representation of these extractions can be flexibly customized to the underlying application. ClausIE is based on dependency parsing and a small set of domain-independent lexica, operates sentence by sentence without any post-processing, and requires no training data (whether labeled or unlabeled). Our experimental study on various real-world datasets suggests that ClausIE obtains higher recall and higher precision than existing approaches, both on high-quality text as well as on noisy text as found in the web.
Pick-a-crowd: tell me what you like, and i'll tell you what to do BIBAFull-Text 367-374
  Djellel Eddine Difallah; Gianluca Demartini; Philippe Cudré-Mauroux
Crowdsourcing allows to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult to tackle for current algorithms. Examples include hybrid database systems that use the crowd to fill missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to Crowdsourcing adopt a pull methodology where tasks are published on specialized Web platforms where workers can pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that the task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different Crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on one hand, and categories liked by the user on social platforms on the other hand. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yield better results than usual pull strategies.
Compact explanation of data fusion decisions BIBAFull-Text 379-390
  Xin Luna Dong; Divesh Srivastava
Despite the abundance of useful information on the Web, different Web sources often provide conflicting data, some being out-of-date, inaccurate, or erroneous. Data fusion aims at resolving conflicts and finding the truth. Advanced fusion techniques apply iterative MAP (Maximum A Posteriori) analysis that reasons about trustworthiness of sources and copying relationships between them. Providing explanations for such decisions is important for a better understanding, but can be extremely challenging because of the complexity of the analysis during decision making.
   This paper proposes two types of explanations for data-fusion results: snapshot explanations take the provided data and any other decision inferred from the data as evidence and provide a high-level understanding of a fusion decision; comprehensive explanations take only the data as evidence and provide an in-depth understanding of a fusion decision. We propose techniques that can efficiently generate correct and compact explanations. Experimental results show that (1) we generate correct explanations, (2) our techniques can significantly reduce the sizes of the explanations, and (3) we can generate the explanations efficiently.
From query to question in one click: suggesting synthetic questions to searchers BIBAFull-Text 391-402
  Gideon Dror; Yoelle Maarek; Avihai Mejer; Idan Szpektor
In Web search, users may remain unsatisfied for several reasons: the search engine may not be effective enough or the query might not reflect their intent. Years of research focused on providing the best user experience for the data available to the search engine. However, little has been done to address the cases in which relevant content for the specific user need has not been posted on the Web yet. One obvious solution is to directly ask other users to generate the missing content using Community Question Answering services such as Yahoo! Answers or Baidu Zhidao. However, formulating a full-fledged question after having issued a query requires some effort. Some previous work proposed to automatically generate natural language questions from a given query, but not for scenarios in which a searcher is presented with a list of questions to choose from. We propose here to generate synthetic questions that can actually be clicked by the searcher so as to be directly posted as questions on a Community Question Answering service. This imposes new constraints, as questions will be actually shown to searchers, who will not appreciate an awkward style or redundancy. To this end, we introduce a learning-based approach that improves not only the relevance of the suggested questions to the original query, but also their grammatical correctness. In addition, since queries are often underspecified and ambiguous, we put a special emphasis on increasing the diversity of suggestions via a novel diversification mechanism. We conducted several experiments to evaluate our approach by comparing it to prior work. The experiments show that our algorithm improves question quality by 14% over prior work and that adding diversification reduced redundancy by 55%.
Perception and understanding of social annotations in web search BIBAFull-Text 403-412
  Jennifer Fernquist; Ed H. Chi
As web search increasingly becomes reliant on social signals, it is imperative for us to understand the effect of these signals on users' behavior. There are multiple ways in which social signals can be used in search: (a) to surface and rank important social content; (b) to signal to users which results are more trustworthy and important by placing annotations on search results. We focus on the latter problem of understanding how social annotations affect user behavior.
   In previous work, through eyetracking research we learned that users do not generally seem to fixate on social annotations when they are placed at the bottom of the search result block, with 11% probability of fixation [22]. A second eyetracking study showed that placing the annotation on top of the snippet block might mitigate this issue [22], but this study was conducted using mock-ups and with expert searchers.
   In this paper, we describe a study conducted with a new eyetracking mix-method using a live traffic search engine with the suggested design changes on real users using the same experimental procedures. The study comprised of 11 subjects with an average of 18 tasks per subject using an eyetrace-assisted retrospective think-aloud protocol. Using a funnel analysis, we found that users are indeed more likely to notice the annotations with a 60% probability of fixation (if the annotation was in view). Moreover, we found no learning effects across search sessions but found significant differences in query types, with subjects having a lower chance of fixating on annotations for queries in the news category. In the interview portion of the study, users reported interesting "wow" moments as well as usefulness in recalling or re-finding content previously shared by oneself or friends. The results not only shed light on how social annotations should be designed in search engines, but also how users make use of social annotations to make decisions about which pages are useful and potentially trustworthy.
AMIE: association rule mining under incomplete evidence in ontological knowledge bases BIBAFull-Text 413-422
  Luis Antonio Galárraga; Christina Teflioudi; Katja Hose; Fabian Suchanek
Recent advances in information extraction have led to huge knowledge bases (KBs), which capture knowledge in a machine-readable format. Inductive Logic Programming (ILP) can be used to mine logical rules from the KB. These rules can help deduce and add missing knowledge to the KB. While ILP is a mature field, mining logical rules from KBs is different in two aspects: First, current rule mining systems are easily overwhelmed by the amount of data (state-of-the art systems cannot even run on today's KBs). Second, ILP usually requires counterexamples. KBs, however, implement the open world assumption (OWA), meaning that absent data cannot be used as counterexamples. In this paper, we develop a rule mining model that is explicitly tailored to support the OWA scenario. It is inspired by association rule mining and introduces a novel measure for confidence. Our extensive experiments show that our approach outperforms state-of-the-art approaches in terms of precision and coverage. Furthermore, our system, AMIE, mines rules orders of magnitude faster than state-of-the-art approaches.
PrefixSolve: efficiently solving multi-source multi-destination path queries on RDF graphs by sharing suffix computations BIBAFull-Text 423-434
  Sidan Gao; Kemafor Anyanwu
Uncovering the "nature" of the connections between a set of entities e.g. passengers on a flight and organizations on a watchlist can be viewed as a Multi-Source Multi-Destination (MSMD) Path Query problem on labeled graph data models such as RDF. Using existing graph-navigational path finding techniques to solve MSMD problems will require queries to be decomposed into multiple single-source or destination path subqueries, each of which is solved independently. Navigational techniques on disk-resident graphs typically generate very poor I/O access patterns for large, disk-resident graphs and for MSMD path queries, such poor access patterns may be repeated if common graph exploration steps exist across subqueries.
   In this paper, we propose an optimization technique for general MSMD path queries that generalizes an efficient algebraic approach for solving a variety of single-source path problems. The generalization enables holistic evaluation of MSMD path queries without the need for query decomposition. We present a conceptual framework for sharing computation in the algebraic framework that is based on "suffix equivalence". Suffix equivalence amongst subqueries captures the fact that multiple subqueries with different prefixes can share a suffix and as such share the computation of shared suffixes, which allows prefix path computations to share common suffix path computations. This approach offers orders of magnitude better performance than current existing techniques as demonstrated by a comprehensive experimental evaluation over real and synthetic datasets.
When tolerance causes weakness: the case of injection-friendly browsers BIBAFull-Text 435-446
  Yossi Gilad; Amir Herzberg
We present a practical off-path TCP-injection attack for connections between current, non-buggy browsers and web-servers. The attack allows web-cache poisoning with malicious objects; these objects can be cached for long time period, exposing any user of that cache to XSS, CSRF and phishing attacks.
   In contrast to previous TCP-injection attacks, we assume neither vulnerabilities such as client-malware nor predictable choice of client port or IP-ID. We only exploit subtle details of HTTP and TCP specifications, and features of legitimate (and common) browser implementations. An empirical evaluation of our techniques with current versions of browsers shows that connections with popular websites are vulnerable. Our attack is modular, and its modules may improve other off-path attacks on TCP communication.
   We present practical patches against the attack; however, the best defense is surely adoption of TLS, that ensures security even against the stronger Man-in-the-Middle attacker.
Exploiting innocuous activity for correlating users across sites BIBAFull-Text 447-458
  Oana Goga; Howard Lei; Sree Hari Krishnan Parthasarathi; Gerald Friedland; Robin Sommer; Renata Teixeira
We study how potential attackers can identify accounts on different social network sites that all belong to the same user, exploiting only innocuous activity that inherently comes with posted content. We examine three specific features on Yelp, Flickr, and Twitter: the geo-location attached to a user's posts, the timestamp of posts, and the user's writing style as captured by language models. We show that among these three features the location of posts is the most powerful feature to identify accounts that belong to the same user in different sites. When we combine all three features, the accuracy of identifying Twitter accounts that belong to a set of Flickr users is comparable to that of existing attacks that exploit usernames. Our attack can identify 37% more accounts than using usernames when we instead correlate Yelp and Twitter. Our results have significant privacy implications as they present a novel class of attacks that exploit users' tendency to assume that, if they maintain different personas with different names, the accounts cannot be linked together; whereas we show that the posts themselves can provide enough information to correlate the accounts.
The cost of annoying ads BIBAFull-Text 459-470
  Daniel G. Goldstein; R. Preston McAfee; Siddharth Suri
Display advertisements vary in the extent to which they annoy users. While publishers know the payment they receive to run annoying ads, little is known about the cost such ads incur due to user abandonment. We conducted a two-experiment investigation to analyze ad features that relate to annoyingness and to put a monetary value on the cost of annoying ads. The first experiment asked users to rate and comment on a large number of ads taken from the Web. This allowed us to establish sets of annoying and innocuous ads for use in the second experiment, in which users were given the opportunity to categorize emails for a per-message wage and quit at any time. Participants were randomly assigned to one of three different pay rates and also randomly assigned to categorize the emails in the presence of no ads, annoying ads, or innocuous ads. Since each email categorization constituted an impression, this design, inspired by Toomim et al., allowed us to determine how much more one must pay a person to generate the same number of impressions in the presence of annoying ads compared to no ads or innocuous ads. We conclude by proposing a theoretical model which relates ad quality to publisher market share, illustrating how our empirical findings could affect the economics of Internet advertising.
Researcher homepage classification using unlabeled data BIBAFull-Text 471-482
  Sujatha Das Gollapalli; Cornelia Caragea; Prasenjit Mitra; C. Lee Giles
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling.
   We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites.
   In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set.
Google+ or Google-?: dissecting the evolution of the new OSN in its first year BIBAFull-Text 483-494
  Roberto Gonzalez; Ruben Cuevas; Reza Motamedi; Reza Rejaie; Angel Cuevas
In the era when Facebook and Twitter dominate the market for social media, Google has introduced Google+ (G+) and reported a significant growth in its size while others called it a ghost town. This begs the question that "whether G+ can really attract a significant number of connected and active users despite the dominance of Facebook and Twitter?".
   This paper tackles the above question by presenting a detailed characterization of G+ based on large scale measurements. We identify the main components of G+ structure, characterize the key features of their users and their evolution over time. We then conduct detailed analysis on the evolution of connectivity and activity among users in the largest connected component (LCC) of G+ structure, and compare their characteristics with other major OSNs. We show that despite the dramatic growth in the size of G+, the relative size of LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity. To our knowledge, our study offers the most comprehensive characterization of G+ based on the largest collected data sets.
Probabilistic group recommendation via information matching BIBAFull-Text 495-504
  Jagadeesh Gorla; Neal Lathia; Stephen Robertson; Jun Wang
Increasingly, web recommender systems face scenarios where they need to serve suggestions to groups of users; for example, when families share e-commerce or movie rental web accounts. Research to date in this domain has proposed two approaches: computing recommendations for the group by merging any members' ratings into a single profile, or computing ranked recommendations for each individual that are then merged via a range of heuristics. In doing so, none of the past approaches reason on the preferences that arise in individuals when they are members of a group. In this work, we present a probabilistic framework, based on the notion of information matching, for group recommendation. This model defines group relevance as a combination of the item's relevance to each user as an individual and as a member of the group; it can then seamlessly incorporate any group recommendation strategy in order to rank items for a set of individuals. We evaluate the model's efficacy at generating recommendations for both single individuals and groups using the MovieLens and MoviePilot data sets. In both cases, we compare our results with baselines and state-of-the-art collaborative filtering algorithms, and show that the model outperforms all others over a variety of ranking metrics.
WTF: the who to follow service at Twitter BIBAFull-Text 505-514
  Pankaj Gupta; Ashish Goel; Jimmy Lin; Aneesh Sharma; Dong Wang; Reza Zadeh
WTF ("Who to Follow") is Twitter's user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter's user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.
Mining expertise and interests from social media BIBAFull-Text 515-526
  Ido Guy; Uri Avraham; David Carmel; Sigalit Ur; Michal Jacovi; Inbal Ronen
The rising popularity of social media in the enterprise presents new opportunities for one of the organization's most important needs -- expertise location. Social media data can be very useful for expertise mining due to the variety of existing applications, the rich metadata, and the diversity of user associations with content. In this work, we provide an extensive study that explores the use of social media to infer expertise within a large global organization. We examine eight different social media applications by evaluating the data they produce through a large user survey, with 670 enterprise social media users. We distinguish between two semantics that relate a user to a topic: expertise in the topic and interest in it and compare these two semantics across the different social media applications.
Measuring personalization of web search BIBAFull-Text 527-538
  Aniko Hannak; Piotr Sapiezynski; Arash Molavi Kakhki; Balachander Krishnamurthy; David Lazer; Alan Mislove; Christo Wilson
Web search is an integral part of our daily lives. Recently, there has been a trend of personalization in Web search, where different users receive different results for the same search query. The increasing personalization is leading to concerns about Filter Bubble effects, where certain users are simply unable to access information that the search engines' algorithm decides is irrelevant. Despite these concerns, there has been little quantification of the extent of personalization in Web search today, or the user attributes that cause it.
   In light of this situation, we make three contributions. First, we develop a methodology for measuring personalization in Web search results. While conceptually simple, there are numerous details that our methodology must handle in order to accurately attribute differences in search results to personalization. Second, we apply our methodology to 200 users on Google Web Search; we find that, on average, 11.7% of results show differences due to personalization, but that this varies widely by search query and by result ranking. Third, we investigate the causes of personalization on Google Web Search. Surprisingly, we only find measurable personalization as a result of searching with a logged in account and the IP address of the searching user. Our results are a first step towards understanding the extent and effects of personalization on Web search engines today.
Estimating clustering coefficients and size of social networks via random walk BIBAFull-Text 539-550
  Stephen J. Hardiman; Liran Katzir
Online social networks have become a major force in today's society and economy. The largest of today's social networks may have hundreds of millions to more than a billion users. Such networks are too large to be downloaded or stored locally, even if terms of use and privacy policies were to permit doing so. This limitation complicates even simple computational tasks. One such task is computing the clustering coefficient of a network. Another task is to compute the network size (number of registered users) or a subpopulation size.
   The clustering coefficient, a classic measure of network connectivity, comes in two flavors, global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly available interface. More precisely, this work provides three new estimation algorithms (a) the first external access algorithm for estimating the global clustering coefficient; (b) an external access algorithm that improves on the accuracy of previous network average clustering coefficient estimation algorithms; and (c) an improved external access network size estimation algorithm.
   The main insight offered by this work is that only a relatively small number of public interface calls are required to allow our algorithms to achieve a high accuracy estimation. Our approach is to view a social network as an undirected graph and use the public interface to retrieve a random walk. To estimate the clustering coefficient, the connectivity of each node in the random walk sequence is tested in turn. We show that the error of this estimation drops exponentially in the number of random walk steps. Another insight of this work is the fact that, although the proposed algorithms can be used to estimate the clustering coefficient of any undirected graph, they are particularly efficient on social network-like graphs. To improve the network size prior-art estimation algorithms, we count node collision one step before they actually occur. In our experiments we validate our algorithms on several publicly available social network datasets. Our results validate the theoretical claims and demonstrate the effectiveness of our algorithms.
Exploiting annotations for the rapid development of collaborative web applications BIBAFull-Text 551-560
  Matthias Heinrich; Franz Josef Grüneberger; Thomas Springer; Martin Gaedke
Web application frameworks are a proven means to accelerate the development of interactive web applications. However, implementing collaborative real-time applications like Google Docs requires specific concurrency control services (i.e. document synchronization and conflict resolution) that are not included in prevalent general-purpose frameworks like jQuery or Knockout. Hence, developers have to get familiar with specific collaboration frameworks (e.g. ShareJS) which substantially increases the development effort. To ease the development of collaborative web applications, we propose a set of source code annotations representing a lightweight mechanism to introduce concurrency control services into mature web frameworks. Those annotations are interpreted at runtime by a dedicated collaboration engine to sync documents and resolve conflicts. We enhanced the general-purpose framework Knockout with a collaboration engine and conducted a developer study comparing our approach to a traditional concurrency control library. The evaluation results show that the effort to incorporate collaboration capabilities into a web application can be reduced by up to 40 percent using the annotation-based solution.
Web usage mining with semantic analysis BIBAFull-Text 561-570
  Laura Hollink; Peter Mika; Roi Blanco
Web usage mining has traditionally focused on the individual queries or query words leading to a web site or web page visit, mining patterns in such data. In our work, we aim to characterize websites in terms of the semantics of the queries that lead to them by linking queries to large knowledge bases on the Web. We demonstrate how to exploit such links for more effective pattern mining on query log data. We also show how such patterns can be used to qualitatively describe the differences between competing websites in the same domain and to quantitatively predict website abandonment.
Organizational overlap on social networks and its applications BIBAFull-Text 571-582
  Cho-Jui Hsieh; Mitul Tiwari; Deepak Agarwal; Xinyi (Lisa) Huang; Sam Shah
Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organization overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating personalized news feed.
Space-efficient data structures for Top-k completion BIBAFull-Text 583-594
  Bo-June (Paul) Hsu; Giuseppe Ottaviano
Virtually every modern search application, either desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited.
   We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
Personalized recommendation via cross-domain triadic factorization BIBAFull-Text 595-606
  Liang Hu; Jian Cao; Guandong Xu; Longbing Cao; Zhiping Gu; Can Zhu
Collaborative filtering (CF) is a major technique in recommender systems to help users find their potentially desired items. Since the data sparsity problem is quite commonly encountered in real-world scenarios, Cross-Domain Collaborative Filtering (CDCF) hence is becoming an emerging research topic in recent years. However, due to the lack of sufficient dense explicit feedbacks and even no feedback available in users' uninvolved domains, current CDCF approaches may not perform satisfactorily in user preference prediction. In this paper, we propose a generalized Cross Domain Triadic Factorization (CDTF) model over the triadic relation user-item-domain, which can better capture the interactions between domain-specific user factors and item factors. In particular, we devise two CDTF algorithms to leverage user explicit and implicit feedbacks respectively, along with a genetic algorithm based weight parameters tuning algorithm to trade off influence among domains optimally. Finally, we conduct experiments to evaluate our models and compare with other state-of-the-art models by using two real world datasets. The results show the superiority of our models against other comparative models.
Unsupervised sentiment analysis with emotional signals BIBAFull-Text 607-618
  Xia Hu; Jiliang Tang; Huiji Gao; Huan Liu
The explosion of social media services presents a great opportunity to understand the sentiment of the public via analyzing its large-scale and opinion-rich data. In social media, it is easy to amass vast quantities of unlabeled data, but very costly to obtain sentiment labels, which makes unsupervised sentiment analysis essential for various applications. It is challenging for traditional lexicon-based unsupervised methods due to the fact that expressions in social media are unstructured, informal, and fast-evolving. Emoticons and product ratings are examples of emotional signals that are associated with sentiments expressed in posts or words. Inspired by the wide availability of emotional signals in social media, we propose to study the problem of unsupervised sentiment analysis with emotional signals. In particular, we investigate whether the signals can potentially help sentiment analysis by providing a unified way to model two main categories of emotional signals, i.e., emotion indication and emotion correlation. We further incorporate the signals into an unsupervised learning framework for sentiment analysis. In the experiment, we compare the proposed framework with the state-of-the-art methods on two Twitter datasets and empirically evaluate our proposed framework to gain a deep understanding of the effects of emotional signals.
An analysis of socware cascades in online social networks BIBAFull-Text 619-630
  Ting-Kai Huang; Md Sazzadur Rahman; Harsha V. Madhyastha; Michalis Faloutsos
Online social networks (OSNs) have become a popular new vector for distributing malware and spam, which we refer to as socware. Unlike email spam, which is sent by spammers directly to intended victims, socware cascades through OSNs as compromised users spread it to their friends. In this paper, we analyze data from the walls of roughly 3 million Facebook users over five months, with the goal of developing a better understanding of socware cascades.
   We study socware cascades to understand: (a) their spatio-temporal properties, (b) the underlying motivations and mechanisms, and (c) the social engineering tricks used to con users. First, we identify an evolving trend in which cascades appear to be throttling their rate of growth to evade detection, and thus, lasting longer. Second, our forensic investigation into the infrastructure that supports these cascades shows that, surprisingly, Facebook seems to be inadvertently enabling most cascades; 44% of cascades are disseminated via Facebook applications. At the same time, we observe large groups of synergistic Facebook apps (more than 144 groups of size 5 or more) that collaborate to support multiple cascades. Lastly, we find that hackers rely on two social engineering tricks in equal measure?luring users with free products and appealing to users' social curiosity?to enable socware cascades. Our findings present several promising avenues towards reducing socware on Facebook, but also highlight associated challenges.
Measurement and analysis of child pornography trafficking on P2P networks BIBAFull-Text 631-642
  Ryan Hurley; Swagatika Prusty; Hamed Soroush; Robert J. Walls; Jeannie Albrecht; Emmanuel Cecchet; Brian Neil Levine; Marc Liberatore; Brian Lynn; Janis Wolak
Peer-to-peer networks are the most popular mechanism for the criminal acquisition and distribution of child pornography (CP). In this paper, we examine observations of peers sharing known CP on the eMule and Gnutella networks, which were collected by law enforcement using forensic tools that we developed. We characterize a year's worth of network activity and evaluate different strategies for prioritizing investigators' limited resources. The highest impact research in criminal forensics works within, and is evaluated under, the constraints and goals of investigations. We follow that principle, rather than presenting a set of isolated, exploratory characterizations of users.
   First, we focus on strategies for reducing the number of CP files available on the network by removing a minimal number of peers. We present a metric for peer removal that is more effective than simply selecting peers with the largest libraries or the most days online. Second, we characterize six aggressive peer subgroups, including: peers using Tor, peers that bridge multiple p2p networks, and the top 10% of peers contributing to file availability. We find that these subgroups are more active in their trafficking, having more known CP and more uptime, than the average peer. Finally, while in theory Tor presents a challenge to investigators, we observe that in practice offenders use Tor inconsistently. Over 90% of regular Tor users send traffic from a non-Tor IP at least once after first using Tor.
HeteroMF: recommendation in heterogeneous information networks using context dependent factor models BIBAFull-Text 643-654
  Mohsen Jamali; Laks Lakshmanan
With the growing amount of information available online, recommender systems are starting to provide a viable alternative and complement to search engines, in helping users to find objects of interest. Methods based on Matrix Factorization (MF) models are the state-of-the-art in recommender systems. The input to MF is user feedback, in the form of a rating matrix. However, users can be engaged in interactions with multiple types of entities across different contexts, leading to multiple rating matrices. In other words, users can have interactions in a heterogeneous information network. Generally, in a heterogeneous network, entities from any two entity types can have interactions with a weight (rating) indicating the level of endorsement. Collective Matrix Factorization (CMF) has been proposed to address the recommendation problem in heterogeneous networks. However, a main issue with CMF is that entities share the same latent factor across different contexts. This is particularly problematic in two cases: Latent factors for entities that are cold-start in a context will be learnt mainly based on the data from other contexts where these entities are not cold-start, and therefore the factors are not properly learned for the cold-start context. Also, if a context has more data compared to another context, then the dominant context will dominate the learning process for the latent factors for entities shared in these two contexts. In this paper, we propose a context-dependent matrix factorization model, HeteroMF, that considers a general latent factor for entities of every entity type and context-dependent latent factors for every context in which the entities are involved. We learn a general latent factor for every entity and transfer matrices for every context to convert the general latent factors into a context-dependent latent factor. Experiments on two real life datasets from Epinions and Flixster demonstrate that HeteroMF substantially outperforms CMF, particularly for cold-start entities and for contexts where interactions in one contexts are dominated by other contexts.
Interactive exploratory search for multi page search results BIBAFull-Text 655-666
  Xiaoran Jin; Marc Sloan; Jun Wang
Modern information retrieval interfaces typically involve multiple pages of search results, and users who are recall minded or engaging in exploratory search using ad hoc queries are likely to access more than one page. Document rankings for such queries can be improved by allowing additional context to the query to be provided by the user herself using explicit ratings or implicit actions such as clickthroughs. Existing methods using this information usually involved detrimental UI changes that can lower user satisfaction. Instead, we propose a new feedback scheme that makes use of existing UIs and does not alter user's browsing behaviour; to maximise retrieval performance over multiple result pages, we propose a novel retrieval optimisation framework and show that the optimal ranking policy should choose a diverse, exploratory ranking to display on the first page. Then, a personalised re-ranking of the next pages can be generated based on the user's feedback from the first page. We show that document correlations used in result diversification have a significant impact on relevance feedback and its effectiveness over a search session. TREC evaluations demonstrate that our optimal rank strategy (including approximative Monte Carlo Sampling) can naturally optimise the trade-off between exploration and exploitation and maximise the overall user's satisfaction over time against a number of similar baselines.
Spatio-temporal dynamics of online memes: a study of geo-tagged tweets BIBAFull-Text 667-678
  Krishna Y. Kamath; James Caverlee; Kyumin Lee; Zhiyuan Cheng
We conduct a study of the spatio-temporal dynamics of Twitter hashtags through a sample of 2 billion geo-tagged tweets. In our analysis, we (i) examine the impact of location, time, and distance on the adoption of hashtags, which is important for understanding meme diffusion and information propagation; (ii) examine the spatial propagation of hashtags through their focus, entropy, and spread; and (iii) present two methods that leverage the spatio-temporal propagation of hashtags to characterize locations. Based on this study, we find that although hashtags are a global phenomenon, the physical distance between locations is a strong constraint on the adoption of hashtags, both in terms of the hashtags shared between locations and in the timing of when these hashtags are adopted. We find both spatial and temporal locality as most hashtags spread over small geographical areas but at high speeds. We also find that hashtags are mostly a local phenomenon with long-tailed life spans. These (and other) findings have important implications for a variety of systems and applications, including targeted advertising, location-based services, social media search, and content delivery networks.
Accountable key infrastructure (AKI): a proposal for a public-key validation infrastructure BIBAFull-Text 679-690
  Tiffany Hyun-Jin Kim; Lin-Shung Huang; Adrian Perring; Collin Jackson; Virgil Gligor
Recent trends in public-key infrastructure research explore the tradeoff between decreased trust in Certificate Authorities (CAs), resilience against attacks, communication overhead (bandwidth and latency) for setting up an SSL/TLS connection, and availability with respect to verifiability of public key information. In this paper, we propose AKI as a new public-key validation infrastructure, to reduce the level of trust in CAs. AKI integrates an architecture for key revocation of all entities (e.g., CAs, domains) with an architecture for accountability of all infrastructure parties through checks-and-balances. AKI efficiently handles common certification operations, and gracefully handles catastrophic events such as domain key loss or compromise. We propose AKI to make progress towards a public-key validation infrastructure with key revocation that reduces trust in any single entity.
DIGTOBI: a recommendation system for Digg articles using probabilistic modeling BIBAFull-Text 691-702
  Younghoon Kim; Yoonjae Park; Kyuseok Shim
Digg is a social news website that lets people submit articles to share their favorite web pages (e.g. blog postings or news articles) and vote the articles posted by others. Digg service currently lists the articles in the front page by popularity without considering each user's preference to the topics in the articles. Helping users to find the most interesting Digg articles tailored to each user's own interests will be very useful, but it is not an easy task to classify the articles according to their topics in order to recommend the articles differently to each user.
   In this paper, we propose DIGTOBI, a personalized recommendation system for Digg articles using a novel probabilistic modeling. Our model considers the relevant articles with low Digg scores important as well. We show that our model can handle both warm-start and cold-start scenarios seamlessly through a single model. We next propose an EM algorithm to learn the parameters of our probabilistic model. Our performance study with Digg data confirms the effectiveness of DIGTOBI compared to the traditional recommendations algorithms.
Understanding latency variations of black box services BIBAFull-Text 703-714
  Darja Krushevskaja; Mark Sandler
Data centers run many services that impact millions of users daily. In reality, the latency of each service varies from one request to another. Existing tools allow to monitor services for performance glitches or service disruptions, but typically they do not help understanding the variations in latency.
   We propose a general framework for understanding performance of arbitrary black box services. We consider a stream of requests to a given service with their monitored attributes, as well as latencies of serving each request. We propose what we call the multi-dimensional f-measure, that helps for a given interval to identify the subset of monitored attributes that explains it. We design algorithms that use this measure not only for a fixed latency interval, but also to explain the entire range of latencies of the service by segmenting it into smaller intervals.
   We perform a detailed experimental study with synthetic data, as well as real data from a large search engine. Our experiments show that our methods automatically identify significant latency intervals together with request attributes that explain them, and are robust.
Diversified recommendation on graphs: pitfalls, measures, and algorithms BIBAFull-Text 715-726
  Onur Küçüktunç; Erik Saule; Kamer Kaya; Ümit V. Çatalyürek
Result diversification has gained a lot of attention as a way to answer ambiguous queries and to tackle the redundancy problem in the results. In the last decade, diversification has been applied on or integrated into the process of PageRank- or eigenvector-based methods that run on various graphs, including social networks, collaboration networks in academia, web and product co-purchasing graphs. For these applications, the diversification problem is usually addressed as a bicriteria objective optimization problem of relevance and diversity. However, such an approach is questionable since a query-oblivious diversification algorithm that recommends most of its results without even considering the query may perform the best on these commonly used measures. In this paper, we show the deficiencies of popular evaluation techniques of diversification methods, and investigate multiple relevance and diversity measures to understand whether they have any correlations. Next, we propose a novel measure called expanded relevance which combines both relevance and diversity into a single function in order to measure the coverage of the relevant part of the graph. We also present a new greedy diversification algorithm called BestCoverage, which optimizes the expanded relevance of the result set with (1-1/e)-approximation. With a rigorous experimentation on graphs from various applications, we show that the proposed method is efficient and effective for many use cases.
What is the added value of negative links in online social networks? BIBAFull-Text 727-736
  Jérôme Kunegis; Julia Preusse; Felix Schwagereit
We investigate the "negative link" feature of social networks that allows users to tag other users as foes or as distrusted in addition to the usual friend and trusted links. To answer the question whether negative links have an added value for an online social network, we investigate the machine learning problem of predicting the negative links of such a network using only the positive links as a basis, with the idea that if this problem can be solved with high accuracy, then the "negative link" feature is redundant. In doing so, we also present a general methodology for assessing the added value of any new link type in online social networks. Our evaluation is performed on two social networks that allow negative links: The technology news website Slashdot and the product review site Epinions. In experiments with these two datasets, we come to the conclusion that a combination of centrality-based and proximity-based link prediction functions can be used to predict the negative edges in the networks we analyse. We explain this result by an application of the models of preferential attachment and balance theory to our learning problem, and show that the "negative link" feature has a small but measurable added value for these social networks.
Voices of victory: a computational focus group framework for tracking opinion shift in real time BIBAFull-Text 737-748
  Yu-Ru Lin; Drew Margolin; Brian Keegan; David Lazer
Social media have been employed to assess public opinions on events, markets, and policies. Most current work focuses on either developing aggregated measures or opinion extraction methods like sentiment analysis. These approaches suffer from unpredictable turnover in the participants and the information they react to, making it difficult to distinguish meaningful shifts from those that follow from known information. We propose a novel approach to tame these sources of uncertainty through the introduction of "computational focus groups" to track opinion shifts in social media streams. Our approach uses prior user behaviors to detect users' biases, then groups users with similar biases together. We track the behavior streams from these like-minded sub-groups and present time-dependent collective measures of their opinions. These measures control for the response rate and base attitudes of the users, making shifts in opinion both easier to detect and easier to interpret. We test the effectiveness of our system by tracking groups' Twitter responses to a common stimulus set: the 2012 U.S. presidential election debates. While our groups' behavior is consistent with their biases, there are numerous moments and topics on which they behave "out of character," suggesting precise targets for follow-up inquiry. We also demonstrate that tracking elite users with well-established biases does not yield such insights, as they are insensitive to the stimulus and simply reproduce expected patterns. The effectiveness of our system suggests a new direction both for researchers and data-driven journalists interested in identifying opinion shifting processes in real-time.
Rethinking the web as a personal archive BIBAFull-Text 749-760
  Siân E. Lindley; Catherine C. Marshall; Richard Banks; Abigail Sellen; Tim Regan
In recent years the Web has evolved substantially, transforming from a place where we primarily find information to a place where we also leave, share and keep it. This presents a fresh set of challenges for the management of personal information, which include how to underpin greater awareness and more control over digital belongings and other personally meaningful content that is hosted online. In the study reported here, we follow up on research that suggests a sense of ownership and control can be reinforced by federating online content as a virtual, single store; we do this by conducting interviews with 14 individuals about their Web-based content. Participants were asked to give the researchers a tour of online content that is personally meaningful to them; to perform a search for themselves in order to uncover additional content; and to respond to a series of design envisionments. We examine whether there is any value in an integrated personal archive that would automatically update and serve firstly, as a source of information regarding the content within it (e.g. where it is stored, who has the rights to it), and secondly, as a resource for crafting personal artefacts such as scrapbooks, CVs and gifts for others. Our analysis leads us to reject the concept of a single archive. Instead, we present a framework of five different types of online content, each of which has separate implications for personal information management.
Expressive languages for selecting groups from graph-structured data BIBAFull-Text 761-770
  Vitaliy Liptchinsky; Benjamin Satzger; Rostyslav Zabolotnyi; Schahram Dustdar
Many query languages for graph-structured data are based on regular path expressions, which describe relations among pairs of nodes. We propose an extension that allows to retrieve groups of nodes based on group structural characteristics and relations to other nodes or groups. It allows to express group selection queries in a concise and natural style, and can be integrated into any query language based on regular path queries. We present an efficient algorithm for evaluating group queries in polynomial time from an input data graph. Evaluations using real-world social networks demonstrate the practical feasibility of our approach.
Modeling/predicting the evolution trend of osn-based applications BIBAFull-Text 771-780
  Han Liu; Atif Nazir; Jinoo Joung; Chen-Nee Chuah
While various models have been proposed for generating social/friendship network graphs, the dynamics of user interactions through online social network (OSN) based applications remain largely unexplored. We previously developed a growth model to capture static weekly snapshots of user activity graphs (UAGs) using data from popular Facebook gifting applications. This paper presents a new continuous graph evolution model aimed to capture microscopic user-level behaviors that govern the growth of the UAG and collectively define the overall graph structure. We demonstrate the utility of our model by applying it to forecast the number of active users over time as the application transitions from initial growth to peak/mature and decline/fatigue phase. Using empirical evaluations, we show that our model can accurately reproduce the evolution trend of active user population for gifting applications, or other OSN applications that employ similar growth mechanisms. We also demonstrate that the predictions from our model can guide the generation of synthetic graphs that accurately represent empirical UAG snapshots sampled at different evolution stages.
SoCo: a social network aided context-aware recommender system BIBAFull-Text 781-802
  Xin Liu; Karl Aberer
Contexts and social network information have been proven to be valuable information for building accurate recommender system. However, to the best of our knowledge, no existing works systematically combine diverse types of such information to further improve recommendation quality. In this paper, we propose SoCo, a novel context-aware recommender system incorporating elaborately processed social network information. We handle contextual information by applying random decision trees to partition the original user-item-rating matrix such that the ratings with similar contexts are grouped. Matrix factorization is then employed to predict missing preference of a user for an item using the partitioned matrix. In order to incorporate social network information, we introduce an additional social regularization term to the matrix factorization objective function to infer a user's preference for an item by learning opinions from his/her friends who are expected to share similar tastes. A context-aware version of Pearson Correlation Coefficient is proposed to measure user similarity. Real datasets based experiments show that SoCo improves the performance (in terms of root mean square error) of the state-of-the-art context-aware recommender system and social recommendation model by 15.7% and 12.2% respectively.
Using stranger as sensors: temporal and geo-sensitive question answering via social media BIBAFull-Text 803-814
  Yefeng Liu; Todorka Alexandrova; Tatsuo Nakajima
MoboQ is a location-based real-time social question answering service deployed in the field in China. Using MoboQ, people can ask temporal and geo-sensitive questions, such as how long is the line at a popular business right now, and then receive answers that crowdsourced from other users in a timely fashion. To obtain answers for questions, the system analyzes the live stream from public microblogging service Sina Weibo to identify people who are likely to currently be at the place that is associated with a question and sends them the unsolicited question through the microblogging service from which they were identified. MoboQ was deployed in China at the beginning of 2012, until October of the same year, it was used to ask 15,224 questions by 35,214 registered users, and it gathered 29,491 answers; 74.6% of the questions received at least one answer, 28% received a first response within 10 minutes, and 51% of the questions got first answer within 20 minutes. In total, 91% of the questions successfully found at least one answer candidate, and they were sent to 162,954 microblogging service users. We analyze the usage patterns and behaviors of the real-world end-users, discuss the lessons learned, and outline the future directions and possible applications that could be built on top of MoboQ.
Imagen: runtime migration of browser sessions for javascript web applications BIBAFull-Text 815-826
  James Teng Kin Lo; Eric Wohlstadter; Ali Mesbah
Due to the increasing complexity of web applications and emerging HTML5 standards, a large amount of runtime state is created and managed in the user's browser. While such complexity is desirable for user experience, it makes it hard for developers to implement mechanisms that provide users ubiquitous access to the data they create during application use. This paper presents our research into browser session migration for JavaScript-based web applications. Session migration is the act of transferring a session between browsers at runtime. Without burden to developers, our system allows users to create a snapshot image that captures all runtime state needed to resume the session elsewhere. Our system works completely in the JavaScript layer and thus snapshots can be transfered between different browser vendors and hardware devices. We report on performance metrics of the system using five applications, four different browsers, and three different devices.
Gender swapping and user behaviors in online social games BIBAFull-Text 827-836
  Jing-Kai Lou; Kunwoo Park; Meeyoung Cha; Juyong Park; Chin-Laung Lei; Kuan-Ta Chen
Modern Massively Multiplayer Online Role-Playing Games (MMORPGs) provide lifelike virtual environments in which players can conduct a variety of activities including combat, trade, and chat with other players. While the game world and the available actions therein are inspired by their offline counterparts, the games' popularity and dedicated fan base are testaments to the allure of novel social interactions granted to people by allowing them an alternative life as a new character and persona. In this paper we investigate the phenomenon of "gender swapping," which refers to players choosing avatars of genders opposite to their natural ones. We report the behavioral patterns observed in players of Fairyland Online, a globally serviced MMORPG, during social interactions when playing as in-game avatars of their own real gender or gender-swapped. We also discuss the effect of gender role and self-image in virtual social situations and the potential of our study for improving MMORPG quality and detecting online identity frauds.
Mining structural hole spanners through information diffusion in social networks BIBAFull-Text 837-847
  Tiancheng Lou; Jie Tang
The theory of structural holes suggests that individuals would benefit from filling the "holes" (called as structural hole spanners) between people or groups that are otherwise disconnected. A few empirical studies have verified that structural hole spanners play a key role in the information diffusion. However, there is still lack of a principled methodology to detect structural hole spanners from a given social network.
   In this work, we precisely define the problem of mining top-k structural hole spanners in large-scale social networks and provide an objective (quality) function to formalize the problem. Two instantiation models have been developed to implement the objective function. For the first model, we present an exact algorithm to solve it and prove its convergence. As for the second model, the optimization is proved to be NP-hard, and we design an efficient algorithm with provable approximation guarantees.
   We test the proposed models on three different networks: Coauthor, Twitter, and Inventor. Our study provides evidence for the theory of structural holes, e.g., 1% of Twitter users who span structural holes control 25% of the information diffusion on Twitter. We compare the proposed models with several alternative methods and the results show that our models clearly outperform the comparison methods. Our experiments also demonstrate that the detected structural hole spanners can help other social network applications, such as community kernel detection and link prediction. To the best of our knowledge, this is the first attempt to address the problem of mining structural hole spanners in large social networks.
On the evolution of the internet economic ecosystem BIBAFull-Text 849-860
  Richard T. B. Ma; John C. S. Lui; Vishal Misra
The evolution of the Internet has manifested itself in many ways: the traffic characteristics, the interconnection topologies and the business relationships among the autonomous components. It is important to understand why (and how) this evolution came about, and how the interplay of these dynamics may affect future evolution and services. We propose a network aware, macroscopic model that captures the characteristics and interactions of the application and network providers, and show how it leads to a market equilibrium of the ecosystem. By analyzing the driving forces and the dynamics of the market equilibrium, we obtain some fundamental understandings of the cause and effect of the Internet evolution, which explain why some historical and recent evolutions have happened. Furthermore, by projecting the likely future evolutions, our model can help application and network providers to make informed business decisions so as to succeed in this competitive ecosystem.
Two years of short URLs internet measurement: security threats and countermeasures BIBAFull-Text 861-872
  Federico Maggi; Alessandro Frossi; Stefano Zanero; Gianluca Stringhini; Brett Stone-Gross; Christopher Kruegel; Giovanni Vigna
URL shortening services have become extremely popular. However, it is still unclear whether they are an effective and reliable tool that can be leveraged to hide malicious URLs, and to what extent these abuses can impact the end users. With these questions in mind, we first analyzed existing countermeasures adopted by popular shortening services. Surprisingly, we found such countermeasures to be ineffective and trivial to bypass. This first measurement motivated us to proceed further with a large-scale collection of the HTTP interactions that originate when web users access live pages that contain short URLs. To this end, we monitored 622 distinct URL shortening services between March 2010 and April 2012, and collected 24,953,881 distinct short URLs. With this large dataset, we studied the abuse of short URLs. Despite short URLs are a significant, new security risk, in accordance with the reports resulting from the observation of the overall phishing and spamming activity, we found that only a relatively small fraction of users ever encountered malicious short URLs. Interestingly, during the second year of measurement, we noticed an increased percentage of short URLs being abused for drive-by download campaigns and a decreased percentage of short URLs being abused for spam campaigns. In addition to these security-related findings, our unique monitoring infrastructure and large dataset allowed us to complement previous research on short URLs and analyze these web services from the user's perspective.
Know your personalization: learning topic level personalization in online services BIBAFull-Text 873-884
  Anirban Majumder; Nisheeth Shrivastava
Online service platforms (OSPs), such as search engines, news-websites, ad-providers, etc., serve highly personalized content to the user, based on the profile extracted from her history with the OSP. In this paper, we capture OSP's personalization for an user in a new data structure called the personalization vector (?), which is a weighted vector over a set of topics, and present efficient algorithms to learn it.
   Our approach treats OSPs as black-boxes, and extracts η by mining only their output, specifically, the personalized (for an user) and vanilla (without any user information) contents served, and the differences in these content. We believe that such treatment of OSPs is a unique aspect of our work, not just enabling access to (so far hidden) profiles in OSPs, but also providing a novel and practical approach for retrieving information from OSPs by mining differences in their outputs.
   We formulate a new model called Latent Topic Personalization (LTP) that captures the personalization vector in a learning framework and present efficient inference algorithms for determining it. We perform extensive experiments targeting search engine personalization, using data from both real Google users and synthetic setup. Our results indicate that LTP achieves high accuracy (R-pre = 84%) in discovering personalized topics. For Google data, our qualitative results demonstrate that the topics determined by LTP for a user correspond well to his ad-categories determined by Google.
Saving, reusing, and remixing web video: using attitudes and practices to reveal social norms BIBAFull-Text 885-896
  Catherine C. Marshall; Frank M. Shipman
The growth of online videos has spurred a concomitant increase in the storage, reuse, and remix of this content. As we gain more experience with video content, social norms about ownership have evolved accordingly, spelling out what people think is appropriate use of content that is not necessarily their own. We use a series of three studies, each centering on a different genre of recordings, to probe 634 participants' attitudes toward video storage, reuse, and remix; we also question participants about their own experiences with online video. The results allow us to characterize current practice and emerging social norms and to establish the relationship between the two. Hypotheticals borrowed from legal research are used as the primary vehicle for testing attitudes, and for identifying boundaries between socially acceptable and unacceptable behavior.
From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews BIBAFull-Text 897-908
  Julian John McAuley; Jure Leskovec
Recommending products to consumers means not only understanding their tastes, but also understanding their level of experience. For example, it would be a mistake to recommend the iconic film Seven Samurai simply because a user enjoys other action movies; rather, we might conclude that they will eventually enjoy it -- once they are ready. The same is true for beers, wines, gourmet foods -- or any products where users have acquired tastes: the 'best' products may not be the most 'accessible'. Thus our goal in this paper is to recommend products that a user will enjoy now, while acknowledging that their tastes may have changed over time, and may change again in the future. We model how tastes change due to the very act of consuming more products -- in other words, as users become more experienced. We develop a latent factor recommendation system that explicitly accounts for each user's level of experience. We find that such a model not only leads to better recommendations, but also allows us to study the role of user experience and expertise on a novel dataset of fifteen million beer, wine, food, and movie reviews.
The FLDA model for aspect-based opinion mining: addressing the cold start problem BIBAFull-Text 909-918
  Samaneh Moghaddam; Martin Ester
Aspect-based opinion mining from online reviews has attracted a lot of attention recently. The main goal of all of the proposed methods is extracting aspects and/or estimating aspect ratings. Recent works, which are often based on Latent Dirichlet Allocation (LDA), consider both tasks simultaneously. These models are normally trained at the item level, i.e., a model is learned for each item separately. Learning a model per item is fine when the item has been reviewed extensively and has enough training data. However, in real-life data sets such as those from Epinions.com and Amazon.com more than 90% of items have less than 10 reviews, so-called cold start items. State-of-the-art LDA models for aspect-based opinion mining are trained at the item level and therefore perform poorly for cold start items due to the lack of sufficient training data. In this paper, we propose a probabilistic graphical model based on LDA, called Factorized LDA (FLDA), to address the cold start problem. The underlying assumption of FLDA is that aspects and ratings of a review are influenced not only by the item but also by the reviewer. It further assumes that both items and reviewers can be modeled by a set of latent factors which represent their aspect and rating distributions. Different from state-of-the-art LDA models, FLDA is trained at the category level and learns the latent factors using the reviews of all the items of a category, in particular the non cold start items, and uses them as prior for cold start items. Our experiments on three real-life data sets demonstrate the improved effectiveness of the FLDA model in terms of likelihood of the held-out test set. We also evaluate the accuracy of FLDA based on two application-oriented measures.
Iolaus: securing online content rating systems BIBAFull-Text 919-930
  Arash Molavi Kakhki; Chloe Kliman-Silver; Alan Mislove
Online content ratings services allow users to find and share content ranging from news articles (Digg) to videos (YouTube) to businesses (Yelp). Generally, these sites allow users to create accounts, declare friendships, upload and rate content, and locate new content by leveraging the aggregated ratings of others. These services are becoming increasingly popular; Yelp alone has over 33 million reviews. Unfortunately, this popularity is leading to increasing levels of malicious activity, including multiple identity (Sybil) attacks and the "buying" of ratings from users.
   In this paper, we present Iolaus, a system that leverages the underlying social network of online content rating systems to defend against such attacks. Iolaus uses two novel techniques: (a) weighing ratings to defend against multiple identity attacks and (b) relative ratings to mitigate the effect of "bought" ratings. An evaluation of Iolaus using microbenchmarks, synthetic data, and real-world content rating data demonstrates that Iolaus is able to outperform existing approaches and serve as a practical defense against multiple-identity and rating-buying attacks.
On cognition, emotion, and interaction aspects of search tasks with different search intentions BIBAFull-Text 931-942
  Yashar Moshfeghi; Joemon M. Jose
The complex and dynamic nature of search processes surrounding information seeking have been exhaustively studied. Recent studies have highlighted search processes with different intentions, such as those for entertainment purposes or re-finding a visited information object, are fundamentally different in nature to typical information seeking intentions. Despite the popularity of such search processes on the Web, they have not yet been thoroughly explored. Using a video retrieval system as a use case, we study the characteristics of four different search task types: seeking information, re-finding a particular information object, and two different entertainment intentions (i.e. entertainment by adjusting arousal level, and entertainment by adjusting mood). In particular, we looked at the cognition, emotion and action aspects of these search tasks at different phases of a search process. This follows the common assumption in the information seeking and retrieval community that a complex search process can be broken down into a relatively small number of activity phases. Our experimental results show significant differences in the characteristics of studied search tasks. Furthermore, we investigate whether we can predict these search tasks given user's interaction with the system. Results show that we can learn a model that predicts the search task types with reasonable accuracy. Overall, these findings may help to steer search engines to better satisfy searchers' needs beyond typically assumed information seeking processes.
Ad impression forecasting for sponsored search BIBAFull-Text 943-952
  Abhirup Nath; Shibnath Mukherjee; Prateek Jain; Navin Goyal; Srivatsan Laxman
A typical problem for a search engine (hosting sponsored search service) is to provide the advertisers with a forecast of the number of impressions his/her ad is likely to obtain for a given bid. Accurate forecasts have high business value, since they enable advertisers to select bids that lead to better returns on their investment. They also play an important role in services such as automatic campaign optimization. Despite its importance the problem has remained relatively unexplored in literature. Existing methods typically overfit to the training data, leading to inconsistent performance. Furthermore, some of the existing methods cannot provide predictions for new ads, i.e., for ads that are not present in the logs. In this paper, we develop a generative model based approach that addresses these drawbacks. We design a Bayes net to capture inter-dependencies between the query traffic features and the competitors in an auction. Furthermore, we account for variability in the volume of query traffic by using a dynamic linear model. Finally, we implement our approach on a production grade MapReduce framework and conduct extensive large scale experiments on substantial volumes of sponsored search data from Bing. Our experimental results demonstrate significant advantages over existing methods as measured using several accuracy/error criteria, improved ability to provide estimates for new ads and more consistent performance with smaller variance in accuracies. Our method can also be adapted to several other related forecasting problems such as predicting average position of ads or the number of clicks under budget constraints.
Measurement and modeling of eye-mouse behavior in the presence of nonlinear page layouts BIBAFull-Text 953-964
  Vidhya Navalpakkam; LaDawn Jentzsch; Rory Sayres; Sujith Ravi; Amr Ahmed; Alex Smola
As search pages are becoming increasingly complex, with images and nonlinear page layouts, understanding how users examine the page is important. We present a lab study on the effect of a rich informational panel to the right of the search result column, on eye and mouse behavior. Using eye and mouse data, we show that the flow of user attention on nonlinear page layouts is different from the widely believed top-down linear examination order of search results. We further demonstrate that the mouse, like the eye, is sensitive to two key attributes of page elements -- their position (layout), and their relevance to the user's task. We identify mouse measures that are strongly correlated with eye movements, and develop models to predict user attention (eye gaze) from mouse activity. These findings show that mouse tracking can be used to infer user attention and information flow patterns on search pages. Potential applications include ranking, search page optimization, and UI evaluation.
Understanding and decreasing the network footprint of catch-up tv BIBAFull-Text 965-976
  Gianfranco Nencioni; Nishanth Sastry; Jigna Chandaria; Jon Crowcroft
"Catch-up", or on-demand access of previously broadcast TV content over the public Internet, constitutes a significant fraction of peak time network traffic. This paper analyses consumption patterns of nearly 6 million users of a nationwide deployment of a catch-up TV service, to understand the network support required. We find that catch-up has certain natural scaling properties compared to traditional TV: The on-demand nature spreads load over time, and users have much higher completion rates for content streams than previously reported. Users exhibit strong preferences for serialised content, and for specific genres.
   Exploiting this, we design a Speculative Content Offloading and Recording Engine (SCORE) that predictively records a personalised set of shows on user-local storage, and thereby offloads traffic that might result from subsequent catch-up access. Evaluations show that even with a modest storage of 32GB, an oracle with complete knowledge of user consumption can save up to 74% of the energy, and 97% of the peak bandwidth compared to the current IP streaming-based architecture. In the best case, optimising for energy consumption, SCORE can recover more than 60% of the traffic and energy savings achieved by the oracle. Optimising purely for traffic rather than energy can reduce bandwidth by an additional 5%.
Sorry, i don't speak SPARQL: translating SPARQL queries into natural language BIBAFull-Text 977-988
  Axel-Cyrille Ngonga Ngomo; Lorenz Bühmann; Christina Unger; Jens Lehmann; Daniel Gerber
Over the past years, Semantic Web and Linked Data technologies have reached the backend of a considerable number of applications. Consequently, large amounts of RDF data are constantly being made available across the planet. While experts can easily gather information from this wealth of data by using the W3C standard query language SPARQL, most lay users lack the expertise necessary to proficiently interact with these applications. Consequently, non-expert users usually have to rely on forms, query builders, question answering or keyword search tools to access RDF data. However, these tools have so far been unable to explicate the queries they generate to lay users, making it difficult for these users to i) assess the correctness of the query generated out of their input, and ii) to adapt their queries or iii) to choose in an informed manner between possible interpretations of their input. This paper addresses this drawback by presenting SPARQL2NL, a generic approach that allows verbalizing SPARQL queries, i.e., converting them into natural language. Our framework can be integrated into applications where lay users are required to understand SPARQL or to generate SPARQL queries in a direct (forms, query builders) or an indirect (keyword search, question answering) manner. We evaluate our approach on the DBpedia question set provided by QALD-2 within a survey setting with both SPARQL experts and lay users. The results of the 115 filled surveys show that SPARQL2NL can generate complete and easily understandable natural language descriptions. In addition, our results suggest that even SPARQL experts can process the natural language representation of SPARQL queries computed by our approach more efficiently than the corresponding SPARQL queries. Moreover, non-experts are enabled to reliably understand the content of SPARQL queries.
Bitsquatting: exploiting bit-flips for fun, or profit? BIBAFull-Text 989-998
  Nick Nikiforakis; Steven Van Acker; Wannes Meert; Lieven Desmet; Frank Piessens; Wouter Joosen
Over the last fifteen years, several types of attacks against domain names and the companies relying on them have been observed. The well-known cybersquatting of domain names gave way to typosquatting, the abuse of a user's mistakes when typing a URL in her browser's address bar. Recently, a new attack against domain names surfaced, namely bitsquatting. In bitsquatting, an attacker leverages random bit-errors occurring in the memory of commodity computers and smartphones, to redirect Internet traffic to attacker-controlled domains.
   In this paper, we report on a large-scale experiment, measuring the adoption of bitsquatting by the domain-squatting community through the tracking of registrations of bitsquatting domains targeting popular web sites over a 9-month period. We show how new bitsquatting domains are registered daily and how attackers are trying to monetize their domains through the use of ads, abuse of affiliate programs and even malware installations. Lastly, given the discovered prevalence of bitsquatting, we review possible defense measures that companies, software developers and Internet Service Providers can use to protect against it.
One-class collaborative filtering with random graphs BIBAFull-Text 999-1008
  Ulrich Paquet; Noam Koenigstein
The bane of one-class collaborative filtering is interpreting and modelling the latent signal from the missing class. In this paper we present a novel Bayesian generative model for implicit collaborative filtering. It forms a core component of the Xbox Live architecture, and unlike previous approaches, delineates the odds of a user disliking an item from simply being unaware of it. The latent signal is treated as an unobserved random graph connecting users with items they might have encountered. We demonstrate how large-scale distributed learning can be achieved through a combination of stochastic gradient descent and mean field variational inference over random graph samples. A fine-grained comparison is done against a state of the art baseline on real world data.
Latent credibility analysis BIBAFull-Text 1009-1020
  Jeff Pasternack; Dan Roth
A frequent problem when dealing with data gathered from multiple sources on the web (ranging from booksellers to Wikipedia pages to stock analyst predictions) is that these sources disagree, and we must decide which of their (often mutually exclusive) claims we should accept. Current state-of-the-art information credibility algorithms known as "fact-finders" are transitive voting systems with rules specifying how votes iteratively flow from sources to claims and then back to sources. While this is quite tractable and often effective, fact-finders also suffer from substantial limitations; in particular, a lack of transparency obfuscates their credibility decisions and makes them difficult to adapt and analyze: knowing the mechanics of how votes are calculated does not readily tell us what those votes mean, and finding, for example, that a source has a score of 6 is not informative. We introduce a new approach to information credibility, Latent Credibility Analysis (LCA), constructing strongly principled, probabilistic models where the truth of each claim is a latent variable and the credibility of a source is captured by a set of model parameters. This gives LCA models clear semantics and modularity that make extending them to capture additional observed and latent credibility factors straightforward. Experiments over four real-world datasets demonstrate that LCA models can outperform the best fact-finders in both unsupervised and semi-supervised settings.
Predicting group stability in online social networks BIBAFull-Text 1021-1030
  Akshay Patil; Juan Liu; Jie Gao
Social groups often exhibit a high degree of dynamism. Some groups thrive, while many others die over time. Modeling group stability dynamics and understanding whether/when a group will remain stable or shrink over time can be important in a number of social domains. In this paper, we study two different types of social networks as exemplar platforms for modeling and predicting group stability dynamics. We build models to predict if a group is going to remain stable or is likely to shrink over a period of time. We observe that both the level of member diversity and social activities are critical in maintaining the stability of groups. We also find that certain 'prolific' members play a more important role in maintaining the group stability. Our study shows that group stability can be predicted with high accuracy, and feature diversity is critical to prediction performance.
Predictive web automation assistant for people with vision impairments BIBAFull-Text 1031-1040
  Yury Puzis; Yevgen Borodin; Rami Puzis; I. V. Ramakrishnan
The Web is far less usable and accessible for people with vision impairments than it is for sighted people. Web automation, a process of automating browsing actions on behalf of the user, has the potential to bridge the divide between the ways sighted and people with vision impairment access the Web; specifically, it can enable the latter to breeze through web browsing tasks that beforehand were slow, hard, or even impossible to accomplish. Typical web automation requires that the user record a macro, a sequence of browsing steps, so that these steps can be automated in the future by replaying the macro. However, for people with vision impairment, automation with macros is not usable.
   In this paper, we propose a novel model-based approach that facilitates web automation without having to either record or replay macros. Using the past browsing history and the current web page as the browsing context, the proposed model can predict the most probable browsing actions that the user can do. The model construction is "unsupervised". More importantly, the model is continuously and incrementally updated as history evolves, thereby, ensuring the predictions are not "outdated".
   We also describe a novel interface that lets the user focus on the objects associated with the most probable predicted browsing steps (e.g., clicking links and filling out forms), and facilitates automatic execution of the selected steps. A study with 19 blind participants showed that the proposed approach dramatically reduced the interaction time needed to accomplish typical browsing tasks, and the user interface was perceived to be much more usable than the standard screen-reading interfaces.
Mining collective intelligence in diverse groups BIBAFull-Text 1041-1052
  Guo-Jun Qi; Charu C. Aggarwal; Jiawei Han; Thomas Huang
Collective intelligence, which aggregates the shared information from large crowds, is often negatively impacted by unreliable information sources with the low quality data. This becomes a barrier to the effective use of collective intelligence in a variety of applications. In order to address this issue, we propose a probabilistic model to jointly assess the reliability of sources and find the true data. We observe that different sources are often not independent of each other. Instead, sources are prone to be mutually influenced, which makes them dependent when sharing information with each other. High dependency between sources makes collective intelligence vulnerable to the overuse of redundant (and possibly incorrect) information from the dependent sources. Thus, we reveal the latent group structure among dependent sources, and aggregate the information at the group level rather than from individual sources directly. This can prevent the collective intelligence from being inappropriately dominated by dependent sources. We will also explicitly reveal the reliability of groups, and minimize the negative impacts of unreliable groups. Experimental results on real-world data sets show the effectiveness of the proposed approach with respect to existing algorithms.
Trade area analysis using user generated mobile location data BIBAFull-Text 1053-1064
  Yan Qu; Jun Zhang
In this paper, we illustrate how User Generated Mobile Location Data (UGMLD) like Foursquare check-ins can be used in Trade Area Analysis (TAA) by introducing a new framework and corresponding analytic methods. Three key processes were created: identifying the activity center of a mobile user, profiling users based on their location history, and modeling users' preference probability. Extensions to traditional TAA are introduced, including customer-centric distance decay analysis and check-in sequence analysis. Adopting the rich content and context of UGMLD, these methods introduce new dimensions to modeling and delineating trade areas. Analyzing customers' visits to a business in the context of their daily life sheds new light on the nature and performance of the venue. This work has important business implications in the field of mobile computing.
Psychological maps 2.0: a web engagement enterprise starting in London BIBAFull-Text 1065-1076
  Daniele Quercia; Joao Paulo Pesce; Virgilio Almeida; Jon Crowcroft
Planners and social psychologists have suggested that the recognizability of the urban environment is linked to people's socio-economic well-being. We build a web game that puts the recognizability of London's streets to the test. It follows as closely as possible one experiment done by Stanley Milgram in 1972. The game picks up random locations from Google Street View and tests users to see if they can judge the location in terms of closest subway station, borough, or region. Each participant dedicates only few minutes to the task (as opposed to 90 minutes in Milgram's). We collect data from 2,255 participants (one order of magnitude a larger sample) and build a recognizability map of London based on their responses. We find that some boroughs have little cognitive representation; that recognizability of an area is explained partly by its exposure to Flickr and Foursquare users and mostly by its exposure to subway passengers; and that areas with low recognizability do not fare any worse on the economic indicators of income, education, and employment, but they do significantly suffer from social problems of housing deprivation, poor living conditions, and crime. These results could not have been produced without analyzing life off- and online: that is, without considering the interactions between urban places in the physical world and their virtual presence on platforms such as Flickr and Foursquare. This line of work is at the crossroad of two emerging themes in computing research -- a crossroad where "web science" meets the "smart city" agenda.
Towards realistic team formation in social networks based on densest subgraphs BIBAFull-Text 1077-1088
  Syama Sundar Rangapuram; Thomas Bühler; Matthias Hein
Given a task T, a set of experts V with multiple skills and a social network G(V, W) reflecting the compatibility among the experts, team formation is the problem of identifying a team C ⊆ V that is both competent in performing the task T and compatible in working together. Existing methods for this problem make too restrictive assumptions and thus cannot model practical scenarios. The goal of this paper is to consider the team formation problem in a realistic setting and present a novel formulation based on densest subgraphs. Our formulation allows modeling of many natural requirements such as (i) inclusion of a designated team leader and/or a group of given experts, (ii) restriction of the size or more generally cost of the team (iii) enforcing locality of the team, e.g., in a geographical sense or social sense, etc. The proposed formulation leads to a generalized version of the classical densest subgraph problem with cardinality constraints (DSP), which is an NP hard problem and has many applications in social network analysis. In this paper, we present a new method for (approximately) solving the generalized DSP (GDSP). Our method, FORTE, is based on solving an equivalent continuous relaxation of GDSP. The solution found by our method has a quality guarantee and always satisfies the constraints of GDSP. Experiments show that the proposed formulation (GDSP) is useful in modeling a broader range of team formation problems and that our method produces more coherent and compact teams of high quality. We also show, with the help of an LP relaxation of GDSP, that our method gives close to optimal solutions to GDSP.
Efficient community detection in large networks using content and links BIBAFull-Text 1089-1098
  Yiye Ruan; David Fuhry; Srinivasan Parthasarathy
In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks.
   Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient.
   We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering.
   Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.
Learning joint query interpretation and response ranking BIBAFull-Text 1099-1110
  Uma Sawant; Soumen Chakrabarti
Thanks to information extraction and semantic Web efforts, search on unstructured text is increasingly refined using semantic annotations and structured knowledge bases. However, most users cannot become familiar with the schema of knowledge bases and ask structured queries. Interpreting free-format queries into a more structured representation is of much current interest. The dominant paradigm is to segment or partition query tokens by purpose (references to types, entities, attribute names, attribute values, relations) and then launch the interpreted query on structured knowledge bases. Given that structured knowledge extraction is never complete, here we choose a less trodden path: a data representation that retains the unstructured text corpus, along with structured annotations (mentions of entities and relationships) on it. We propose two new, natural formulations for joint query interpretation and response ranking that exploit bidirectional flow of information between the knowledge base and the corpus. One, inspired by probabilistic language models, computes expected response scores over the uncertainties of query interpretation. The other is based on max-margin discriminative learning, with latent variables representing those uncertainties. In the context of typed entity search, both formulations bridge a considerable part of the accuracy gap between a generic query that does not constrain the type at all, and the upper bound where the "perfect" target entity type of each query is provided by humans. Our formulations are also superior to a two-stage approach of first choosing a target type using recent query type prediction techniques, and then launching a type-restricted entity search query.
A model for green design of online news media services BIBAFull-Text 1111-1122
  Daniel Schien; Paul Shabajee; Stephen G. Wood; Chris Preist
The use of information and communication technology and the web-based products it provides is responsible for significant emissions of greenhouse gases. In order to enable the reduction of emissions during the design of such products, it is necessary to estimate as accurately as possible their carbon impact over the entire product system. In this work we describe a new method which combines models of energy consumption during the use of digital media with models of the behavior of the audience. We apply this method to conduct an assessment of the annual carbon emissions for the product suite of a major international news organization. We then demonstrate its use for green design by evaluating the impacts of five different interventions on the product suite. We find that carbon footprint of the online newspaper amounts to approximately 7700 tCO2e per year, of which 75% are caused by the user devices. Among the evaluated scenarios a significant uptake of eReaders in favor of PCs has the greatest reduction potential. Our results also show that even a significant reduction of data volume on a web page would only result in small overall energy savings.
Potential networks, contagious communities, and understanding social network structure BIBAFull-Text 1123-1132
  Grant Schoenebeck
In this paper we study how the network of agents adopting a particular technology relates to the structure of the underlying network over which the technology adoption spreads. We develop a model and show that the network of agents adopting a particular technology may have characteristics that differ significantly from the social network of agents over which the technology spreads. For example, the network induced by a cascade may have a heavy-tailed degree distribution even if the original network does not.
   This provides evidence that online social networks created by technology adoption over an underlying social network may look fundamentally different from social networks and indicates that using data from many online social networks may mislead us if we try to use it to directly infer the structure of social networks. Our results provide an alternate explanation for certain properties repeatedly observed in data sets, for example: heavy-tailed degree distribution, network densification, shrinking diameter, and network community profile. These properties could be caused by a sort of sampling bias rather than by attributes of the underlying social structure. By generating networks using cascades over traditional network models that do not themselves contain these properties, we can nevertheless reliably produce networks that contain all these properties.
   An opportunity for interesting future research is developing new methods that correctly infer underlying network structure from data about a network that is generated via a cascade spread over the underlying network.
Do social explanations work?: studying and modeling the effects of social explanations in recommender systems BIBAFull-Text 1133-1144
  Amit Sharma; Dan Cosley
Recommender systems associated with social networks often use social explanations (e.g. "X, Y and 2 friends like this") to support the recommendations. We present a study of the effects of these social explanations in a music recommendation context. We start with an experiment with 237 users, in which we show explanations with varying levels of social information and analyze their effect on users' decisions. We distinguish between two key decisions: the likelihood of checking out the recommended artist, and the actual rating of the artist based on listening to several songs. We find that while the explanations do have some influence on the likelihood, there is little correlation between the likelihood and actual (listening) rating for the same artist. Based on these insights, we present a generative probabilistic model that explains the interplay between explanations and background information on music preferences, and how that leads to a final likelihood rating for an artist. Acknowledging the impact of explanations, we discuss a general recommendation framework that models external informational elements in the recommendation interface, in addition to inherent preferences of users.
Question answering on interlinked data BIBAFull-Text 1145-1156
  Saeedeh Shekarpour; Axel-Cyrille Ngonga Ngomo; Sören Auer
The Data Web contains a wealth of knowledge on a large number of domains. Question answering over interlinked data sources is challenging due to two inherent characteristics. First, different datasets employ heterogeneous schemas and each one may only contain a part of the answer for a certain question. Second, constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels. We present a question answering system, which transforms user supplied queries (i.e. natural language sentences or keywords) into conjunctive SPARQL queries over a set of interlinked data sources. The contribution of this paper is two-fold: Firstly, we introduce a novel approach for determining the most suitable resources for a user-supplied query from different datasets (disambiguation). We employ a hidden Markov model, whose parameters were bootstrapped with different distribution functions. Secondly, we present a novel method for constructing a federated formal queries using the disambiguated resources and leveraging the linking structure of the underlying datasets. This approach essentially relies on a combination of domain and range inference as well as a link traversal method for constructing a connected graph which ultimately renders a corresponding SPARQL query. The results of our evaluation with three life-science datasets and 25 benchmark queries demonstrate the effectiveness of our approach.
Pricing mechanisms for crowdsourcing markets BIBAFull-Text 1157-1166
  Yaron Singer; Manas Mittal
Every day millions of crowdsourcing tasks are performed in exchange for payments. Despite the important role pricing plays in crowdsourcing campaigns and the complexity of the market, most platforms do not provide requesters appropriate tools for effective pricing and allocation of tasks.
   In this paper, we introduce a framework for designing mechanisms with provable guarantees in crowdsourcing markets. The framework enables automating the process of pricing and allocation of tasks for requesters in complex markets like Amazon's Mechanical Turk where workers arrive in an online fashion and requesters face budget constraints and task completion deadlines. We present constant-competitive incentive compatible mechanisms for maximizing the number of tasks under a budget, and for minimizing payments given a fixed number of tasks to complete. To demonstrate the effectiveness of this framework we created a platform that enables applying pricing mechanisms in markets like Mechanical Turk. The platform allows us to show that the mechanisms we present here work well in practice, as well as to give experimental evidence to workers' strategic behavior in absence of appropriate incentive schemes.
Truthful incentives in crowdsourcing tasks using regret minimization mechanisms BIBAFull-Text 1167-1178
  Adish Singla; Andreas Krause
What price should be offered to a worker for a task in an online labor market? How can one enable workers to express the amount they desire to receive for the task completion? Designing optimal pricing policies and determining the right monetary incentives is central to maximizing requester's utility and workers' profits. Yet, current crowdsourcing platforms only offer a limited capability to the requester in designing the pricing policies and often rules of thumb are used to price tasks. This limitation could result in inefficient use of the requester's budget or workers becoming disinterested in the task.
   In this paper, we address these questions and present mechanisms using the approach of regret minimization in online learning. We exploit a link between procurement auctions and multi-armed bandits to design mechanisms that are budget feasible, achieve near-optimal utility for the requester, are incentive compatible (truthful) for workers and make minimal assumptions about the distribution of workers' true costs. Our main contribution is a novel, no-regret posted price mechanism, BP-UCB, for budgeted procurement in stochastic online settings. We prove strong theoretical guarantees about our mechanism, and extensively evaluate it in simulations as well as on real data from the Mechanical Turk platform. Compared to the state of the art, our approach leads to a 180% increase in utility.
A predictive model for advertiser value-per-click in sponsored search BIBAFull-Text 1179-1190
  Eric Sodomka; Sébastien Lahaie; Dustin Hillard
Sponsored search is a form of online advertising where advertisers bid for placement next to search engine results for specific keywords. As search engines compete for the growing share of online ad spend, it becomes important for them to understand what keywords advertisers value most, and what characteristics of keywords drive value. In this paper we propose an approach to keyword value prediction that draws on advertiser bidding behavior across the terms and campaigns in an account. We provide original insights into the structure of sponsored search accounts that motivate the use of a hierarchical modeling strategy. We propose an economically meaningful loss function which allows us to implicitly fit a linear model for values given observables such as bids and click-through rates. The model draws on demographic and textual features of keywords and takes advantage of the hierarchical structure of sponsored search accounts. Its predictive quality is evaluated on several high-revenue and high-exposure advertising accounts on a major search engine. Besides the general evaluation of advertiser welfare, our approach has potential applications to keyword and bid suggestion.
I know the shortened URLs you clicked on Twitter: inference attack using public click analytics and Twitter metadata BIBAFull-Text 1191-1200
  Jonghyuk Song; Sangho Lee; Jong Kim
Twitter is a popular social network service for sharing messages among friends. Because Twitter restricts the length of messages, many Twitter users use URL shortening services, such as bit.ly and goo.gl, to share long URLs with friends. Some URL shortening services also provide click analytics of the shortened URLs, including the number of clicks, countries, platforms, browsers and referrers. To protect visitors' privacy, they do not reveal identifying information about individual visitors. In this paper, we propose a practical attack technique that can infer who clicks what shortened URLs on Twitter. Unlike the conventional browser history stealing attacks, our attack methods only need publicly available information provided by URL shortening services and Twitter. Evaluation results show that our attack technique can compromise Twitter users' privacy with high accuracy.
Exploring and exploiting user search behavior on mobile and tablet devices to improve search relevance BIBAFull-Text 1201-1212
  Yang Song; Hao Ma; Hongning Wang; Kuansan Wang
In this paper, we present a log-based study on user search behavior comparisons on three different platforms: desktop, mobile and tablet. We use three-month search logs in 2012 from a commercial search engine for our study. Our objective is to better understand how and to what extent mobile and tablet searchers behave differently than desktop users. Our study spans a variety of aspects including query categorization, query length, search time distribution, search location distribution, user click patterns and so on. From our data set, we reveal that there are significant differences between user search patterns in these three platforms, and therefore use the same ranking system is not an optimal solution for all of them. Consequently, we propose a framework that leverages a set of domain-specific features, along with the training data from desktop search, to further improve the search relevance for mobile and tablet platforms. Experimental results demonstrate that by transferring knowledge from desktop search, search relevance on mobile and tablet can be greatly improved.
Evaluating and predicting user engagement change with degraded search relevance BIBAFull-Text 1213-1224
  Yang Song; Xiaolin Shi; Xin Fu
User engagement in search refers to the frequency for users (re-)using the search engine to accomplish their tasks. Among factors that affected users' visit frequency, relevance of search results is believed to play a pivotal role. While multiple work in the past has demonstrated the correlation between search success and user engagement based on longitudinal analysis, we examine this problem from a different perspective in this work. Specifically, we carefully designed a large-scale controlled experiment on users of a large commercial Web search engine, in which users were separated into control and treatment groups, where users in treatment group were presented with search results which are deliberate degraded in relevance. We studied users' responses to the relevance degradation through tracking several behavioral metrics (such as query per user, click per session) over an extended period of time both during and following the experiment. By quantifying the relationship between user engagement and search relevance, we observe significant differences between user's short-term search behavior and long-term engagement change. By leveraging some of the key findings from the experiment, we developed a machine learning model to predict the long term impact of relevance degradation on user engagement. Overall, our model achieves over 67% of accuracy in predicting user engagement drop. Besides, our model is also capable of predicting engagement change for low-frequency users with very few user signals. We believe that insights from this study can be leveraged by search engine companies to detect and intervene search relevance degradation and to prevent long term user engagement drop.
Data-Fu: a language and an interpreter for interaction with read/write linked data BIBAFull-Text 1225-1236
  Steffen Stadtmüller; Sebastian Speiser; Andreas Harth; Rudi Studer
An increasing amount of applications build their functionality on the utilisation and manipulation of web resources. Consequently REST gains popularity with a resource-centric interaction architecture that draws its flexibility from links between resources. Linked Data offers a uniform data model for REST with self-descriptive resources that can be leveraged to avoid a manual ad-hoc development of web-based applications. For declaratively specifying interactions between web resources we introduce Data-Fu, a lightweight declarative rule language with state transition systems as formal grounding. Data-Fu enables the development of data-driven applications that facilitate the RESTful manipulation of read/write Linked Data resources. Furthermore, we describe an interpreter for Data-Fu as a general purpose engine that allows to perform described interactions with web resources by orders of magnitude faster than a comparable Linked Data processor.
NIFTY: a system for large scale information flow tracking and clustering BIBAFull-Text 1237-1248
  Caroline Suen; Sandy Huang; Chantat Eksombatchai; Rok Sosic; Jure Leskovec
The real-time information on news sites, blogs and social networking sites changes dynamically and spreads rapidly through the Web. Developing methods for handling such information at a massive scale requires that we think about how information content varies over time, how it is transmitted, and how it mutates as it spreads.
   We describe the News Information Flow Tracking, Yay! (NIFTY) system for large scale real-time tracking of "memes" -- short textual phrases that travel and mutate through the Web. NIFTY is based on a novel highly-scalable incremental meme-clustering algorithm that efficiently extracts and identifies mutational variants of a single meme. NIFTY runs orders of magnitude faster than our previous Memetracker system, while also maintaining better consistency and quality of extracted memes.
   We demonstrate the effectiveness of our approach by processing a 20 terabyte dataset of 6.1 billion blog posts and news articles that we have been continuously collecting for the last four years. NIFTY extracted 2.9 billion unique textual phrases and identified more than 9 million memes. Our meme-tracking algorithm was able to process the entire dataset in less than five days using a single machine. Furthermore, we also provide a live deployment of the NIFTY system that allows users to explore the dynamics of online news in near real-time.
When relevance is not enough: promoting diversity and freshnessin personalized question recommendation BIBAFull-Text 1249-1260
  Idan Szpektor; Yoelle Maarek; Dan Pelleg
What makes a good question recommendation system for community question-answering sites? First, to maintain the health of the ecosystem, it needs to be designed around answerers, rather than exclusively for askers. Next, it needs to scale to many questions and users, and be fast enough to route a newly-posted question to potential answerers within the few minutes before the asker's patience runs out. It also needs to show each answerer questions that are relevant to his or her interests. We have designed and built such a system for Yahoo! Answers, but realized, when testing it with live users, that it was not enough.
   We found that those drawing-board requirements fail to capture user's interests. The feature that they really missed was diversity. In other words, showing them just the main topics they had previously expressed interest in was simply too dull. Adding the spice of topics slightly outside the core of their past activities significantly improved engagement. We conducted a large-scale online experiment in production in Yahoo! Answers that showed that recommendations driven by relevance alone perform worse than a control group without question recommendations, which is the current behavior. However, an algorithm promoting both diversity and freshness improved the number of answers by 17%, daily session length by 10%, and had a significant positive impact on peripheral activities such as voting.
Mining acronym expansions and their meanings using query click log BIBAFull-Text 1261-1272
  Bilyana Taneva; Tao Cheng; Kaushik Chakrabarti; Yeye He
Acronyms are abbreviations formed from the initial components of words or phrases. Acronym usage is becoming more common in web searches, email, text messages, tweets, blogs and posts. Acronyms are typically ambiguous and often disambiguated by context words. Given either just an acronym as a query or an acronym with a few context words, it is immensely useful for a search engine to know the most likely intended meanings, ranked by their likelihood. To support such online scenarios, we study the offline mining of acronyms and their meanings in this paper. For each acronym, our goal is to discover all distinct meanings and for each meaning, compute the expanded string, its popularity score and a set of context words that indicate this meaning. Existing approaches are inadequate for this purpose. Our main insight is to leverage "co-clicks" in search engine query click log to mine expansions of acronyms. There are several technical challenges such as ensuring 1:1 mapping between expansions and meanings, handling of "tail meanings" and extracting context words. We present a novel, end-to-end solution that addresses the above challenges. We further describe how web search engines can leverage the mined information for prediction of intended meaning for queries containing acronyms. Our experiments show that our approach (i) discovers the meanings of acronyms with high precision and recall, (ii) significantly complements existing meanings in Wikipedia and (iii) accurately predicts intended meaning for online queries with over 90% precision.
Groundhog day: near-duplicate detection on Twitter BIBAFull-Text 1273-1284
  Ke Tao; Fabian Abel; Claudia Hauff; Geert-Jan Houben; Ujwal Gadiraju
With more than 340~million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes the tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources which are referenced from microposts. Machine learning is exploited in order to learn patterns that help identifying duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered as duplicates and (4) optimize the process of identifying duplicate content on Twitter. Our results prove that semantic features which are extracted by our framework can boost the performance of detecting duplicates.
Uncovering locally characterizing regions within geotagged data BIBAFull-Text 1285-1296
  Bart Thomee; Adam Rae
We propose a novel algorithm for uncovering the colloquial boundaries of locally characterizing regions present in collections of labeled geospatial data. We address the problem by first modeling the data using scale-space theory, allowing us to represent it simultaneously across different scales as a family of increasingly smoothed density distributions. We then derive region boundaries by applying localized label weighting and image processing techniques to the scale-space representation of each label. Important insights into the data can be acquired by visualizing the shape and size of the resulting boundaries for each label at multiple scales. We demonstrate our technique operating at scale by discovering the boundaries of the most geospatially salient tags associated with a large collection of georeferenced photos from Flickr and compare our characterizing regions that emerge from the data with those produced by a recent technique from the research literature.
Spectral analysis of communication networks using Dirichlet eigenvalues BIBAFull-Text 1297-1306
  Alexander Tsiatas; Iraj Saniee; Onuttom Narayan; Matthew Andrews
Good clustering can provide critical insight into potential locations where congestion in a network may occur. A natural measure of congestion for a collection of nodes in a graph is its Cheeger ratio, defined as the ratio of the size of its boundary to its volume. Spectral methods provide effective means to estimate the smallest Cheeger ratio via the spectral gap of the graph Laplacian. Here, we compute the spectral gap of the truncated graph Laplacian, with the so-called Dirichlet boundary condition, for the graphs of a dozen communication networks at the IP-layer, which are subgraphs of the much larger global IP-layer network. We show that i) the Dirichlet spectral gap of these networks is substantially larger than the standard spectral gap and is therefore a better indicator of the true expansion properties of the graph, ii) unlike the standard spectral gap, the Dirichlet spectral gaps of progressively larger subgraphs converge to that of the global network, thus allowing properties of the global network to be efficiently obtained from them, and (iii) the (first two) eigenvectors of the Dirichlet graph Laplacian can be used for spectral clustering with arguably better results than standard spectral clustering. We first demonstrate these results analytically for finite regular trees. We then perform spectral clustering on the IP-layer networks using Dirichlet eigenvectors and show that it yields cuts near the network core, thus creating genuine single-component clusters. This is much better than traditional spectral clustering where several disjoint fragments near the network periphery are liable to be misleadingly classified as a single cluster. Since congestion in communication networks is known to peak at the core due to large-scale curvature and geometry, identification of core congestion and its localization are important steps in analysis and improved engineering of networks. Thus, spectral clustering with Dirichlet boundary condition is seen to be more effective at finding bona-fide bottlenecks and congestion than standard spectral clustering.
Subgraph frequencies: mapping the empirical and extremal geography of large graph collections BIBAFull-Text 1307-1318
  Johan Ugander; Lars Backstrom; Jon Kleinberg
A growing set of on-line applications are generating data that can be viewed as very large collections of small, dense social graphs -- these range from sets of social groups, events, or collaboration projects to the vast collection of graph neighborhoods in large social networks. A natural question is how to usefully define a domain-independent 'coordinate system' for such a collection of graphs, so that the set of possible structures can be compactly represented and understood within a common space. In this work, we draw on the theory of graph homomorphisms to formulate and analyze such a representation, based on computing the frequencies of small induced subgraphs within each graph. We find that the space of subgraph frequencies is governed both by its combinatorial properties -- based on extremal results that constrain all graphs -- as well as by its empirical properties -- manifested in the way that real social graphs appear to lie near a simple one-dimensional curve through this space.
   We develop flexible frameworks for studying each of these aspects. For capturing empirical properties, we characterize a simple stochastic generative model, a single-parameter extension of Erdos-Renyi random graphs, whose stationary distribution over subgraphs closely tracks the one-dimensional concentration of the real social graph families. For the extremal properties, we develop a tractable linear program for bounding the feasible space of subgraph frequencies by harnessing a toolkit of known extremal graph theory. Together, these two complementary frameworks shed light on a fundamental question pertaining to social graphs: what properties of social graphs are 'social' properties and what properties are 'graph' properties?
   We conclude with a brief demonstration of how the coordinate system we examine can also be used to perform classification tasks, distinguishing between structures arising from different types of social graphs.
The self-feeding process: a unifying model for communication dynamics in the web BIBAFull-Text 1319-1330
  Pedro Olmo S. Vaz de Melo; Christos Faloutsos; Renato Assunção; Antonio Loureiro
How often do individuals perform a given communication activity in the Web, such as posting comments on blogs or news? Could we have a generative model to create communication events with realistic inter-event time distributions (IEDs)? Which properties should we strive to match? Current literature has seemingly contradictory results for IED: some studies claim good fits with power laws; others with non-homogeneous Poisson processes. Given these two approaches, we ask: which is the correct one? Can we reconcile them all? We show here that, surprisingly, both approaches are correct, being corner cases of the proposed Self-Feeding Process (SFP). We show that the SFP (a) exhibits a unifying power, which generates power law tails (including the so-called "top-concavity" that real data exhibits), as well as short-term Poisson behavior; (b) avoids the "i.i.d. fallacy", which none of the prevailing models have studied before; and (c) is extremely parsimonious, requiring usually only one, and in general, at most two parameters. Experiments conducted on eight large, diverse real datasets (e.g., Youtube and blog comments, e-mails, SMSs, etc) reveal that the SFP mimics their properties very well.
Whom to mention: expand the diffusion of tweets by @ recommendation on micro-blogging systems BIBAFull-Text 1331-1340
  Beidou Wang; Can Wang; Jiajun Bu; Chun Chen; Wei Vivian Zhang; Deng Cai; Xiaofei He
Nowadays, micro-blogging systems like Twitter have become one of the most important ways for information sharing. In Twitter, a user posts a message (tweet) and the others can forward the message (retweet). Mention is a new feature in micro-blogging systems. By mentioning users in a tweet, they will receive notifications and their possible retweets may help to initiate large cascade diffusion of the tweet. To enhance a tweet's diffusion by finding the right persons to mention, we propose in this paper a novel recommendation scheme named as whom-to-mention. Specifically, we present an in-depth study of mention mechanism and propose a recommendation scheme to solve the essential question of whom to mention in a tweet. In this paper, whom-to-mention is formulated as a ranking problem and we try to address several new challenges which are not well studied in the traditional information retrieval tasks. By adopting features including user interest match, content-dependent user relationship and user influence, a machine learned ranking function is trained based on newly defined information diffusion based relevance. The extensive evaluation using data gathered from real users demonstrates the advantage of our proposed algorithm compared with the traditional recommendation methods.
Wisdom in the social crowd: an analysis of quora BIBAFull-Text 1341-1352
  Gang Wang; Konark Gill; Manish Mohanlal; Haitao Zheng; Ben Y. Zhao
Efforts such as Wikipedia have shown the ability of user communities to collect, organize and curate information on the Internet. Recently, a number of question and answer (Q&A) sites have successfully built large growing knowledge repositories, each driven by a wide range of questions and answers from its users community. While sites like Yahoo Answers have stalled and begun to shrink, one site still going strong is Quora, a rapidly growing service that augments a regular Q&A system with social links between users. Despite its success, however, little is known about what drives Quora's growth, and how it continues to connect visitors and experts to the right questions as it grows.
   In this paper, we present results of a detailed analysis of Quora using measurements. We shed light on the impact of three different connection networks (or graphs) inside Quora, a graph connecting topics to users, a social graph connecting users, and a graph connecting related questions. Our results show that heterogeneity in the user and question graphs are significant contributors to the quality of Quora's knowledge base. One drives the attention and activity of users, and the other directs them to a small set of popular and interesting questions.
Learning to extract cross-session search tasks BIBAFull-Text 1353-1364
  Hongning Wang; Yang Song; Ming-Wei Chang; Xiaodong He; Ryen W. White; Wei Chu
Search tasks, comprising a series of search queries serving the same information need, have recently been recognized as an accurate atomic unit for modeling user search intent. Most prior research in this area has focused on short-term search tasks within a single search session, and heavily depend on human annotations for supervised classification model learning. In this work, we target the identification of long-term, or cross-session, search tasks (transcending session boundaries) by investigating inter-query dependencies learned from users' searching behaviors. A semi-supervised clustering model is proposed based on the latent structural SVM framework, and a set of effective automatic annotation rules are proposed as weak supervision to release the burden of manual annotation. Experimental results based on a large-scale search log collected from Bing.com confirms the effectiveness of the proposed model in identifying cross-session search tasks and the utility of the introduced weak supervision signals. Our learned model enables a more comprehensive understanding of users' search behaviors via search logs and facilitates the development of dedicated search-engine support for long-term tasks.
Content-aware click modeling BIBAFull-Text 1365-1376
  Hongning Wang; ChengXiang Zhai; Anlei Dong; Yi Chang
Click models aim at extracting intrinsic relevance of documents to queries from biased user clicks. One basic modeling assumption made in existing work is to treat such intrinsic relevance as an atomic query-document-specific parameter, which is solely estimated from historical clicks without using any content information about a document or relationship among the clicked/skipped documents under the same query. Due to this overly simplified assumption, existing click models can neither fully explore the information about a document's relevance quality nor make predictions of relevance for any unseen documents.
   In this work, we proposed a novel Bayesian Sequential State model for modeling the user click behaviors, where the document content and dependencies among the sequential click events within a query are characterized by a set of descriptive features via a probabilistic graphical model. By applying the posterior regularized Expectation Maximization algorithm for parameter learning, we tailor the model to meet specific ranking-oriented properties, e.g., pairwise click preferences, so as to exploit richer information buried in the user clicks. Experiment results on a large set of real click logs demonstrate the effectiveness of the proposed model compared with several state-of-the-art click models.
Is it time for a career switch? BIBAFull-Text 1377-1388
  Jian Wang; Yi Zhang; Christian Posse; Anmol Bhasin
Tenure is a critical factor for an individual to consider when making a job transition. For instance, software engineers make a job transition to senior software engineers in a span of 2 years on average, or it takes for approximately 3 years for realtors to switch to brokers. While most existing work on recommender systems focuses on finding what to recommend to a user, this paper places emphasis on when to make appropriate recommendations and its impact on the item selection in the context of a job recommender system. The approach we propose, however, is general and can be applied to any recommendation scenario where the decision-making process is dependent on the tenure (i.e., the time interval) between successive decisions.
   Our approach is inspired by the proportional hazards model in statistics. It models the tenure between two successive decisions and related factors. We further extend the model with a hierarchical Bayesian framework to address the problem of data sparsity. The proposed model estimates the likelihood of a user's decision to make a job transition at a certain time, which is denoted as the tenure-based decision probability. New and appropriate evaluation metrics are designed to analyze the model's performance on deciding when is the right time to recommend a job to a user. We validate the soundness of our approach by evaluating it with an anonymous job application dataset across 140+ industries on LinkedIn. Experimental results show that the hierarchical proportional hazards model has better predictability of the user's decision time, which in turn helps the recommender system to achieve higher utility/user satisfaction.
Google+Ripples: a native visualization of information flow BIBAFull-Text 1389-1398
  Fernanda Viégas; Martin Wattenberg; Jack Hebert; Geoffrey Borggaard; Alison Cichowlas; Jonathan Feinberg; Jon Orwant; Christopher Wren
G+ Ripples is a visualization of information flow that shows users how public posts are shared on Google+. Unlike other social network visualizations, Ripples exists as a "native" visualization: it is directly accessible from public posts on Google+. This unique position leads to both new constraints and new possibilities for design. We describe the visualization technique, which is a new mix of node-and-link and circular treemap metaphors. We then describe user reactions as well as some of the patterns of sharing that are made evident by the Ripples visualization.
From cookies to cooks: insights on dietary patterns via analysis of web usage logs BIBAFull-Text 1399-1410
  Robert West; Ryen W. White; Eric Horvitz
Nutrition is a key factor in people's overall health. Hence, understanding the nature and dynamics of population-wide dietary preferences over time and space can be valuable in public health. To date, studies have leveraged small samples of participants via food intake logs or treatment data. We propose a complementary source of population data on nutrition obtained via Web logs. Our main contribution is a spatiotemporal analysis of population-wide dietary preferences through the lens of logs gathered by a widely distributed Web-browser add-on, using the access volume of recipes that users seek via search as a proxy for actual food consumption. We discover that variation in dietary preferences as expressed via recipe access has two main periodic components, one yearly and the other weekly, and that there exist characteristic regional differences in terms of diet within the United States. In a second study, we identify users who show evidence of having made an acute decision to lose weight. We characterize the shifts in interests that they express in their search queries and focus on changes in their recipe queries in particular. Last, we correlate nutritional time series obtained from recipe queries with time-aligned data on hospital admissions, aimed at understanding how behavioral data captured in Web logs might be harnessed to identify potential relationships between diet and acute health problems. In this preliminary study, we focus on patterns of sodium identified in recipes over time and patterns of admission for congestive heart failure, a chronic illness that can be exacerbated by increases in sodium intake.
Enhancing personalized search by mining and modeling task behavior BIBAFull-Text 1411-1420
  Ryen W. White; Wei Chu; Ahmed Hassan; Xiaodong He; Yang Song; Hongning Wang
Personalized search systems tailor search results to the current user intent using historic search interactions. This relies on being able to find pertinent information in that user's search history, which can be challenging for unseen queries and for new search scenarios. Building richer models of users' current and historic search tasks can help improve the likelihood of finding relevant content and enhance the relevance and coverage of personalization methods. The task-based approach can be applied to the current user's search history, or as we focus on here, all users' search histories as so-called "groupization" (a variant of personalization whereby other users' profiles can be used to personalize the search experience). We describe a method whereby we mine historic search-engine logs to find other users performing similar tasks to the current user and leverage their on-task behavior to identify Web pages to promote in the current ranking. We investigate the effectiveness of this approach versus query-based matching and finding related historic activity from the current user (i.e., group versus individual). As part of our studies we also explore the use of the on-task behavior of particular user cohorts, such as people who are expert in the topic currently being searched, rather than all other users. Our approach yields promising gains in retrieval performance, and has direct implications for improving personalization in search systems.
Inferring dependency constraints on parameters for web services BIBAFull-Text 1421-1432
  Qian Wu; Ling Wu; Guangtai Liang; Qianxiang Wang; Tao Xie; Hong Mei
Recently many popular websites such as Twitter and Flickr expose their data through web service APIs, enabling third-party organizations to develop client applications that provide functionalities beyond what the original websites offer. These client applications should follow certain constraints in order to correctly interact with the web services. One common type of such constraints is Dependency Constraints on Parameters. Given a web service operation O and its parameters Pi, Pj, these constraints describe the requirement on one parameter Pi that is dependent on the conditions of some other parameter(s) Pj. For example, when requesting the Twitter operation "GET statuses/user_timeline", a user_id parameter must be provided if a screen_name parameter is not provided. Violations of such constraints can cause fatal errors or incorrect results in the client applications. However, these constraints are often not formally specified and thus not available for automatic verification of client applications. To address this issue, we propose a novel approach, called INDICATOR, to automatically infer dependency constraints on parameters for web services, via a hybrid analysis of heterogeneous web service artifacts, including the service documentation, the service SDKs, and the web services themselves. To evaluate our approach, we applied INDICATOR to infer dependency constraints for four popular web services. The results showed that INDICATOR effectively infers constraints with an average precision of 94.4% and recall of 95.5%.
Predicting advertiser bidding behaviors in sponsored search by rationality modeling BIBAFull-Text 1433-1444
  Haifeng Xu; Bin Gao; Diyi Yang; Tie-Yan Liu
We study how an advertiser changes his/her bid prices in sponsored search, by modeling his/her rationality. Predicting the bid changes of advertisers with respect to their campaign performances is a key capability of search engines, since it can be used to improve the offline evaluation of new advertising technologies and the forecast of future revenue of the search engine. Previous work on advertiser behavior modeling heavily relies on the assumption of perfect advertiser rationality; however, in most cases, this assumption does not hold in practice. Advertisers may be unwilling, incapable, and/or constrained to achieve their best response. In this paper, we explicitly model these limitations in the rationality of advertisers, and build a probabilistic advertiser behavior model from the perspective of a search engine. We then use the expected payoff to define the objective function for an advertiser to optimize given his/her limited rationality. By solving the optimization problem with Monte Carlo, we get a prediction of mixed bid strategy for each advertiser in the next period of time. We examine the effectiveness of our model both directly using real historical bids and indirectly using revenue prediction and click number prediction. Our experimental results based on the sponsored search logs from a commercial search engine show that the proposed model can provide a more accurate prediction of advertiser bid behaviors than several baseline methods.
A biterm topic model for short texts BIBAFull-Text 1445-1456
  Xiaohui Yan; Jiafeng Guo; Yanyan Lan; Xueqi Cheng
Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and PLSA) on such short texts may not work well. The fundamental reason lies in that conventional topic models implicitly capture the document-level word co-occurrence patterns to reveal topics, and thus suffer from the severe data sparsity in short documents. In this paper, we propose a novel way for modeling topics in short texts, referred as biterm topic model (BTM). Specifically, in BTM we learn the topics by directly modeling the generation of word co-occurrence patterns (i.e. biterms) in the whole corpus. The major advantages of BTM are that 1) BTM explicitly models the word co-occurrence patterns to enhance the topic learning; and 2) BTM uses the aggregated patterns in the whole corpus for learning topics to solve the problem of sparse word co-occurrence patterns at document-level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more prominent and coherent topics, and significantly outperform baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model.
Unified entity search in social media community BIBAFull-Text 1457-1466
  Ting Yao; Yuan Liu; Chong-Wah Ngo; Tao Mei
The search for entities is the most common search behavior on the Web, especially in social media communities where entities (such as images, videos, people, locations, and tags) are highly heterogeneous and correlated. While previous research usually deals with these social media entities separately, we are investigating in this paper a unified, multi-level, and correlative entity graph to represent the unstructured social media data, through which various applications (e.g., friend suggestion, personalized image search, image tagging, etc.) can be realized more effectively in one single framework. We regard the social media objects equally as "entities" and all of these applications as "entity search" problem which searches for entities with different types. We first construct a multi-level graph which organizes the heterogeneous entities into multiple levels, with one type of entities as vertices in each level. The edges between graphs pairwisely connect the entities weighted by intra-relations in the same level and inter-links across two different levels distilled from the social behaviors (e.g., tagging, commenting, and joining communities). To infer the strength of intra-relations, we propose a circular propagation scheme, which reinforces the mutual exchange of information across different entity types in a cyclic manner. Based on the constructed unified graph, we explicitly formulate entity search as a global optimization problem in a unified Bayesian framework, in which various applications are efficiently realized. Empirically, we validate the effectiveness of our unified entity graph for various social media applications on million-scale real-world dataset.
MATRI: a multi-aspect and transitive trust inference model BIBAFull-Text 1467-1476
  Yuan Yao; Hanghang Tong; Xifeng Yan; Feng Xu; Jian Lu
Trust inference, which is the mechanism to build new pair-wise trustworthiness relationship based on the existing ones, is a fundamental integral part in many real applications, e.g., e-commerce, social networks, peer-to-peer networks, etc. State-of-the-art trust inference approaches mainly employ the transitivity property of trust by propagating trust along connected users (a.k.a. trust propagation), but largely ignore other important properties, e.g., prior knowledge, multi-aspect, etc.
   In this paper, we propose a multi-aspect trust inference model by exploring an equally important property of trust, i.e., the multi-aspect property. The heart of our method is to view the problem as a recommendation problem, and hence opens the door to the rich methodologies in the field of collaborative filtering. The proposed multi-aspect model directly characterizes multiple latent factors for each trustor and trustee from the locally-generated trust relationships. Moreover, we extend this model to incorporate the prior knowledge as well as trust propagation to further improve inference accuracy. We conduct extensive experimental evaluations on real data sets, which demonstrate that our method achieves significant improvement over several existing benchmark approaches. Overall, the proposed method (MaTrI) leads to 26.7% -- 40.7% improvement over its best known competitors in prediction accuracy; and up to 7 orders of magnitude speedup with linear scalability.
Predicting positive and negative links in signed social networks by transfer learning BIBAFull-Text 1477-1488
  Jihang Ye; Hong Cheng; Zhe Zhu; Minghua Chen
Different from a large body of research on social networks that has focused almost exclusively on positive relationships, we study signed social networks with both positive and negative links. Specifically, we focus on how to reliably and effectively predict the signs of links in a newly formed signed social network (called a target network). Since usually only a very small amount of edge sign information is available in such newly formed networks, this small quantity is not adequate to train a good classifier. To address this challenge, we need assistance from an existing, mature signed network (called a source network) which has abundant edge sign information. We adopt the transfer learning approach to leverage the edge sign information from the source network, which may have a different yet related joint distribution of the edge instances and their class labels.
   As there is no predefined feature vector for the edge instances in a signed network, we construct generalizable features that can transfer the topological knowledge from the source network to the target. With the extracted features, we adopt an AdaBoost-like transfer learning algorithm with instance weighting to utilize more useful training instances in the source network for model learning. Experimental results on three real large signed social networks demonstrate that our transfer learning algorithm can improve the prediction accuracy by 40% over baseline methods.
Sparse online topic models BIBAFull-Text 1489-1500
  Aonan Zhang; Jun Zhu; Bo Zhang
Topic models have shown great promise in discovering latent semantic structures from complex data corpora, ranging from text documents and web news articles to images, videos, and even biological data. In order to deal with massive data collections and dynamic text streams, probabilistic online topic models such as online latent Dirichlet allocation (OLDA) have recently been developed. However, due to normalization constraints, OLDA can be ineffective in controlling the sparsity of discovered representations, a desirable property for learning interpretable semantic patterns, especially when the total number of topics is large. In contrast, sparse topical coding (STC) has been successfully introduced as a non-probabilistic topic model for effectively discovering sparse latent patterns by using sparsity-inducing regularization. But, unfortunately STC cannot scale to very large datasets or deal with online text streams, partly due to its batch learning procedure. In this paper, we present a sparse online topic model, which directly controls the sparsity of latent semantic patterns by imposing sparsity-inducing regularization and learns the topical dictionary by an online algorithm. The online algorithm is efficient and guaranteed to converge. Extensive empirical results of the sparse online topic model as well as its collapsed and supervised extensions on a large-scale Wikipedia dataset and the medium-sized 20Newsgroups dataset demonstrate appealing performance.
TopRec: domain-specific recommendation through community topic mining in social network BIBAFull-Text 1501-1510
  Xi Zhang; Jian Cheng; Ting Yuan; Biao Niu; Hanqing Lu
Traditionally, Collaborative Filtering assumes that similar users have similar responses to similar items. However, human activities exhibit heterogenous features across multiple domains such that users own similar tastes in one domain may behave quite differently in other domains. Moreover, highly sparse data presents crucial challenge in preference prediction. Intuitively, if users' interested domains are captured first, the recommender system is more likely to provide the enjoyed items while filter out those uninterested ones. Therefore, it is necessary to learn preference profiles from the correlated domains instead of the entire user-item matrix. In this paper, we propose a unified framework, TopRec, which detects topical communities to construct interpretable domains for domain-specific collaborative filtering. In order to mine communities as well as the corresponding topics, a semi-supervised probabilistic topic model is utilized by integrating user guidance with social network. Experimental results on real-world data from Epinions and Ciao demonstrate the effectiveness of the proposed framework.
Localized matrix factorization for recommendation based on matrix block diagonal forms BIBAFull-Text 1511-1520
  Yongfeng Zhang; Min Zhang; Yiqun Liu; Shaoping Ma; Shi Feng
Matrix factorization on user-item rating matrices has achieved significant success in collaborative filtering based recommendation tasks. However, it also encounters the problems of data sparsity and scalability when applied in real-world recommender systems. In this paper, we present the Localized Matrix Factorization (LMF) framework, which attempts to meet the challenges of sparsity and scalability by factorizing Block Diagonal Form (BDF) matrices. In the LMF framework, a large sparse matrix is first transformed into Recursive Bordered Block Diagonal Form (RBBDF), which is an intuitionally interpretable structure for user-item rating matrices. Smaller and denser submatrices are then extracted from this RBBDF matrix to construct a BDF matrix for more effective collaborative prediction. We show formally that the LMF framework is suitable for matrix factorization and that any decomposable matrix factorization algorithm can be integrated into this framework. It has the potential to improve prediction accuracy by factorizing smaller and denser submatrices independently, which is also suitable for parallelization and contributes to system scalability at the same time. Experimental results based on a number of real-world public-access benchmarks show the effectiveness and efficiency of the proposed LMF framework.
Predicting purchase behaviors from social media BIBAFull-Text 1521-1532
  Yongzheng Zhang; Marco Pennacchiotti
In the era of social commerce, users often connect from e-commerce websites to social networking venues such as Facebook and Twitter. However, there have been few efforts on understanding the correlations between users' social media profiles and their e-commerce behaviors. This paper presents a system for predicting a user's purchase behaviors on e-commerce websites from the user's social media profile. We specifically aim at understanding if the user's profile information in a social network (for example Facebook) can be leveraged to predict what categories of products the user will buy from (for example eBay Electronics). The paper provides an extensive analysis on how users' Facebook profile information correlates to purchases on eBay, and analyzes the performance of different feature sets and learning algorithms on the task of purchase behavior prediction.
Anatomy of a web-scale resale market: a data mining approach BIBAFull-Text 1533-1544
  Yuchen Zhao; Neel Sundaresan; Zeqian Shen; Philip S. Yu
Reuse and remarketing of content and products is an integral part of the internet. As E-commerce has grown, online resale and secondary markets form a significant part of the commerce space. The intentions and methods for reselling are diverse. In this paper, we study an instance of such markets that affords interesting data at large scale for mining purposes to understand the properties and patterns of this online market. As part of knowledge discovery of such a market, we first formally propose criteria to reveal unseen resale behaviors by elastic matching identification (EMI) based on the account transfer and item similarity properties of transactions. Then, we present a large-scale system that leverages MapReduce paradigm to mine millions of online resale activities from petabyte scale heterogeneous e-commerce data. With the collected data, we show that the number of resale activities leads to a power law distribution with a 'long tail', where a significant share of users only resell in very low numbers and a large portion of resales come from a small number of highly active resellers. We further conduct a comprehensive empirical study from different aspects of resales, including the temporal, spatial patterns, user demographics, reputation and the content of sale postings. Based on these observations, we explore the features related to "successful" resale transactions and evaluate if they can be predictable. We also discuss uses of this information mining for business insights and user experience on a real-world online marketplace.
Questions about questions: an empirical analysis of information needs on Twitter BIBAFull-Text 1545-1556
  Zhe Zhao; Qiaozhu Mei
Conventional studies of online information seeking behavior usually focus on the use of search engines or question answering (Q&A) websites. Recently, the fast growth of online social platforms such as Twitter and Facebook has made it possible for people to utilize them for information seeking by asking questions to their friends or followers. We anticipate a better understanding of Web users' information needs by investigating research questions about these questions. How are they distinctive from daily tweeted conversations? How are they related to search queries? Can users' information needs on one platform predict those on the other?
   In this study, we take the initiative to extract and analyze information needs from billions of online conversations collected from Twitter. With an automatic text classifier, we can accurately detect real questions in tweets (i.e., tweets conveying real information needs). We then present a comprehensive analysis of the large-scale collection of information needs we extracted. We found that questions being asked on Twitter are substantially different from the topics being tweeted in general. Information needs detected on Twitter have a considerable power of predicting the trends of Google queries. Many interesting signals emerge through longitudinal analysis of the volume, spikes, and entropy of questions on Twitter, which provide insights to the understanding of the impact of real world events and user behavioral patterns in social platforms.
Which vertical search engines are relevant? BIBAFull-Text 1557-1568
  Ke Zhou; Ronan Cummins; Mounia Lalmas; Joemon M. Jose
Aggregating search results from a variety of heterogeneous sources, so-called verticals, such as news, image and video, into a single interface is a popular paradigm in web search. Current approaches that evaluate the effectiveness of aggregated search systems are based on rewarding systems that return highly relevant verticals for a given query, where this relevance is assessed under different assumptions. It is difficult to evaluate or compare those systems without fully understanding the relationship between those underlying assumptions. To address this, we present a formal analysis and a set of extensive user studies to investigate the effects of various assumptions made for assessing query vertical relevance. A total of more than 20,000 assessments on 44 search tasks across 11 verticals are collected through Amazon Mechanical Turk and subsequently analysed. Our results provide insights into various aspects of query vertical relevance and allow us to explain in more depth as well as questioning the evaluation results published in the literature.
Making the most of your triple store: query answering in OWL 2 using an RL reasoner BIBAFull-Text 1569-1580
  Yujiao Zhou; Bernardo Cuenca Grau; Ian Horrocks; Zhe Wu; Jay Banerjee
Triple stores implementing the RL profile of OWL 2 are becoming increasingly popular. In contrast to unrestricted OWL 2, the RL profile is known to enjoy favourable computational properties for query answering, and state-of-the-art RL reasoners such as OWLim and Oracle's native inference engine of Oracle Spatial and Graph have proved extremely successful in industry-scale applications. The expressive restrictions imposed by OWL 2 RL may, however, be problematical for some applications. In this paper, we propose novel techniques that allow us (in many cases) to compute exact query answers using an off-the-shelf RL reasoner, even when the ontology is outside the RL profile. Furthermore, in the cases where exact query answers cannot be computed, we can still compute both lower and upper bounds on the exact answers. These bounds allow us to estimate the degree of incompleteness of the RL reasoner on the given query, and to optimise the computation of exact answers using a fully-fledged OWL 2 reasoner. A preliminary evaluation using the RDF Semantic Graph feature in Oracle Database has shown very promising results with respect to both scalability and tightness of the bounds.
Security implications of password discretization for click-based graphical passwords BIBAFull-Text 1581-1591
  Bin B. Zhu; Dongchen Wei; Maowei Yang; Jeff Yan
Discretization is a standard technique used in click-based graphical passwords for tolerating input variance so that approximately correct passwords are accepted by the system. In this paper, we show for the first time that two representative discretization schemes leak a significant amount of password information, undermining the security of such graphical passwords. We exploit such information leakage for successful dictionary attacks on Persuasive Cued Click Points (PCCP), which is to date the most secure click-based graphical password scheme and was considered to be resistant to such attacks. In our experiments, our purely automated attack successfully guessed 69.2% of the passwords when Centered Discretization was used to implement PCCP, and 39.4% of the passwords when Robust Discretization was used. Each attack dictionary we used was of approximately 235 entries, whereas the full password space was of 243 entries. For Centered Discretization, our attack still successfully guessed 50% of the passwords when the dictionary size was reduced to approximately 230 entries. Our attack is also applicable to common implementations of other click-based graphical password systems such as PassPoints and Cued Click Points -- both have been extensively studied in the research communities.