HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 04-205-105-2060708091011-111-212-112-213-113-214-114-215-115-2

Proceedings of the 2012 International Conference on the World Wide Web

Fullname:Proceedings of the 21st International Conference on World Wide Web
Editors:Alain Mille; Fabien Gandon; Jacques Misselis; Michael Rabinovich; Steffen Staab
Location:Lyon, France
Dates:2012-Apr-16 to 2012-Apr-20
Volume:1
Publisher:ACM
Standard No:ISBN 1-4503-1229-2, 978-1-4503-1229-5; ACM DL: Table of Contents hcibib: WWW12-1
Papers:107
Pages:1078
Links:Conference Home Page
Summary:It is our great pleasure to present you the technical programme of WWW 2012.
    The call for papers attracted a breath-taking number of 885 full paper submissions. 521 PC members and 235 supporting reviewers evaluated these papers. Based on 3247 reviews and meta-reviews as well as many discussions held online, the 27 track chairs and deputy chairs met with us for two full days in order to discuss and decide upon final acceptance of full papers. In the end, 108 papers -- only 12% of the submissions -- could be accommodated in the technical programme.
    These papers represent a well-balanced mix over the 13 tracks listed in the call for papers. Though the different tracks received very different numbers of submissions, our principal selection criterion was quality; hence, we neither enforced a proportional share of acceptances among the tracks nor did we favor tracks with lower numbers of submissions.
  1. WWW 2012-04-16 Volume 1
    1. Recommender systems
    2. Mobile Web Performance
    3. Security and fraud in social networks
    4. Advertising on the web 1
    5. Semantic web approaches in search
    6. Web performance
    7. Fraud and bias in user ratings
    8. Mobile and file-sharing users
    9. Behavioral analysis and content characterization in social media
    10. Creating and using links between data objects
    11. Security 1
    12. Community detection in social networks
    13. Advertising on the web 2
    14. Search
    15. Obtaining and leveraging user comments
    16. Entity linking
    17. Search: evaluation and ranking
    18. Information diffusion in social networks
    19. Game theory and the web
    20. Leveraging user actions in search
    21. Web user behavioral analysis and modeling
    22. Ontology representation and querying: RDF and SPARQL
    23. Security 2
    24. Social interactions and the web
    25. Entity and taxonomy extraction
    26. Leveraging user-generated content
    27. Data and content management 1
    28. Web engineering 1
    29. Collaboration in social networks
    30. Information extraction
    31. Web mining
    32. Data and content management 2
    33. Web engineering 2
    34. Crowdsourcing
    35. Social networks
    36. User interfaces and human factors

WWW 2012-04-16 Volume 1

Recommender systems

Build your own music recommender by modeling internet radio streams BIBAFull-Text 1-10
  Natalie Aizenberg; Yehuda Koren; Oren Somekh
In the Internet music scene, where recommendation technology is key for navigating huge collections, large market players enjoy a considerable advantage. Accessing a wider pool of user feedback leads to an increasingly more accurate analysis of user tastes, effectively creating a "rich get richer" effect. This work aims at significantly lowering the entry barrier for creating music recommenders, through a paradigm coupling a public data source and a new collaborative filtering (CF) model. We claim that Internet radio stations form a readily available resource of abundant fresh human signals on music through their playlists, which are essentially cohesive sets of related tracks. In a way, our models rely on the knowledge of a diverse group of experts in lieu of the commonly used wisdom of crowds. Over several weeks, we aggregated publicly available playlists of thousands of Internet radio stations, resulting in a dataset encompassing millions of plays, and hundreds of thousands of tracks and artists. This provides the large scale ground data necessary to mitigate the cold start problem of new items at both mature and emerging services.
   Furthermore, we developed a new probabilistic CF model, tailored to the Internet radio resource. The success of the model was empirically validated on the collected dataset. Moreover, we tested the model at a cross-source transfer learning manner -- the same model trained on the Internet radio data was used to predict behavior of Yahoo! Music users. This demonstrates the ability to tap the Internet radio signals in other music recommendation setups. Based on encouraging empirical results, our hope is that the proposed paradigm will make quality music recommendation accessible to all interested parties in the community.
Using control theory for stable and efficient recommender systems BIBAFull-Text 11-20
  Tamas Jambor; Jun Wang; Neal Lathia
The aim of a web-based recommender system is to provide highly accurate and up-to-date recommendations to its users; in practice, it will hope to retain its users over time. However, this raises unique challenges. To achieve complex goals such as keeping the recommender model up-to-date over time, we need to consider a number of external requirements. Generally, these requirements arise from the physical nature of the system, for instance the available computational resources. Ideally, we would like to design a system that does not deviate from the required outcome. Modeling such a system over time requires to describe the internal dynamics as a combination of the underlying recommender model and the its users' behavior. We propose to solve this problem by applying the principles of modern control theory -- a powerful set of tools to deal with dynamical systems -- to construct and maintain a stable and robust recommender system for dynamically evolving environments. In particular, we introduce a design principle by focusing on the dynamic relationship between the recommender system's performance and the number of new training samples the system requires. This enables us to automate the control other external factors such as the system's update frequency. We show that, by using a Proportional-Integral-Derivative controller, a recommender system is able to automatically and accurately estimate the required input to keep the output close to a pre-defined requirements. Our experiments on a standard rating dataset show that, by using a feedback loop between system performance and training, the trade-off between the effectiveness and efficiency of the system can be well maintained. We close by discussing the widespread applicability of our approach to a variety of scenarios that recommender systems face.
An exploration of improving collaborative recommender systems via user-item subgroups BIBAFull-Text 21-30
  Bin Xu; Jiajun Bu; Chun Chen; Deng Cai
Collaborative filtering (CF) is one of the most successful recommendation approaches. It typically associates a user with a group of like-minded users based on their preferences over all the items, and recommends to the user those items enjoyed by others in the group. However we find that two users with similar tastes on one item subset may have totally different tastes on another set. In other words, there exist many user-item subgroups each consisting of a subset of items and a group of like-minded users on these items. It is more natural to make preference predictions for a user via the correlated subgroups than the entire user-item matrix. In this paper, to find meaningful subgroups, we formulate the Multiclass Co-Clustering (MCoC) problem and propose an effective solution to it. Then we propose an unified framework to extend the traditional CF algorithms by utilizing the subgroups information for improving their top-N recommendation performance. Our approach can be seen as an extension of traditional clustering CF models. Systematic experiments on three real world data sets have demonstrated the effectiveness of our proposed approach.

Mobile Web Performance

How far can client-only solutions go for mobile browser speed? BIBAFull-Text 31-40
  Zhen Wang; Felix Xiaozhu Lin; Lin Zhong; Mansoor Chishtie
Mobile browser is known to be slow because of the bottleneck in resource loading. Client-only solutions to improve resource loading are attractive because they are immediately deployable, scalable, and secure. We present the first publicly known treatment of client-only solutions to understand how much they can improve mobile browser speed without infrastructure support. Leveraging an unprecedented set of web usage data collected from 24 iPhone users continuously over one year, we examine the three fundamental, orthogonal approaches a client-only solution can take: caching, prefetching, and speculative loading. Speculative loading, as is firstly proposed and studied in this work, predicts and speculatively loads the subresources needed to open a webpage once its URL is given. We show that while caching and prefetching are highly limited for mobile browsing, speculative loading can be significantly more effective. Empirically, we show that client-only solutions can improve the browser speed by about 1.4 second on average for websites visited by the 24 iPhone users. We also report the design, realization, and evaluation of speculative loading in a WebKit-based browser called Tempo. On average, Tempo can reduce browser delay by 1 second (~20%).
Who killed my battery?: analyzing mobile browser energy consumption BIBAFull-Text 41-50
  Narendran Thiagarajan; Gaurav Aggarwal; Angela Nicoara; Dan Boneh; Jatinder Pal Singh
Despite the growing popularity of mobile web browsing, the energy consumed by a phone browser while surfing the web is poorly understood. We present an infrastructure for measuring the precise energy used by a mobile browser to render web pages. We then measure the energy needed to render financial, e-commerce, email, blogging, news and social networking sites. Our tools are sufficiently precise to measure the energy needed to render individual web elements, such as cascade style sheets (CSS), Javascript, images, and plug-in objects. Our results show that for popular sites, downloading and parsing cascade style sheets and Javascript consumes a significant fraction of the total energy needed to render the page. Using the data we collected we make concrete recommendations on how to design web pages so as to minimize the energy needed to render the page. As an example, by modifying scripts on the Wikipedia mobile site we reduced by 30% the energy needed to download and render Wikipedia pages with no change to the user experience. We conclude by estimating the point at which offloading browser computations to a remote proxy can save energy on the phone.
Periodic transfers in mobile applications: network-wide origin, impact, and optimization BIBAFull-Text 51-60
  Feng Qian; Zhaoguang Wang; Yudong Gao; Junxian Huang; Alexandre Gerber; Zhuoqing Mao; Subhabrata Sen; Oliver Spatscheck
Cellular networks employ a specific radio resource management policy distinguishing them from wired and Wi-Fi networks. A lack of awareness of this important mechanism potentially leads to resource-inefficient mobile applications. We perform the first network-wide, large-scale investigation of a particular type of application traffic pattern called periodic transfers where a handset periodically exchanges some data with a remote server every t seconds. Using packet traces containing 1.5 billion packets collected from a commercial cellular carrier, we found that periodic transfers are very prevalent in today's smartphone traffic. However, they are extremely resource-inefficient for both the network and end-user devices even though they predominantly generate very little traffic. This somewhat counter-intuitive behavior is a direct consequence of the adverse interaction between such periodic transfer patterns and the cellular network radio resource management policy. For example, for popular smartphone applications such as Facebook, periodic transfers account for only 1.7% of the overall traffic volume but contribute to 30% of the total handset radio energy consumption. We found periodic transfers are generated for various reasons such as keep-alive, polling, and user behavior measurements. We further investigate the potential of various traffic shaping and resource control algorithms. Depending on their traffic patterns, applications exhibit disparate responses to optimization strategies. Jointly using several strategies with moderate aggressiveness can eliminate almost all energy impact of periodic transfers for popular applications such as Facebook and Pandora.

Security and fraud in social networks

Understanding and combating link farming in the twitter social network BIBAFull-Text 61-70
  Saptarshi Ghosh; Bimal Viswanath; Farshad Kooti; Naveen Kumar Sharma; Gautam Korlam; Fabricio Benevenuto; Niloy Ganguly; Krishna Phani Gummadi
Recently, Twitter has emerged as a popular platform for discovering real-time information on the Web, such as news stories and people's reaction to them. Like the Web, Twitter has become a target for link farming, where users, especially spammers, try to acquire large numbers of follower links in the social network. Acquiring followers not only increases the size of a user's direct audience, but also contributes to the perceived influence of the user, which in turn impacts the ranking of the user's tweets by search engines.
   In this paper, we first investigate link farming in the Twitter network and then explore mechanisms to discourage the activity. To this end, we conducted a detailed analysis of links acquired by over 40,000 spammer accounts suspended by Twitter. We find that link farming is wide spread and that a majority of spammers' links are farmed from a small fraction of Twitter users, the social capitalists, who are themselves seeking to amass social capital and links by following back anyone who follows them. Our findings shed light on the social dynamics that are at the root of the link farming problem in Twitter network and they have important implications for future designs of link spam defenses. In particular, we show that a simple user ranking scheme that penalizes users for connecting to spammers can effectively address the problem by disincentivizing users from linking with other users simply to gain influence.
Analyzing spammers' social networks for fun and profit: a case study of cyber criminal ecosystem on twitter BIBAFull-Text 71-80
  Chao Yang; Robert Harkreader; Jialong Zhang; Seungwon Shin; Guofei Gu
In this paper, we perform an empirical analysis of the cyber criminal ecosystem on Twitter. Essentially, through analyzing inner social relationships in the criminal account community, we find that criminal accounts tend to be socially connected, forming a small-world network. We also find that criminal hubs, sitting in the center of the social graph, are more inclined to follow criminal accounts. Through analyzing outer social relationships between criminal accounts and their social friends outside the criminal account community, we reveal three categories of accounts that have close friendships with criminal accounts. Through these analyses, we provide a novel and effective criminal account inference algorithm by exploiting criminal accounts' social relationships and semantic coordinations.
Branded with a scarlet "C": cheaters in a gaming social network BIBAFull-Text 81-90
  Jeremy Blackburn; Ramanuja Simha; Nicolas Kourtellis; Xiang Zuo; Matei Ripeanu; John Skvoretz; Adriana Iamnitchi
Online gaming is a multi-billion dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and the fun: cheating. The costs of cheating span from industry-supported expenditures to detect and limit cheating, to victims' monetary losses due to cyber crime. This paper studies cheaters in the Steam Community, an online social network built on top of the world's dominant digital game delivery platform. We collected information about more than 12 million gamers connected in a global social network, of which more than 700 thousand have their profiles flagged as cheaters. We also collected in-game interaction data of over 10 thousand players from a popular multiplayer gaming server. We show that cheaters are well embedded in the social and interaction networks: their network position is largely indistinguishable from that of fair players. We observe that the cheating behavior appears to spread through a social mechanism: the presence and the number of cheater friends of a fair player is correlated with the likelihood of her becoming a cheater in the future. Also, we observe that there is a social penalty involved with being labeled as a cheater: cheaters are likely to switch to more restrictive privacy settings once they are tagged and they lose more friends than fair players. Finally, we observe that the number of cheaters is not correlated with the geographical, real-world population density, or with the local popularity of the Steam Community.

Advertising on the web 1

Risk-aware revenue maximization in display advertising BIBAFull-Text 91-100
  Ana Radovanovic; William D. Heavlin
Display advertising is the graphical advertising on the World Wide Web (WWW) that appears next to content on web pages, instant messaging (IM) applications, email, etc. Over the past decade, display ads have evolved from simple banner and pop-up ads to include various combinations of text, images, audio, video, and animations. As a market segment, display continues to show substantial growth potential, as evidenced by companies such as Microsoft, Yahoo, and Google actively vying for market share. As a sales process, display ads are typically sold in packages, the result of negotiations between sales and advertising agents. A key component to any successful business model in display advertising is sound pricing. Main objectives for on-line publishers (e.g. Amazon, YouTube, CNN) are maximizing revenue while managing their available inventory appropriately, and pricing must reflect these considerations.
   This paper addresses the problem of maximizing revenue by adjusting prices of display inventory. We cast this as an inventory allocation problem. Our formal objective (a) maximizes revenue using (b) iterative price adjustments in the direction of the gradient of an appropriately constructed Lagrangian relaxation. We show that our optimization approach drives the revenue towards local maximum under mild conditions on the properties of the (unknown) demand curve.
   The major unknown for optimizing revenue in display environment is how the demand for display ads changes to prices, the classical demand curve. This we address directly, by way of a factorial pricing experiment. This enables us to estimate the gradient of the revenue function with respect to inventory prices. Overall, the result is a principled, risk-aware, and empirically efficient methodology.
   This paper is based on research undertaken on behalf of one of Google's clients.
Targeting converters for new campaigns through factor models BIBAFull-Text 101-110
  Deepak Agarwal; Sandeep Pandey; Vanja Josifovski
In performance based display advertising, campaign effectiveness is often measured in terms of conversions that represent some desired user actions like purchases and product information requests on advertisers' website. Hence, identifying and targeting potential converters is of vital importance to boost campaign performance. This is often accomplished by marketers who define the user base of campaigns based on behavioral, demographic, search, social, purchase, and other characteristics. Such a process is manual and subjective, it often fails to utilize the full potential of targeting. In this paper we show that by using past converted users of campaigns and campaign meta-data (e.g., ad creatives, landing pages), we can combine disparate user information in a principled way to effectively and automatically target converters for new/existing campaigns. At the heart of our approach is a factor model that estimates the affinity of each user feature to a campaign using historical conversion data. In fact, our approach allows building a conversion model for a brand new campaign through campaign meta-data alone, and hence targets potential converters even before the campaign is run. Through extensive experiments, we show the superiority of our factor model approach relative to several other baselines. Moreover, we show that the performance of our approach at the beginning of a campaign's life is typically better than the other models even when they are trained using all conversion data after the campaign has completed. This clearly shows the importance and value of using historical campaign data in constructing an effective audience selection strategy for display advertising.
How effective is targeted advertising? BIBAFull-Text 111-120
  Ayman Farahat; Michael C. Bailey
Advertisers are demanding more accurate estimates of the impact of targeted advertisements, yet no study proposes an appropriate methodology to analyze the effectiveness of a targeted advertising campaign, and there is a dearth of empirical evidence on the effectiveness of targeted advertising as a whole. The targeted population is more likely to convert from advertising so the response lift between the targeted and untargeted group to the advertising is likely an overestimate of the impact of targeted advertising. We propose a difference-in-differences estimator to account for this selection bias by decomposing the impact of targeting into selection bias and treatment effects components. Using several large-scale online advertising campaigns, we test the effectiveness of targeted advertising on brand-related searches and clickthrough rates. We find that the treatment effect on the targeted group is about twice as large for brand-related searches, but naively estimating this effect without taking into account selection bias leads to an overestimation of the lift from targeting on brand-related searches by almost 1,000%.

Semantic web approaches in search

Compressed data structures for annotated web search BIBAFull-Text 121-130
  Soumen Chakrabarti; Sasidhar Kasturi; Bharath Balakrishnan; Ganesh Ramakrishnan; Rohit Saraf
Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
The SemSets model for ad-hoc semantic list search BIBAFull-Text 131-140
  Marek Ciglan; Kjetil Nørvåg; Ladislav Hluchý
The amount of semantic data on the web has been growing rapidly in recent years. One of the key challenges triggered by this growth is the ad-hoc querying, i.e., the ability to retrieve answers from semantic resources using natural language queries. This facilitates interaction with semantic resources for the users so they can benefit from the knowledge covered by semantic data without the complexities of semantic query languages. In this paper, we focus on semantic queries, where the aim is to retrieve objects belonging to a set of semantically related entities. An example of such an ad-hoc type query is "Apollo astronauts who walked on the Moon". In order to address the task, we propose the SemSets retrieval model that exploits and combines traditional document-based information retrieval, link structure of the semantic data and entity membership in semantic sets, in order to provide the answers. The novelty of the approach lies in the utilization of semantic sets, i.e., groups of semantically related entities. We propose two approaches to identify such semantic sets from the knowledge bases; the first one requires involvement of an expert user knowledgeable of the data set structure, the second one is fully automatic and provides results that are comparable with those delivered by the expert users. As demonstrated in the experimental evaluation, the proposed model has the state-of-the-art performance on the SemSearch2011 data set, which has been designed especially for the semantic list search evaluation.
Heterogeneous web data search using relevance-based on the fly data integration BIBAFull-Text 141-150
  Daniel M. Herzig; Thanh Tran
Searching over heterogeneous structured data on the Web is challenging due to vocabulary and structure mismatches among different data sources. In this paper, we study two existing strategies and present a new approach to integrate additional data sources into the search process. The first strategy relies on data integration to mediate mismatches through upfront computation of mappings, based on which queries are rewritten to fit individual sources. The other extreme is keyword search, which does not require any up-front investment, but ignores structure information. Building on these strategies, we present a hybrid approach, which combines the advantages of both. Our approach does not require any upfront data integration, but also leverages the fine grained structure of the underlying data. For a structured query adhering to the vocabulary of just one source, the so-called seed query, we construct an entity relevance model (ERM), which captures the content and the structure of the seed query results. This ERM is then aligned on the fly with keyword search results retrieved from other sources and also used to rank these results. The outcome of our experiments using large-scale real-world data sets suggests that data integration leads to higher search effectiveness compared to keyword search and that our new hybrid approach consistently exceeds both strategies.

Web performance

TailGate: handling long-tail content with a little help from friends BIBAFull-Text 151-160
  Stefano Traverso; Kévin Huguenin; Ionut Triestan; Vijay Erramilli; Nikolaos Laoutaris; Kostantina Papagiannaki
Distributing long-tail content is an inherently difficult task due to the low amortization of bandwidth transfer costs as such content has limited number of views. Two recent trends are making this problem harder. First, the increasing popularity of user-generated content (UGC) and online social networks (OSNs) create and reinforce such popularity distributions. Second, the recent trend of geo-replicating content across multiple PoPs spread around the world, done for improving quality of experience (QoE) for users and for redundancy reasons, can lead to unnecessary bandwidth costs. We build TailGate, a system that exploits social relationships, regularities in read access patterns, and time-zone differences to efficiently and selectively distribute long-tail content across PoPs. We evaluate TailGate using large traces from an OSN and show that it can decrease WAN bandwidth costs by as much as 80% as well as reduce latency, improving QoE. We deploy TailGate on PlanetLab and show that even in the case when imprecise social information is available, TailGate can still decrease the latency for accessing long-tail YouTube videos by a factor of 2.
DOHA: scalable real-time web applications through adaptive concurrent execution BIBAFull-Text 161-170
  Aiman Erbad; Norman C. Hutchinson; Charles Krasic
Browsers have become mature execution platforms enabling web applications to rival their desktop counterparts. An important class of such applications is interactive multimedia: games, animations, and interactive visualizations. Unlike many early web applications, these applications are latency sensitive and processing (CPU and graphics) intensive. When demands exceed available resources, application quality (e.g., frame rate) diminishes because it is hard to balance timeliness and utilization. The quality of ambitious web applications is also limited by single-threaded execution prevalent in the Web. Applications need to scale their quality, and thereby scale processing load, based on the resources that are available. We refer to this as scalable quality.
   DOHA is an execution layer written entirely in JavaScript to enable scalable quality in web applications. DOHA favors important computations with more influence over quality based on hints from application-specific adaptation policies. To utilize widely available multi-core resources, DOHA augments HTML5 web workers with mechanisms to facilitate state management and load-balancing. We evaluate DOHA with an award-winning web-based game. When resources are limited, the modified game has better timing and overall quality. More importantly, quality scales linearly with a small number of cores and the game is playable in challenging scenarios that are beyond the scope of the original game.
Surviving a search engine overload BIBAFull-Text 171-180
  Aaron Koehl; Haining Wang
Search engines are an essential component of the web, but their web crawling agents can impose a significant burden on heavily loaded web servers. Unfortunately, blocking or deferring web crawler requests is not a viable solution due to economic consequences. We conduct a quantitative measurement study on the impact and cost of web crawling agents, seeking optimization points for this class of request. Based on our measurements, we present a practical caching approach for mitigating search engine overload, and implement the two-level cache scheme on a very busy web server. Our experimental results show that the proposed caching framework can effectively reduce the impact of search engine overload on service quality.

Fraud and bias in user ratings

Semi-supervised correction of biased comment ratings BIBAFull-Text 181-190
  Abhinav Mishra; Rajeev Rastogi
In many instances, offensive comments on the internet attract a disproportionate number of positive ratings from highly biased users. This results in an undesirable scenario where these offensive comments are the top rated ones. In this paper, we develop semi-supervised learning techniques to correct the bias in user ratings of comments. Our scheme uses a small number of comment labels in conjunction with user rating information to iteratively compute user bias and unbiased ratings for unlabeled comments. We show that the running time of each iteration is linear in the number of ratings, and the system converges to a unique fixed point. To select the comments to label, we devise an active learning algorithm based on empirical risk minimization. Our active learning method incrementally updates the risk for neighboring comments each time a comment is labeled, and thus can easily scale to large comment datasets. On real-life comments from Yahoo! News, our semi-supervised and active learning algorithms achieve higher accuracy than simple baselines, with few labeled examples.
Spotting fake reviewer groups in consumer reviews BIBAFull-Text 191-200
  Arjun Mukherjee; Bing Liu; Natalie Glance
Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, due to the reason of profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or demote some target products. For reviews to reflect genuine user experiences and opinions, such spam reviews should be detected. Prior works on opinion spam focused on detecting fake reviews and individual fake reviewers. However, a fake reviewer group (a group of reviewers who work collaboratively to write fake reviews) is even more damaging as they can take total control of the sentiment on the target product due to its size. This paper studies spam detection in the collaborative setting, i.e., to discover fake reviewer groups. The proposed method first uses a frequent itemset mining method to find a set of candidate groups. It then uses several behavioral models derived from the collusion phenomenon among fake reviewers and relation models based on the relationships among groups, individual reviewers, and products they reviewed to detect fake reviewer groups. Additionally, we also built a labeled dataset of fake reviewer groups. Although labeling individual fake reviews and reviewers is very hard, to our surprise labeling fake reviewer groups is much easier. We also note that the proposed technique departs from the traditional supervised learning approach for spam detection because of the inherent nature of our problem which makes the classic supervised learning approach less effective. Experimental results show that the proposed method outperforms multiple strong baselines including the state-of-the-art supervised classification, regression, and learning to rank algorithms.
Estimating the prevalence of deception in online review communities BIBAFull-Text 201-210
  Myle Ott; Claire Cardie; Jeff Hancock
Consumers' purchase decisions are increasingly influenced by user-generated online reviews. Accordingly, there has been growing concern about the potential for posting deceptive opinion spam -- fictitious reviews that have been deliberately written to sound authentic, to deceive the reader. But while this practice has received considerable public attention and concern, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still about the factors that influence it.
   We propose a generative model of deception which, in conjunction with a deception classifier, we use to explore the prevalence of deception in six popular online review communities: Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp. We additionally propose a theoretical model of online reviews based on economic signaling theory, in which consumer reviews diminish the inherent information asymmetry between consumers and producers, by acting as a signal to a product's true, unknown quality. We find that deceptive opinion spam is a growing problem overall, but with different growth rates across communities. These rates, we argue, are driven by the different signaling costs associated with deception for each review community, e.g., posting requirements. When measures are taken to increase signaling cost, e.g., filtering reviews written by first-time reviewers, deception prevalence is effectively reduced.

Mobile and file-sharing users

Musubi: disintermediated interactive social feeds for mobile devices BIBAFull-Text 211-220
  Ben Dodson; Ian Vo; T. J. Purtell; Aemon Cannon; Monica Lam
This paper presents Musubi, a mobile social application platform that enables users to share any data type in real-time feeds created by any application on the phone. Musubi is unique in providing a disintermediated service to end users; all communication is supported using public key encryption thus leaking no user information to a third party. Despite the heavy use of cryptography to provide user authentication and access control, users found Musubi simple to use. We embed key exchange within familiar friending actions, and allow users to interact with any friend in their address books without requiring them to join a common network a priori. Our feed abstraction allows users to easily exercise access control. All data reside on the phone, granting users the freedom to apply applications of their choice.
   In addition to disintermediating personal messaging, we have created an application platform to support multi-party software with the same respect for personal data. The SocialKit library we created on top of Musubi's trusted communication protocol facilitates the development of multi-party applications and integrates with Musubi to provide a compelling group application experience. SocialKit allows developers to make social, interactive, privacy-honoring applications without needing to host their own servers.
Economics of BitTorrent communities BIBAFull-Text 221-230
  Ian A. Kash; John K. Lai; Haoqi Zhang; Aviv Zohar
Over the years, private file-sharing communities built on the BitTorrent protocol have developed their own policies and mechanisms for motivating members to share content and contribute resources. By requiring members to maintain a minimum ratio between uploads and downloads, private communities effectively establish credit systems, and with them full-fledged economies. We report on a half-year-long measurement study of DIME -- a community for sharing live concert recordings -- that sheds light on the economic forces affecting users in such communities. A key observation is that while the download of files is priced only according to the size of the file, the rate of return for seeding new files is significantly greater than for seeding old files. We find via a natural experiment that users react to such differences in resale value by preferentially consuming older files during a 'free leech' period. We consider implications of these finding on a user's ability to earn credits and meet ratio enforcements, focusing in particular on the relationship between visitation frequency and wealth and on low bandwidth users. We then share details from an interview with DIME moderators, which highlights the goals of the community based on which we make suggestions for possible improvement.
A habit mining approach for discovering similar mobile users BIBAFull-Text 231-240
  Haiping Ma; Huanhuan Cao; Qiang Yang; Enhong Chen; Jilei Tian
Discovering similar users with respect to their habits plays an important role in a wide range of applications, such as collaborative filtering for recommendation, user segmentation for market analysis, etc. Recently, the progressing ability to sense user contexts of smart mobile devices makes it possible to discover mobile users with similar habits by mining their habits from their mobile devices. However, though some researchers have proposed effective methods for mining user habits such as behavior pattern mining, how to leverage the mined results for discovering similar users remains less explored. To this end, we propose a novel approach for conquering the sparseness of behavior pattern space and thus make it possible to discover similar mobile users with respect to their habits by leveraging behavior pattern mining. To be specific, first, we normalize the raw context log of each user by transforming the location-based context data and user interaction records to more general representations. Second, we take advantage of a constraint-based Bayesian Matrix Factorization model for extracting the latent common habits among behavior patterns and then transforming behavior pattern vectors to the vectors of mined common habits which are in a much more dense space. The experiments conducted on real data sets show that our approach outperforms three baselines in terms of the effectiveness of discovering similar mobile users with respect to their habits.

Behavioral analysis and content characterization in social media

YouTube around the world: geographic popularity of videos BIBAFull-Text 241-250
  Anders Brodersen; Salvatore Scellato; Mirjam Wattenhofer
One of the most popular user activities on the Web is watching videos. Services like YouTube, Vimeo, and Hulu host and stream millions of videos, providing content that is on par with TV. While some of this content is popular all over the globe, some videos might be only watched in a confined, local region.
   In this work we study the relationship between popularity and locality of online YouTube videos. We investigate whether YouTube videos exhibit geographic locality of interest, with views arising from a confined spatial area rather than from a global one. Our analysis is done on a corpus of more than 20 millions YouTube videos, uploaded over one year from different regions. We find that about 50% of the videos have more than 70% of their views in a single region. By relating locality to viralness we show that social sharing generally widens the geographic reach of a video. If, however, a video cannot carry its social impulse over to other means of discovery, it gets stuck in a more confined geographic region. Finally, we analyze how the geographic properties of a video's views evolve on a daily basis during its lifetime, providing new insights on how the geographic reach of a video changes as its popularity peaks and then fades away.
   Our results demonstrate how, despite the global nature of the Web, online video consumption appears constrained by geographic locality of interest: this has a potential impact on a wide range of systems and applications, spanning from delivery networks to recommendation and discovery engines, providing new directions for future research.
Dynamical classes of collective attention in twitter BIBAFull-Text 251-260
  Janette Lehmann; Bruno Gonçalves; José J. Ramasco; Ciro Cattuto
Micro-blogging systems such as Twitter expose digital traces of social discourse with an unprecedented degree of resolution of individual behaviors. They offer an opportunity to investigate how a large-scale social system responds to exogenous or endogenous stimuli, and to disentangle the temporal, spatial and topical aspects of users' activity. Here we focus on spikes of collective attention in Twitter, and specifically on peaks in the popularity of hashtags. Users employ hashtags as a form of social annotation, to define a shared context for a specific event, topic, or meme. We analyze a large-scale record of Twitter activity and find that the evolution of hashtag popularity over time defines discrete classes of hashtags. We link these dynamical classes to the events the hashtags represent and use text mining techniques to provide a semantic characterization of the hashtag classes. Moreover, we track the propagation of hashtags in the Twitter social network and find that epidemic spreading plays a minor role in hashtag popularity, which is mostly driven by exogenous factors.
We know what @you #tag: does the dual role affect hashtag adoption? BIBAFull-Text 261-270
  Lei Yang; Tao Sun; Ming Zhang; Qiaozhu Mei
Researchers and social observers have both believed that hashtags, as a new type of organizational objects of information, play a dual role in online microblogging communities (e.g., Twitter). On one hand, a hashtag serves as a bookmark of content, which links tweets with similar topics; on the other hand, a hashtag serves as the symbol of a community membership, which bridges a virtual community of users. Are the real users aware of this dual role of hashtags? Is the dual role affecting their behavior of adopting a hashtag? Is hashtag adoption predictable? We take the initiative to investigate and quantify the effects of the dual role on hashtag adoption. We propose comprehensive measures to quantify the major factors of how a user selects content tags as well as joins communities. Experiments using large scale Twitter datasets prove the effectiveness of the dual role, where both the content measures and the community measures significantly correlate to hashtag adoption on Twitter. With these measures as features, a machine learning model can effectively predict the future adoption of hashtags that a user has never used before.

Creating and using links between data objects

Factorizing YAGO: scalable machine learning for linked data BIBAFull-Text 271-280
  Maximilian Nickel; Volker Tresp; Hans-Peter Kriegel
Vast amounts of structured information have been published in the Semantic Web's Linked Open Data (LOD) cloud and their size is still growing rapidly. Yet, access to this information via reasoning and querying is sometimes difficult, due to LOD's size, partial data inconsistencies and inherent noisiness. Machine Learning offers an alternative approach to exploiting LOD's data with the advantages that Machine Learning algorithms are typically robust to both noise and data inconsistencies and are able to efficiently utilize non-deterministic dependencies in the data. From a Machine Learning point of view, LOD is challenging due to its relational nature and its scale. Here, we present an efficient approach to relational learning on LOD data, based on the factorization of a sparse tensor that scales to data consisting of millions of entities, hundreds of relations and billions of known facts. Furthermore, we show how ontological knowledge can be incorporated in the factorization to improve learning results and how computation can be distributed across multiple nodes. We demonstrate that our approach is able to factorize the YAGO 2 core ontology and globally predict statements for this large knowledge base using a single dual-core desktop computer. Furthermore, we show experimentally that our approach achieves good results in several relational learning tasks that are relevant to Linked Data. Once a factorization has been computed, our model is able to predict efficiently, and without any additional training, the likelihood of any of the 4.3 * 10^14 possible triples in the YAGO 2 core ontology.
Semantic navigation on the web of data: specification of routes, web fragments and actions BIBAFull-Text 281-290
  Valeria Fionda; Claudio Gutierrez; Giuseppe Pirró
The massive semantic data sources linked in the Web of Data give new meaning to old features like navigation; introduce new challenges like semantic specification of Web fragments; and make it possible to specify actions relying on semantic data. In this paper we introduce a declarative language to face these challenges. Based on navigational features, it is designed to specify fragments of the Web of Data and actions to be performed based on these data. We implement it in a centralized fashion, and show its power and performance. Finally, we explore the same ideas in a distributed setting, showing their feasibility, potentialities and challenges.
Understanding web images by object relation network BIBAFull-Text 291-300
  Na Chen; Qian-Yi Zhou; Viktor Prasanna
This paper presents an automatic method for understanding and interpreting the semantics of unannotated web images. We observe that the relations between objects in an image carry important semantics about the image. To capture and describe such semantics, we propose Object Relation Network (ORN), a graph model representing the most probable meaning of the objects and their relations in an image. Guided and constrained by an ontology, ORN transfers the rich semantics in the ontology to image objects and the relations between them, while maintaining semantic consistency (e.g., a soccer player can kick a soccer ball, but cannot ride it). We present an automatic system which takes a raw image as input and creates an ORN based on image visual appearance and the guide ontology. We demonstrate various useful web applications enabled by ORNs, such as automatic image tagging, automatic image description generation, and image search by image.

Security 1

Investigating the distribution of password choices BIBAFull-Text 301-310
  David Malone; Kevin Maher
The distribution of passwords chosen by users has implications for site security, password-handling algorithms and even how users are permitted to select passwords. Using password lists from four different web sites, we investigate if Zipf's law is a good description of the frequency with which passwords are chosen. We use a number of standard statistics, which measure the security of password distributions, to see if modelling the data using a simple distribution is effective. We then consider how much the password distributions from each site have in common, using password cracking as a metric. This shows that these distributions have enough high-frequency passwords in common to provide effective speed-ups for cracking passwords. Finally, as an alternative to a deterministic banned list, we will show how to stochastically shape the distribution of passwords, by occasionally asking users to choose a different password.
Is this app safe?: a large scale study on application permissions and risk signals BIBAFull-Text 311-320
  Pern Hui Chia; Yusuke Yamamoto; N. Asokan
Third-party applications (apps) drive the attractiveness of web and mobile application platforms. Many of these platforms adopt a decentralized control strategy, relying on explicit user consent for granting permissions that the apps request. Users have to rely primarily on community ratings as the signals to identify the potentially harmful and inappropriate apps even though community ratings typically reflect opinions about perceived functionality or performance rather than about risks. With the arrival of HTML5 web apps, such user-consent permission systems will become more widespread. We study the effectiveness of user-consent permission systems through a large scale data collection of Facebook apps, Chrome extensions and Android apps. Our analysis confirms that the current forms of community ratings used in app markets today are not reliable indicators of privacy risks of an app. We find some evidence indicating attempts to mislead or entice users into granting permissions: free applications and applications with mature content request more permissions than is typical; 'look-alike' applications which have names similar to popular applications also request more permissions than is typical. We also find that across all three platforms popular applications request more permissions than average.
SessionJuggler: secure web login from an untrusted terminal using session hijacking BIBAFull-Text 321-330
  Elie Bursztein; Chinmay Soman; Dan Boneh; John C. Mitchell
We use modern features of web browsers to develop a secure login system from an untrusted terminal. The system, called Session Juggler, requires no server-side changes and no special software on the terminal beyond a modern web browser. This important property makes adoption much easier than with previous proposals. With Session Juggler users never enter their long term credential on the untrusted terminal. Instead, users log in to a web site using a smartphone app and then transfer the entire session, including cookies and all other session state, to the untrusted terminal. We show that Session Juggler works on all the Alexa top 100 sites except eight. Of those eight, five failures were due to the site enforcing IP session binding. We also show that Session Juggler works flawlessly with Facebook connect. Beyond login, Session Juggler also provides a secure logout mechanism where the trusted phone is used to kill the session. To validate the session juggling concept we conducted a number of web site surveys that are of independent interest. First, we survey how web sites bind a session token to a specific device and show that most use fairly basic techniques that are easily defeated. Second, we survey how web sites handle logout and show that many popular sites surprisingly do not properly handle logout requests.

Community detection in social networks

Using content and interactions for discovering communities in social networks BIBAFull-Text 331-340
  Mrinmaya Sachan; Danish Contractor; Tanveer A. Faruquie; L. Venkata Subramaniam
In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact over diverse geographical regions. Social network provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and interests among groups of people. In this paper, we address the problem of discovering topically meaningful communities from a social network. We assume that a persons' membership in a community is conditioned on its social relationship, the type of interaction and the information communicated with other members of that community. We propose generative models that can discover communities based on the discussed topics, interaction types and the social connections among people. In our models a person can belong to multiple communities and a community can participate in multiple topics. This allows us to discover both community interests and user interests based on the information and linked associations. We demonstrate the effectiveness of our model on two real word data sets and show that it performs better than existing community discovery models.
Community detection in incomplete information networks BIBAFull-Text 341-350
  Wangqun Lin; Xiangnan Kong; Philip S. Yu; Quanyuan Wu; Yan Jia; Chuan Li
With the recent advances in information networks, the problem of community detection has attracted much attention in the last decade. While network community detection has been ubiquitous, the task of collecting complete network data remains challenging in many real-world applications. Usually the collected network is incomplete with most of the edges missing. Commonly, in such networks, all nodes with attributes are available while only the edges within a few local regions of the network can be observed. In this paper, we study the problem of detecting communities in incomplete information networks with missing edges. We first learn a distance metric to reproduce the link-based distance between nodes from the observed edges in the local information regions. We then use the learned distance metric to estimate the distance between any pair of nodes in the network. A hierarchical clustering approach is proposed to detect communities within the incomplete information networks. Empirical studies on real-world information networks demonstrate that our proposed method can effectively detect community structures within incomplete information networks.
QUBE: a quick algorithm for updating betweenness centrality BIBAFull-Text 351-360
  Min-Joong Lee; Jungmin Lee; Jaimie Yejean Park; Ryan Hyun Choi; Chin-Wan Chung
The betweenness centrality of a vertex in a graph is a measure for the participation of the vertex in the shortest paths in the graph. The Betweenness centrality is widely used in network analyses. Especially in a social network, the recursive computation of the betweenness centralities of vertices is performed for the community detection and finding the influential user in the network. Since a social network graph is frequently updated, it is necessary to update the betweenness centrality efficiently. When a graph is changed, the betweenness centralities of all the vertices should be recomputed from scratch using all the vertices in the graph. To the best of our knowledge, this is the first work that proposes an efficient algorithm which handles the update of the betweenness centralities of vertices in a graph. In this paper, we propose a method that efficiently reduces the search space by finding a candidate set of vertices whose betweenness centralities can be updated and computes their betweenness centeralities using candidate vertices only. As the cost of calculating the betweenness centrality mainly depends on the number of vertices to be considered, the proposed algorithm significantly reduces the cost of calculation. The proposed algorithm allows the transformation of an existing algorithm which does not consider the graph update. Experimental results on large real datasets show that the proposed algorithm speeds up the existing algorithm 2 to 2418 times depending on the dataset.

Advertising on the web 2

On revenue in the generalized second price auction BIBAFull-Text 361-370
  Brendan Lucier; Renato Paes Leme; Eva Tardos
The Generalized Second Price (GSP) auction is the primary auction used for selling sponsored search advertisements. In this paper we consider the revenue of this auction at equilibrium. We prove that if agent values are drawn from identical regular distributions, then the GSP auction paired with an appropriate reserve price generates a constant fraction (1/6th) of the optimal revenue. In the full-information game, we show that at any Nash equilibrium of the GSP auction obtains at least half of the revenue of the VCG mechanism excluding the payment of a single participant. This bound holds also with any reserve price, and is tight.
   Finally, we consider the tradeoff between maximizing revenue and social welfare. We introduce a natural convexity assumption on the click-through rates and show that it implies that the revenue-maximizing equilibrium of GSP in the full information model will necessarily be envy-free. In particular, it is always possible to maximize revenue and social welfare simultaneously when click-through rates are convex. Without this convexity assumption, however, we demonstrate that revenue may be maximized at a non-envy-free equilibrium that generates a socially inefficient allocation.
Handling forecast errors while bidding for display advertising BIBAFull-Text 371-380
  Kevin J. Lang; Benjamin Moseley; Sergei Vassilvitskii
Most of the online advertising today is sold via an auction, which requires the advertiser to respond with a valid bid within a fraction of a second. As such, most advertisers employ bidding agents to submit bids on their behalf. The architecture of such agents typically has (1) an offline optimization phase which incorporates the bidder's knowledge about the market and (2) an online bidding strategy which simply executes the offline strategy. The online strategy is typically highly dependent on both supply and expected price distributions, both of which are forecast using traditional machine learning methods. In this work we investigate the optimum strategy of the bidding agent when faced with incorrect forecasts. At a high level, the agent can invest resources in improving the forecasts, or can tighten the loop between successive offline optimization cycles in order to detect errors more quickly. We show analytically that the latter strategy, while simple, is extremely effective in dealing with forecast errors, and confirm this finding with experimental evaluations.
Optimizing budget allocation among channels and influencers BIBAFull-Text 381-388
  Noga Alon; Iftah Gamzu; Moshe Tennenholtz
Brands and agencies use marketing as a tool to influence customers. One of the major decisions in a marketing plan deals with the allocation of a given budget among media channels in order to maximize the impact on a set of potential customers. A similar situation occurs in a social network, where a marketing budget needs to be distributed among a set of potential influencers in a way that provides high-impact.
   We introduce several probabilistic models to capture the above scenarios. The common setting of these models consists of a bipartite graph of source and target nodes. The objective is to allocate a fixed budget among the source nodes to maximize the expected number of influenced target nodes. The concrete way in which source nodes influence target nodes depends on the underlying model. We primarily consider two models: a source-side influence model, in which a source node that is allocated a budget of k makes k independent trials to influence each of its neighboring target nodes, and a target-side influence model, in which a target node becomes influenced according to a specified rule that depends on the overall budget allocated to its neighbors. Our main results are an optimal (1-1/e)-approximation algorithm for the source-side model, and several inapproximability results for the target-side model, establishing that influence maximization in the latter model is provably harder.

Search

Structured query suggestion for specialization and parallel movement: effect on search behaviors BIBAFull-Text 389-398
  Makoto P. Kato; Tetsuya Sakai; Katsumi Tanaka
Query suggestion, which enables the user to revise a query with a single click, has become one of the most fundamental features of Web search engines. However, it is often difficult for the user to choose from a list of query suggestions, and to understand the relation between an input query and suggested ones. In this paper, we propose a new method to present query suggestions to the user, which has been designed to help two popular query reformulation actions, namely, specialization (e.g. from "nikon" to "nikon camera") and parallel movement (e.g. from "nikon camera" to "canon camera"). Using a query log collected from a popular commercial Web search engine, our prototype called SParQS classifies query suggestions into automatically generated categories and generates a label for each category. Moreover, SParQS presents some new entities as alternatives to the original query (e.g. "canon" in response to the query "nikon"), together with their query suggestions classified in the same way as the original query's suggestions. We conducted a task-based user study to compare SParQS with a traditional "flat list" query suggestion interface. Our results show that the SParQS interface enables subjects to search more successfully than the flat list case, even though query suggestions presented were exactly the same in the two interfaces. In addition, the subjects found the query suggestions more helpful when they were presented in the SParQS interface rather than in a flat list.
Partitioned multi-indexing: bringing order to social search BIBAFull-Text 399-408
  Bahman Bahmani; Ashish Goel
To answer search queries on a social network rich with user-generated content, it is desirable to give a higher ranking to content that is closer to the individual issuing the query. Queries occur at nodes in the network, documents are also created by nodes in the same network, and the goal is to find the document that matches the query and is closest in network distance to the node issuing the query. In this paper, we present the "Partitioned Multi-Indexing" scheme, which provides an approximate solution to this problem. With m links in the network, after an offline O(m) pre-processing time, our scheme allows for social index operations (i.e., social search queries, as well as insertion and deletion of words into and from a document at any node), all in time O(1). Further, our scheme can be implemented on open source distributed streaming systems such as Yahoo! S4 or Twitter's Storm so that every social index operation takes O(1) processing time and network queries in the worst case, and just two network queries in the common case where the reverse index corresponding to the query keyword is much smaller than the memory available at any distributed compute node. Building on Das Sarma et al.'s approximate distance oracle, the worst-case approximation ratio of our scheme is O(1) for undirected networks. Our simulations on the social network Twitter as well as synthetic networks show that in practice, the approximation ratio is actually close to 1 for both directed and undirected networks. We believe that this work is the first demonstration of the feasibility of social search with real-time text updates at large scales.
Unsupervised extraction of template structure in web search queries BIBAFull-Text 409-418
  Sandeep Pandey; Kunal Punera
Web search queries are an encoding of the user's search intent and extracting structured information from them can facilitate central search engine operations like improving the ranking of search results and advertisements. Not surprisingly, this area has attracted a lot of attention in the research community in the last few years. The problem is, however, made challenging by the fact that search queries tend to be extremely succinct; a condensation of user search needs to the bare-minimum set of keywords. In this paper we consider the problem of extracting, with no manual intervention, the hidden structure behind the observed search queries in a domain: the origins of the constituent keywords as well as the manner the individual keywords are assembled together. We formalize important properties of the problem and then give a principled solution based on generative models that satisfies these properties. Using manually labeled data we show that the query templates extracted by our solution are superior to those discovered by strong baseline methods.
   The query templates extracted by our approach have potential uses in many search engine tasks; query answering, advertisement matching and targeting, to name a few. In this paper we study one such task, estimating Query-Advertisability, and empirically demonstrate that using extracted template information can improve performance over and above the current state-of-the-art.

Obtaining and leveraging user comments

Multi-objective ranking of comments on web BIBAFull-Text 419-428
  Onkar Dalal; Srinivasan H. Sengemedu; Subhajit Sanyal
With the explosion of information on any topic, the need for ranking is becoming very critical. Ranking typically depends on several aspects. Products, for example, have several aspects like price, recency, rating, etc. Product ranking has to bring the "best" product which is recent and highly rated. Hence ranking has to satisfy multiple objectives. In this paper, we explore multi-objective ranking of comments using Hodge decomposition. While Hodge decomposition produces a globally consistent ranking, a globally inconsistent component is also present. We propose an active learning strategy for the reduction of this component. Finally, we develop techniques for online Hodge decomposition. We experimentally validate the ideas presented in this paper.
Care to comment?: recommendations for commenting on news stories BIBAFull-Text 429-438
  Erez Shmueli; Amit Kagian; Yehuda Koren; Ronny Lempel
Many websites provide commenting facilities for users to express their opinions or sentiments with regards to content items, such as, videos, news stories, blog posts, etc. Previous studies have shown that user comments contain valuable information that can provide insight on Web documents and may be utilized for various tasks. This work presents a model that predicts, for a given user, suitable news stories for commenting. The model achieves encouraging results regarding the ability to connect users with stories they are likely to comment on. This provides grounds for personalized recommendations of stories to users who may want to take part in their discussion. We combine a content-based approach with a collaborative-filtering approach (utilizing users' co-commenting patterns) in a latent factor modeling framework. We experiment with several variations of the model's loss function in order to adjust it to the problem domain. We evaluate the results on two datasets and show that employing co-commenting patterns improves upon using content features alone, even with as few as two available comments per story. Finally, we try to incorporate available social network data into the model. Interestingly, the social data does not lead to substantial performance gains, suggesting that the value of social data for this task is quite negligible.
Leveraging user comments for aesthetic aware image search reranking BIBAFull-Text 439-448
  Jose San Pedro; Tom Yeh; Nuria Oliver
The increasing number of images available online has created a growing need for efficient ways to search for relevant content. Text-based query search is the most common approach to retrieve images from the Web. In this approach, the similarity between the input query and the metadata of images is used to find relevant information. However, as the amount of available images grows, the number of relevant images also increases, all of them sharing very similar metadata but differing in other visual characteristics. This paper studies the influence of visual aesthetic quality in search results as a complementary attribute to relevance. By considering aesthetics, a new ranking parameter is introduced aimed at improving the quality at the top ranks when large amounts of relevant results exist. Two strategies for aesthetic rating inference are proposed: one based on visual content, another based on the analysis of user comments to detect opinions about the quality of images. The results of a user study with $58$ participants show that the comment-based aesthetic predictor outperforms the visual content-based strategy, and reveals that aesthetic-aware rankings are preferred by users searching for photographs on the Web.

Entity linking

LINDEN: linking named entities with knowledge base via semantic knowledge BIBAFull-Text 449-458
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang
Integrating the extracted facts with an existing knowledge base has raised an urgent need to address the problem of entity linking. Specifically, entity linking is the task to link the entity mention in text with the corresponding real world entity in the existing knowledge base. However, this task is challenging due to name ambiguity, textual inconsistency, and lack of world knowledge in the knowledge base. Several methods have been proposed to tackle this problem, but they are largely based on the co-occurrence statistics of terms between the text around the entity mention and the document associated with the entity. In this paper, we propose LINDEN, a novel framework to link named entities in text with a knowledge base unifying Wikipedia and WordNet, by leveraging the rich semantic knowledge embedded in the Wikipedia and the taxonomy of the knowledge base. We extensively evaluate the performance of our proposed LINDEN over two public data sets and empirical results show that LINDEN significantly outperforms the state-of-the-art methods in terms of accuracy.
Cross-lingual knowledge linking across wiki knowledge bases BIBAFull-Text 459-468
  Zhichun Wang; Juanzi Li; Zhigang Wang; Jie Tang
Wikipedia becomes one of the largest knowledge bases on the Web. It has attracted 513 million page views per day in January 2012. However, one critical issue for Wikipedia is that articles in different language are very unbalanced. For example, the number of articles on Wikipedia in English has reached 3.8 million, while the number of Chinese articles is still less than half million and there are only 217 thousand cross-lingual links between articles of the two languages. On the other hand, there are more than 3.9 million Chinese Wiki articles on Baidu Baike and Hudong.com, two popular encyclopedias in Chinese. One important question is how to link the knowledge entries distributed in different knowledge bases. This will immensely enrich the information in the online knowledge bases and benefit many applications. In this paper, we study the problem of cross-lingual knowledge linking and present a linkage factor graph model. Features are defined according to some interesting observations. Experiments on the Wikipedia data set show that our approach can achieve a high precision of 85.8% with a recall of 88.1%. The approach found 202,141 new cross-lingual links between English Wikipedia and Baidu Baike.
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking BIBAFull-Text 469-478
  Gianluca Demartini; Djellel Eddine Difallah; Philippe Cudré-Mauroux
We tackle the problem of entity linking for large collections of online pages; Our system, ZenCrowd, identifies entities from natural language text using state of the art techniques and automatically connects them to the Linked Open Data cloud. We show how one can take advantage of human intelligence to improve the quality of the links by dynamically generating micro-tasks on an online crowdsourcing platform. We develop a probabilistic framework to make sensible decisions about candidate links and to identify unreliable human workers. We evaluate ZenCrowd in a real deployment and show how a combination of both probabilistic reasoning and crowdsourcing techniques can significantly improve the quality of the links, while limiting the amount of work performed by the crowd.

Search: evaluation and ranking

A flexible generative model for preference aggregation BIBAFull-Text 479-488
  Maksims N. Volkovs; Richard S. Zemel
Many areas of study, such as information retrieval, collaborative filtering, and social choice face the preference aggregation problem, in which multiple preferences over objects must be combined into a consensus ranking. Preferences over items can be expressed in a variety of forms, which makes the aggregation problem difficult. In this work we formulate a flexible probabilistic model over pairwise comparisons that can accommodate all these forms. Inference in the model is very fast, making it applicable to problems with hundreds of thousands of preferences. Experiments on benchmark datasets demonstrate superior performance to existing methods.
Evaluating the effectiveness of search task trails BIBAFull-Text 489-498
  Zhen Liao; Yang Song; Li-wei He; Yalou Huang
In this paper, we introduce "task trail" as a new concept to understand user search behaviors. We define task to be an atomic user information need. Web search logs have been studied mainly at session or query level where users may submit several queries within one task and handle several tasks within one session. Although previous studies have addressed the problem of task identification, little is known about the advantage of using task over session and query for search applications. In this paper, we conduct extensive analyses and comparisons to evaluate the effectiveness of task trails in three search applications: determining user satisfaction, predicting user search interests, and query suggestion. Experiments are conducted on large scale datasets from a commercial search engine. Experimental results show that: (1) Sessions and queries are not as precise as tasks in determining user satisfaction. (2) Task trails provide higher web page utilities to users than other sources. (3) Tasks represent atomic user information needs, and therefore can preserve topic similarity between query pairs. (4) Task-based query suggestion can provide complementary results to other models. The findings in this paper verify the need to extract task trails from web search logs and suggest potential applications in search and recommendation systems.
Evaluation with informational and navigational intents BIBAFull-Text 499-508
  Tetsuya Sakai
Given an ambiguous or underspecified query, search result diversification aims at accommodating different user intents within a single "entry-point" result page. However, some intents are informational, for which many relevant pages may help, while others are navigational, for which only one web page is required. We propose new evaluation metrics for search result diversification that considers this distinction, as well as a simple method for comparing the intuitiveness of a given pair of metrics quantitatively. Our main experimental findings are: (a) In terms of discriminative power which reflects statistical reliability, the proposed metrics, DIN#-nDCG and P+Q#, are comparable to intent recall and D#-nDCG, and possibly superior to α-nDCG; (b) In terms of preference agreement with intent recall, P+Q# is superior to other diversity metrics and therefore may be the most intuitive as a metric that emphasises diversity; and (c) In terms of preference agreement with effective precision, DIN#-nDCG is superior to other diversity metrics and therefore may be the most intuitive as a metric that emphasises relevance. Moreover, DIN#-nDCG may be the most intuitive as a metric that considers both diversity and relevance. In addition, we demonstrate that the randomised Tukey's Honestly Significant Differences test that takes the entire set of available runs into account is substantially more conservative than the paired bootstrap test that only considers one run pair at a time, and therefore recommend the former approach for significance testing when a set of runs is available for evaluation.

Information diffusion in social networks

Information transfer in social media BIBAFull-Text 509-518
  Greg Ver Steeg; Aram Galstyan
Recent research has explored the increasingly important role of social media by examining the dynamics of individual and group behavior, characterizing patterns of information diffusion, and identifying influential individuals. In this paper we suggest a measure of causal relationships between nodes based on the information-theoretic notion of transfer entropy, or information transfer. This theoretically grounded measure is based on dynamic information, captures fine-grain notions of influence, and admits a natural, predictive interpretation. Networks inferred by transfer entropy can differ significantly from static friendship networks because most friendship links are not useful for predicting future dynamics. We demonstrate through analysis of synthetic and real-world data that transfer entropy reveals meaningful hidden network structures. In addition to altering our notion of who is influential, transfer entropy allows us to differentiate between weak influence over large groups and strong influence over small groups.
The role of social networks in information diffusion BIBAFull-Text 519-528
  Eytan Bakshy; Itamar Rosenn; Cameron Marlow; Lada Adamic
Online social networking technologies enable individuals to simultaneously share information with any number of peers. Quantifying the causal effect of these mediums on the dissemination of information requires not only identification of who influences whom, but also of whether individuals would still propagate information in the absence of social signals about that information. We examine the role of social networks in online information diffusion with a large-scale field experiment that randomizes exposure to signals about friends' information sharing among 253 million subjects in situ. Those who are exposed are significantly more likely to spread information, and do so sooner than those who are not exposed. We further examine the relative role of strong and weak ties in information propagation. We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information. This suggests that weak ties may play a more dominant role in the dissemination of information online than currently believed.
Recommendations to boost content spread in social networks BIBAFull-Text 529-538
  Vineet Chaoji; Sayan Ranu; Rajeev Rastogi; Rushi Bhatt
Content sharing in social networks is a powerful mechanism for discovering content on the Internet. The degree to which content is disseminated within the network depends on the connectivity relationships among network nodes. Existing schemes for recommending connections in social networks are based on the number of common neighbors, similarity of user profiles, etc. However, such similarity-based connections do not consider the amount of content discovered.
   In this paper, we propose novel algorithms for recommending connections that boost content propagation in a social network without compromising on the relevance of the recommendations. Unlike existing work on influence propagation, in our environment, we are looking for edges instead of nodes, with a bound on the number of incident edges per node. We show that the content spread function is not submodular, and develop approximation algorithms for computing a near-optimal set of edges. Through experiments on real-world social graphs such as Flickr and Twitter, we show that our approximation algorithms achieve content spreads that are as much as 90 times higher compared to existing heuristics for recommending connections.

Game theory and the web

Implementing optimal outcomes in social computing: a game-theoretic approach BIBAFull-Text 539-548
  Arpita Ghosh; Patrick Hummel
In many social computing applications such as online Q&A forums, the best contribution for each task receives some high reward, while all remaining contributions receive an identical, lower reward irrespective of their actual qualities. Suppose a mechanism designer (site owner) wishes to optimize an objective that is some function of the number and qualities of received contributions. When potential contributors are strategic agents, who decide whether to contribute or not to selfishly maximize their own utilities, is such a "best contribution" mechanism, Mb, adequate to implement an outcome that is optimal for the mechanism designer? We first show that in settings where a contribution's value is determined primarily by an agent's expertise, and agents only strategically choose whether to contribute or not, contests can implement optimal outcomes: for any reasonable objective, the rewards for the best and remaining contributions in Mb can always be chosen so that the outcome in the unique symmetric equilibrium of Mb maximizes the mechanism designer's utility. We also show how the mechanism designer can learn these optimal rewards when she does not know the parameters of the agents' utilities, as might be the case in practice. We next consider settings where a contribution's value depends on both the contributor's expertise as well as her effort, and agents endogenously choose how much effort to exert in addition to deciding whether to contribute. Here, we show that optimal outcomes can never be implemented by contests if the system can rank the qualities of contributions perfectly. However, if there is noise in the contributions' rankings, then the mechanism designer can again induce agents to follow strategies that maximize his utility. Thus imperfect rankings can actually help achieve implementability of optimal outcomes when effort is endogenous and influences quality.
Lattice games and the economics of aggregators BIBAFull-Text 549-558
  Patrick R. Jordan; Uri Nadav; Kunal Punera; Andrzej Skrzypacz; George Varghese
We model the strategic decisions of web sites in content markets, where sites may reduce user search cost by aggregating content. Example aggregations include political news, technology, and other niche-topic websites. We model this market scenario as an extensive form game of complete information, where sites choose a set of content to aggregate and users associate with sites that are nearest to their interests. Thus, our scenario is a location game in which sites choose to aggregate content at a certain point in user-preference space, and our choice of distance metric, Jacquard distance, induces a lattice structure on the game. We provide two variants of this scenario: one where users associate with the first site to enter amongst sites of equal distances, and a second where users choose uniformly between sites at equal distances. We show that subgame perfect Nash equilibria exist for both games. While it appears to be computationally hard to compute equilibria in both games, we show a polynomial-time satisficing strategy called Frontier Descent for the first game. A satisficing strategy is not a best response, but ensures that earlier sites will have positive profits, assuming all subsequent sites also have positive profits. By contrast, we show that the second game has no satisficing solution.
Strategic formation of credit networks BIBAFull-Text 559-568
  Pranav Dandekar; Ashish Goel; Michael P. Wellman; Bryce Wiedenbeck
Credit networks are an abstraction for modeling trust between agents in a network. Agents who do not directly trust each other can transact through exchange of IOUs (obligations) along a chain of trust in the network. Credit networks are robust to intrusion, can enable transactions between strangers in exchange economies, and have the liquidity to support a high rate of transactions. We study the formation of such networks when agents strategically decide how much credit to extend each other. When each agent trusts a fixed set of other agents, and transacts directly only with those it trusts, the formation game is a potential game and all Nash equilibria are social optima. Moreover, the Nash equilibria of this game are equivalent in a very strong sense: the sequences of transactions that can be supported from each equilibrium credit network are identical. When we allow transactions over longer paths, the game may not admit a Nash equilibrium, and even when it does, the price of anarchy may be unbounded. Hence, we study two special cases. First, when agents have a shared belief about the trustworthiness of each agent, the networks formed in equilibrium have a star-like structure. Though the price of anarchy is unbounded, myopic best response quickly converges to a social optimum. Similar star-like structures are found in equilibria of heuristic strategies found via simulation. In addition, we simulate a second case where agents may have varying information about each others' trustworthiness based on their distance in a social network. Empirical game analysis of these scenarios suggests that star structures arise only when defaults are relatively rare, and otherwise, credit tends to be issued over short social distances conforming to the locality of information.

Leveraging user actions in search

Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior BIBAFull-Text 569-578
  Qi Guo; Eugene Agichtein
Result clickthrough statistics and dwell time on clicked results have been shown valuable for inferring search result relevance, but the interpretation of these signals can vary substantially for different tasks and users. This paper shows that that post-click searcher behavior, such as cursor movement and scrolling, provides additional clues for better estimating document relevance. To this end, we identify patterns of examination and interaction behavior that correspond to viewing a relevant or non-relevant document, and design a new Post-Click Behavior (PCB) model to capture these patterns. To our knowledge, PCB is the first to successfully incorporate post-click searcher interactions such as cursor movements and scrolling on a landing page for estimating document relevance. We evaluate PCB on a dataset collected from a controlled user study that contains interactions gathered from hundreds of unique queries, result clicks, and page examinations. The experimental results show that PCB is significantly more effective than using page dwell time information alone, both for estimating the explicit judgments of each user, and for re-ranking the results using the estimated relevance.
Joint relevance and freshness learning from clickthroughs for news search BIBAFull-Text 579-588
  Hongning Wang; Anlei Dong; Lihong Li; Yi Chang; Evgeniy Gabrilovich
In contrast to traditional Web search, where topical relevance is often the main selection criterion, news search is characterized by the increased importance of freshness. However, the estimation of relevance and freshness, and especially the relative importance of these two aspects, are highly specific to the query and the time when the query was issued. In this work, we propose a unified framework for modeling the topical relevance and freshness, as well as their relative importance, based on click logs. We use click statistics and content analysis techniques to define a set of temporal features, which predict the right mix of freshness and relevance for a given query. Experimental results on both historical click data and editorial judgments demonstrate the effectiveness of the proposed approach.
Active objects: actions for entity-centric search BIBAFull-Text 589-598
  Thomas Lin; Patrick Pantel; Michael Gamon; Anitha Kannan; Ariel Fuxman
We introduce an entity-centric search experience, called Active Objects, in which entity-bearing queries are paired with actions that can be performed on the entities. For example, given a query for a specific flashlight, we aim to present actions such as reading reviews, watching demo videos, and finding the best price online. In an annotation study conducted over a random sample of user query sessions, we found that a large proportion of queries in query logs involve actions on entities, calling for an automatic approach to identifying relevant actions for entity-bearing queries. In this paper, we pose the problem of finding actions that can be performed on entities as the problem of probabilistic inference in a graphical model that captures how an entity bearing query is generated. We design models of increasing complexity that capture latent factors such as entity type and intended actions that determine how a user writes a query in a search box, and the URL that they click on. Given a large collection of real-world queries and clicks from a commercial search engine, the models are learned efficiently through maximum likelihood estimation using an EM algorithm. Given a new query, probabilistic inference enables recommendation of a set of pertinent actions and hosts. We propose an evaluation methodology for measuring the relevance of our recommended actions, and show empirical evidence of the quality and the diversity of the discovered actions.

Web user behavioral analysis and modeling

Modeling and predicting behavioral dynamics on the web BIBAFull-Text 599-608
  Kira Radinsky; Krysta Svore; Susan Dumais; Jaime Teevan; Alex Bocharov; Eric Horvitz
User behavior on the Web changes over time. For example, the queries that people issue to search engines, and the underlying informational goals behind the queries vary over time. In this paper, we examine how to model and predict this temporal user behavior. We develop a temporal modeling framework adapted from physics and signal processing that can be used to predict time-varying user behavior using smoothing and trends. We also explore other dynamics of Web behaviors, such as the detection of periodicities and surprises. We develop a learning procedure that can be used to construct models of users' activities based on features of current and historical behaviors. The results of experiments indicate that by using our framework to predict user behavior, we can achieve significant improvements in prediction compared to baseline models that weight historical evidence the same for all queries. We also develop a novel learning algorithm that explicitly learns when to apply a given prediction model among a set of such models. Our improved temporal modeling of user behavior can be used to enhance query suggestions, crawling policies, and result ranking.
Are web users really Markovian? BIBAFull-Text 609-618
  Flavio Chierichetti; Ravi Kumar; Prabhakar Raghavan; Tamas Sarlos
User modeling on the Web has rested on the fundamental assumption of Markovian behavior -- a user's next action depends only on her current state, and not the history leading up to the current state. This forms the underpinning of PageRank web ranking, as well as a number of techniques for targeting advertising to users. In this work we examine the validity of this assumption, using data from a number of Web settings. Our main result invokes statistical order estimation tests for Markov chains to establish that Web users are not, in fact, Markovian. We study the extent to which the Markovian assumption is invalid, and derive a number of avenues for further research.
Human wayfinding in information networks BIBAFull-Text 619-628
  Robert West; Jure Leskovec
Navigating information spaces is an essential part of our everyday lives, and in order to design efficient and user-friendly information systems, it is important to understand how humans navigate and find the information they are looking for. We perform a large-scale study of human wayfinding, in which, given a network of links between the concepts of Wikipedia, people play a game of finding a short path from a given start to a given target concept by following hyperlinks. What distinguishes our setup from other studies of human Web-browsing behavior is that in our case people navigate a graph of connections between concepts, and that the exact goal of the navigation is known ahead of time. We study more than 30,000 goal-directed human search paths and identify strategies people use when navigating information spaces. We find that human wayfinding, while mostly very efficient, differs from shortest paths in characteristic ways. Most subjects navigate through high-degree hubs in the early phase, while their search is guided by content features thereafter. We also observe a trade-off between simplicity and efficiency: conceptually simple solutions are more common but tend to be less efficient than more complex ones. Finally, we consider the task of predicting the target a user is trying to reach. We design a model and an efficient learning algorithm. Such predictive models of human wayfinding can be applied in intelligent browsing interfaces.

Ontology representation and querying: RDF and SPARQL

Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard BIBAFull-Text 629-638
  Marcelo Arenas; Sebastián Conca; Jorge Pérez
SPARQL -- the standard query language for querying RDF -- provides only limited navigational functionalities, although these features are of fundamental importance for graph data formats such as RDF. This has led the W3C to include the property path feature in the upcoming version of the standard, SPARQL 1.1.We tested several implementations of SPARQL 1.1 handling property path queries, and we observed that their evaluation methods for this class of queries have a poor performance even in some very simple scenarios. To formally explain this fact, we conduct a theoretical study of the computational complexity of property paths evaluation. Our results imply that the poor performance of the tested implementations is not a problem of these particular systems, but of the specification itself. In fact, we show that any implementation that adheres to the SPARQL 1.1 specification (as of November 2011) is doomed to show the same behavior, the key issue being the need for counting solutions imposed by the current specification. We provide several intractability results, that together with our empirical results, provide strong evidence against the current semantics of SPARQL 1.1 property paths. Finally, we put our results in perspective, and propose a natural alternative semantics with tractable evaluation, that we think may lead to a wide adoption of the language by practitioners, developers and theoreticians.
Template-based question answering over RDF data BIBAFull-Text 639-648
  Christina Unger; Lorenz Bühmann; Jens Lehmann; Axel-Cyrille Ngonga Ngomo; Daniel Gerber; Philipp Cimiano
As an increasing amount of RDF data is published as Linked Data, intuitive ways of accessing this data become more and more important. Question answering approaches have been proposed as a good compromise between intuitiveness and expressivity. Most question answering systems translate questions into triples which are matched against the RDF data to retrieve an answer, typically relying on some similarity metric. However, in many cases, triples do not represent a faithful representation of the semantic structure of the natural language question, with the result that more expressive queries can not be answered. To circumvent this problem, we present a novel approach that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question. This template is then instantiated using statistical entity identification and predicate detection. We show that this approach is competitive and discuss cases of questions that can be answered with our approach but not with competing approaches.
On directly mapping relational databases to RDF and OWL BIBAFull-Text 649-658
  Juan F. Sequeda; Marcelo Arenas; Daniel P. Miranker
Mapping relational databases to RDF is a fundamental problem for the development of the Semantic Web. We present a solution, inspired by draft methods defined by the W3C where relational databases are directly mapped to RDF and OWL. Given a relational database schema and its integrity constraints, this direct mapping produces an OWL ontology, which, provides the basis for generating RDF instances. The semantics of this mapping is defined using Datalog. Two fundamental properties are information preservation and query preservation. We prove that our mapping satisfies both conditions, even for relational databases that contain null values. We also consider two desirable properties: monotonicity and semantics preservation. We prove that our mapping is monotone and also prove that no monotone mapping, including ours, is semantic preserving. We realize that monotonicity is an obstacle for semantic preservation and thus present a non-monotone direct mapping that is semantics preserving.

Security 2

Practical end-to-end web content integrity BIBAFull-Text 659-668
  Kapil Singh; Helen J. Wang; Alexander Moshchuk; Collin Jackson; Wenke Lee
Widespread growth of open wireless hotspots has made it easy to carry out man-in-the-middle attacks and impersonate web sites. Although HTTPS can be used to prevent such attacks, its universal adoption is hindered by its performance cost and its inability to leverage caching at intermediate servers (such as CDN servers and caching proxies) while maintaining end-to-end security. To complement HTTPS, we revive an old idea from SHTTP, a protocol that offers end-to-end web integrity without confidentiality. We name the protocol HTTPi and give it an efficient design that is easy to deploy for today's web. In particular, we tackle several previously-unidentified challenges, such as supporting progressive page loading on the client's browser, handling mixed content, and defining access control policies among HTTP, HTTPi, and HTTPS content from the same domain. Our prototyping and evaluation experience show that HTTPi incurs negligible performance overhead over HTTP, can leverage existing web infrastructure such as CDNs or caching proxies without any modifications to them, and can make many of the mixed-content problems in existing HTTPS web sites easily go away. Based on this experience, we advocate browser and web server vendors to adopt HTTPi.
Online modeling of proactive moderation system for auction fraud detection BIBAFull-Text 669-678
  Liang Zhang; Jie Yang; Belle Tseng
We consider the problem of building online machine-learned models for detecting auction frauds in e-commence web sites. Since the emergence of the world wide web, online shopping and online auction have gained more and more popularity. While people are enjoying the benefits from online trading, criminals are also taking advantages to conduct fraudulent activities against honest parties to obtain illegal profit. Hence proactive fraud-detection moderation systems are commonly applied in practice to detect and prevent such illegal and fraud activities. Machine-learned models, especially those that are learned online, are able to catch frauds more efficiently and quickly than human-tuned rule-based systems. In this paper, we propose an online probit model framework which takes online feature selection, coefficient bounds from human knowledge and multiple instance learning into account simultaneously. By empirical experiments on a real-world online auction fraud detection data we show that this model can potentially detect more frauds and significantly reduce customer complaints compared to several baseline models and the human-tuned rule-based system.
Serf and turf: crowdturfing for fun and profit BIBAFull-Text 679-688
  Gang Wang; Christo Wilson; Xiaohan Zhao; Yibo Zhu; Manish Mohanlal; Haitao Zheng; Ben Y. Zhao
Popular Internet services in recent years have shown that remarkable things can be achieved by harnessing the power of the masses using crowd-sourcing systems. However, crowd-sourcing systems can also pose a real challenge to existing security mechanisms deployed to protect Internet services. Many of these security techniques rely on the assumption that malicious activity is generated automatically by automated programs. Thus they would perform poorly or be easily bypassed when attacks are generated by real users working in a crowd-sourcing system. Through measurements, we have found surprising evidence showing that not only do malicious crowd-sourcing systems exist, but they are rapidly growing in both user base and total revenue. We describe in this paper a significant effort to study and understand these "crowdturfing" systems in today's Internet. We use detailed crawls to extract data about the size and operational structure of these crowdturfing systems. We analyze details of campaigns offered and performed in these sites, and evaluate their end-to-end effectiveness by running active, benign campaigns of our own. Finally, we study and compare the source of workers on crowdturfing sites in different countries. Our results suggest that campaigns on these systems are highly effective at reaching users, and their continuing growth poses a concrete threat to online communities both in the US and elsewhere.

Social interactions and the web

Actions speak as loud as words: predicting relationships from social behavior data BIBAFull-Text 689-698
  Sibel Adali; Fred Sisenda; Malik Magdon-Ismail
In recent years, new studies concentrating on analyzing user personality and finding credible content in social media have become quite popular. Most such work augments features from textual content with features representing the user's social ties and the tie strength. Social ties are crucial in understanding the network the people are a part of. However, textual content is extremely useful in understanding topics discussed and the personality of the individual. We bring a new dimension to this type of analysis with methods to compute the type of ties individuals have and the strength of the ties in each dimension. We present a new genre of behavioral features that are able to capture the "function" of a specific relationship without the help of textual features. Our novel features are based on the statistical properties of communication patterns between individuals such as reciprocity, assortativity, attention and latency. We introduce a new methodology for determining how such features can be compared to textual features, and show, using Twitter data, that our features can be used to capture contextual information present in textual features very accurately. Conversely, we also demonstrate how textual features can be used to determine social attributes related to an individual.
Echoes of power: language effects and power differences in social interaction BIBAFull-Text 699-708
  Cristian Danescu-Niculescu-Mizil; Lillian Lee; Bo Pang; Jon Kleinberg
Understanding social interaction within groups is key to analyzing online communities. Most current work focuses on structural properties: who talks to whom, and how such interactions form larger network structures. The interactions themselves, however, generally take place in the form of natural language -- either spoken or written -- and one could reasonably suppose that signals manifested in language might also provide information about roles, status, and other aspects of the group's dynamics. To date, however, finding domain-independent language-based signals has been a challenge.
   Here, we show that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to. Starting from this observation, we propose an analysis framework based on linguistic coordination that can be used to shed light on power relationships and that works consistently across multiple types of power -- including a more "static" form of power based on status differences, and a more "situational" form of power in which one individual experiences a type of dependence on another. Using this framework, we study how conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U. S. Supreme Court.
Bimodal invitation-navigation fair bets model for authority identification in a social network BIBAFull-Text 709-718
  Suratna Budalakoti; Ron Bekkerman
We consider the problem of identifying the most respected, authoritative members of a large-scale online social network (OSN) by constructing a global ranked list of its members. The problem is distinct from the problem of identifying influencers: we are interested in identifying members who are influential in the real world, even when not necessarily so on the OSN. We focus on two sources for information about user authority: (a) invitations to connect, which are usually sent to people whom the inviter respects, and (b) members' browsing behavior, as profiles of more important people are viewed more often than others'. We construct two directed graphs over the same set of nodes (representing member profiles): the invitation graph and the navigation graph respectively. We show that the standard PageRank algorithm, a baseline in web page ranking, is not effective in people ranking, and develop a social capital based model, called the fair bets model, as a viable solution. We then propose a novel approach, called bimodal fair bets, for combining information from two (or more) endorsement graphs drawn from the same OSN, by simultaneously using the authority scores of nodes in one graph to inform the other, and vice versa, in a mutually reinforcing fashion. We evaluate the ranking results on the LinkedIn social network using this model, where members who have Wikipedia profiles are assumed to be authoritative. Experimental results show that our approach outperforms the baseline approach by a large margin.

Entity and taxonomy extraction

Targeted disambiguation of ad-hoc, homogeneous sets of named entities BIBAFull-Text 719-728
  Chi Wang; Kaushik Chakrabarti; Tao Cheng; Surajit Chaudhuri
In many entity extraction applications, the entities to be recognized are constrained to be from a list of "target entities". In many cases, these target entities are (i) ad-hoc, i.e., do not exist in a knowledge base and (ii) homogeneous (e.g., all the entities are IT companies). We study the following novel disambiguation problem in this unique setting: given the candidate mentions of all the target entities, determine which ones are true mentions of a target entity. Prior techniques only consider target entities present in a knowledge base and/or having a rich set of attributes. In this paper, we develop novel techniques that require no knowledge about the entities except their names. Our main insight is to leverage the homogeneity constraint and disambiguate the candidate mentions collectively across all documents. We propose a graph-based model, called MentionRank, for that purpose. Furthermore, if additional knowledge is available for some or all of the entities, our model can leverage it to further improve quality. Our experiments demonstrate the effectiveness of our model. To the best of our knowledge, this is the first work on targeted entity disambiguation for ad-hoc entities.
Collective context-aware topic models for entity disambiguation BIBAFull-Text 729-738
  Prithviraj Sen
A crucial step in adding structure to unstructured data is to identify references to entities and disambiguate them. Such disambiguated references can help enhance readability and draw similarities across different pieces of running text in an automated fashion. Previous research has tackled this problem by first forming a catalog of entities from a knowledge base, such as Wikipedia, and then using this catalog to disambiguate references in unseen text. However, most of the previously proposed models either do not use all text in the knowledge base, potentially missing out on discriminative features, or do not exploit word-entity proximity to learn high-quality catalogs. In this work, we propose topic models that keep track of the context of every word in the knowledge base; so that words appearing within the same context as an entity are more likely to be associated with that entity. Thus, our topic models utilize all text present in the knowledge base and help learn high-quality catalogs. Our models also learn groups of co-occurring entities thus enabling collective disambiguation. Unlike most previous topic models, our models are non-parametric and do not require the user to specify the exact number of groups present in the knowledge base. In experiments performed on an extract of Wikipedia containing almost 60,000 references, our models outperform SVM-based baselines by as much as 18% in terms of disambiguation accuracy translating to an increment of almost 11,000 correctly disambiguated references.
Document hierarchies from text and links BIBAFull-Text 739-748
  Qirong Ho; Jacob Eisenstein; Eric P. Xing
Hierarchical taxonomies provide a multi-level view of large document collections, allowing users to rapidly drill down to fine-grained distinctions in topics of interest. We show that automatically induced taxonomies can be made more robust by combining text with relational links. The underlying mechanism is a Bayesian generative model in which a latent hierarchical structure explains the observed data -- thus, finding hierarchical groups of documents with similar word distributions and dense network connections. As a nonparametric Bayesian model, our approach does not require pre-specification of the branching factor at each non-terminal, but finds the appropriate level of detail directly from the data. Unlike many prior latent space models of network structure, the complexity of our approach does not grow quadratically in the number of documents, enabling application to networks with more than ten thousand nodes. Experimental results on hypertext and citation network corpora demonstrate the advantages of our hierarchical, multimodal approach.

Leveraging user-generated content

Mining photo-sharing websites to study ecological phenomena BIBAFull-Text 749-758
  Haipeng Zhang; Mohammed Korayem; David J. Crandall; Gretchen LeBuhn
The popularity of social media websites like Flickr and Twitter has created enormous collections of user-generated content online. Latent in these content collections are observations of the world: each photo is a visual snapshot of what the world looked like at a particular point in time and space, for example, while each tweet is a textual expression of the state of a person and his or her environment. Aggregating these observations across millions of social sharing users could lead to new techniques for large-scale monitoring of the state of the world and how it is changing over time. In this paper we step towards that goal, showing that by analyzing the tags and image features of geo-tagged, time-stamped photos we can measure and quantify the occurrence of ecological phenomena including ground snow cover, snow fall and vegetation density. We compare several techniques for dealing with the large degree of noise in the dataset, and show how machine learning can be used to reduce errors caused by misleading tags and ambiguous visual content. We evaluate the accuracy of these techniques by comparing to ground truth data collected both by surface stations and by Earth-observing satellites. Besides the immediate application to ecology, our study gives insight into how to accurately crowd-source other types of information from large, noisy social sharing datasets.
Learning from the past: answering new questions with past answers BIBAFull-Text 759-768
  Anna Shtok; Gideon Dror; Yoelle Maarek; Idan Szpektor
Community-based Question Answering sites, such as Yahoo! Answers or Baidu Zhidao, allow users to get answers to complex, detailed and personal questions from other users. However, since answering a question depends on the ability and willingness of users to address the asker's needs, a significant fraction of the questions remain unanswered. We measured that in Yahoo! Answers, this fraction represents 15% of all incoming English questions. At the same time, we discovered that around 25% of questions in certain categories are recurrent, at least at the question-title level, over a period of one year.
   We attempt to reduce the rate of unanswered questions in Yahoo! Answers by reusing the large repository of past resolved questions, openly available on the site. More specifically, we estimate the probability whether certain new questions can be satisfactorily answered by a best answer from the past, using a statistical model specifically trained for this task. We leverage concepts and methods from query-performance prediction and natural language processing in order to extract a wide range of features for our model. The key challenge here is to achieve a level of quality similar to the one provided by the best human answerers.
   We evaluated our algorithm on offline data extracted from Yahoo! Answers, but more interestingly, also on online data by using three "live" answering robots that automatically provide past answers to new questions when a certain degree of confidence is reached. We report the success rate of these robots in three active Yahoo! Answers categories in terms of both accuracy, coverage and askers' satisfaction. This work presents a first attempt, to the best of our knowledge, of automatic question answering to questions of social nature, by reusing past answers of high quality.
Discovering geographical topics in the twitter stream BIBAFull-Text 769-778
  Liangjie Hong; Amr Ahmed; Siva Gurumurthy; Alexander J. Smola; Kostas Tsioutsiouliklis
Micro-blogging services have become indispensable communication tools for online users for disseminating breaking news, eyewitness accounts, individual expression, and protest groups. Recently, Twitter, along with other online social networking services such as Foursquare, Gowalla, Facebook and Yelp, have started supporting location services in their messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, which is to associate messages with latitudes and longitudes. This functionality allows researchers to address an exciting set of questions: 1) How is information created and shared across geographical locations, 2) How do spatial and linguistic characteristics of people vary across regions, and 3) How to model human mobility. Although many attempts have been made for tackling these problems, previous methods are either complicated to be implemented or oversimplified that cannot yield reasonable performance. It is a challenge task to discover topics and identify users' interests from these geo-tagged messages due to the sheer amount of data and diversity of language variations used on these location sharing services. In this paper we focus on Twitter and present an algorithm by modeling diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user. Furthermore, we take the Markovian nature of a user's location into account. Our model exploits sparse factorial coding of the attributes, thus allowing us to deal with a large and diverse set of covariates efficiently. Our approach is vital for applications such as user profiling, content recommendation and topic tracking. We show high accuracy in location estimation based on our model. Moreover, the algorithm identifies interesting topics based on location and language.

Data and content management 1

Declarative platform for data sourcing games BIBAFull-Text 779-788
  Daniel Deutch; Ohad Greenshpan; Boris Kostenko; Tova Milo
Harnessing a crowd of users for the collection of mass data (data sourcing) has recently become a wide-spread practice. One effective technique is based on games as a tool that attracts the crowd to contribute useful facts. We focus here on the data management layer of such games, and observe that the development of this layer involves challenges such as dealing with probabilistic data, combined with recursive manipulation of this data. These challenges are difficult to address using current declarative data management framework works, and we thus propose here a novel such framework, and demonstrate its usefulness in expressing different aspects in the data management of Trivia-like games. We have implemented a system prototype with our novel data management framework at its core, and we highlight key issues in the system design, as well as our experimentations that indicate the usefulness and scalability of the approach.
Information integration over time in unreliable and uncertain environments BIBAFull-Text 789-798
  Aditya Pal; Vibhor Rastogi; Ashwin Machanavajjhala; Philip Bohannon
Often an interesting true value such as a stock price, sports score, or current temperature is only available via the observations of noisy and potentially conflicting sources. Several techniques have been proposed to reconcile these conflicts by computing a weighted consensus based on source reliabilities, but these techniques focus on static values. When the real-world entity evolves over time, the noisy sources can delay, or even miss, reporting some of the real-world updates. This temporal aspect introduces two key challenges for consensus-based approaches: (i) due to delays, the mapping between a source's noisy observation and the real-world update it observes is unknown, and (ii) missed updates may translate to missing values for the consensus problem, even if the mapping is known. To overcome these challenges, we propose a formal approach that models the history of updates of the real-world entity as a hidden semi-Markovian process (HSMM). The noisy sources are modeled as observations of the hidden state, but the mapping between a hidden state (i.e. real-world update) and the observation (i.e. source value) is unknown. We propose algorithms based on Gibbs Sampling and EM to jointly infer both the history of real-world updates as well as the unknown mapping between them and the source values. We demonstrate using experiments on real-world datasets how our history-based techniques improve upon history-agnostic consensus-based approaches.
SAFE extensibility of data-driven web applications BIBAFull-Text 799-808
  Raphael M. Reischuk; Michael Backes; Johannes Gehrke
This paper presents a novel method for enabling fast development and easy customization of interactive data-intensive web applications. Our approach is based on a high-level hierarchical programming model that results in both a very clean semantics of the application while at the same time creating well-defined interfaces for customization of application components. A prototypical implementation of a conference management system shows the efficacy of our approach.

Web engineering 1

On the analysis of cascading style sheets BIBAFull-Text 809-818
  Pierre Geneves; Nabil Layaida; Vincent Quint
Developing and maintaining cascading style sheets (CSS) is an important issue to web developers as they suffer from the lack of rigorous methods. Most existing means rely on validators that check syntactic rules, and on runtime debuggers that check the behavior of a CSS style sheet on a particular document instance. However, the aim of most style sheets is to be applied to an entire set of documents, usually defined by some schema. To this end, a CSS style sheet is usually written w.r.t. a given schema. While usual debugging tools help reducing the number of bugs, they do not ultimately allow to prove properties over the whole set of documents to which the style sheet is intended to be applied. We propose a novel approach to fill this lack. We introduce ideas borrowed from the fields of logic and compile-time verification for the analysis of CSS style sheets. We present an original tool based on recent advances in tree logics. The tool is capable of statically detecting a wide range of errors (such as empty CSS selectors and semantically equivalent selectors), as well as proving properties related to sets of documents (such as coverage of styling information), in the presence or absence of schema information. This new tool can be used in addition to existing runtime debuggers to ensure a higher level of quality of CSS style sheets.
Extracting client-side web application code BIBAFull-Text 819-828
  Josip Maras; Jan Carlson; Ivica Crnkovi
The web application domain is one of the fastest growing and most wide-spread application domains today. By utilizing fast, modern web browsers and advanced scripting techniques, web developers are developing highly interactive applications that can, in terms of user-experience and responsiveness, compete with standard desktop applications. A web application is composed of two equally important parts: the server-side and the client-side. The client-side acts as a user-interface to the application, and can be viewed as a collection of behaviors. Similar behaviors are often used in a large number of applications, and facilitating their reuse offers considerable benefits. However, due to client-side specifics, such as multi-language implementation and extreme dynamicity, identifying and extracting code responsible for a certain behavior is difficult. In this paper we present a semi-automatic method for extracting client-side web application code implementing a certain behavior. We show how by analyzing the execution of a usage scenario, code responsible for a certain behavior can be identified, how dependencies between different parts of the application can be tracked, and how in the end only the code responsible for a certain behavior can be extracted. Our evaluation shows that the method is capable of extracting stand-alone behaviors, while achieving considerable savings in terms of code size and application performance.
OPAL: automated form understanding for the deep web BIBAFull-Text 829-838
  Tim Furche; Georg Gottlob; Giovanni Grasso; Xiaonan Guo; Giorgio Orsi; Christian Schallhart
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).

Collaboration in social networks

Online team formation in social networks BIBAFull-Text 839-848
  Aris Anagnostopoulos; Luca Becchetti; Carlos Castillo; Aristides Gionis; Stefano Leonardi
We study the problem of online team formation. We consider a setting in which people possess different skills and compatibility among potential team members is modeled by a social network. A sequence of tasks arrives in an online fashion, and each task requires a specific set of skills. The goal is to form a new team upon arrival of each task, so that (i) each team possesses all skills required by the task, (ii) each team has small communication overhead, and (iii) the workload of performing the tasks is balanced among people in the fairest possible way.
   We propose efficient algorithms that address all these requirements: our algorithms form teams that always satisfy the required skills, provide approximation guarantees with respect to team communication overhead, and they are online-competitive with respect to load balancing. Experiments performed on collaboration networks among film actors and scientists, confirm that our algorithms are successful at balancing these conflicting requirements.
   This is the first paper that simultaneously addresses all these aspects. Previous work has either focused on minimizing coordination for a single task or balancing the workload neglecting coordination costs.
Understanding task-driven information flow in collaborative networks BIBAFull-Text 849-858
  Gengxin Miao; Shu Tao; Winnie Cheng; Randy Moulic; Louise E. Moser; David Lo; Xifeng Yan
Collaborative networks are a special type of social network formed by members who collectively achieve specific goals, such as fixing software bugs and resolving customers' problems. In such networks, information flow among members is driven by the tasks assigned to the network, and by the expertise of its members to complete those tasks. In this work, we analyze real-life collaborative networks to understand their common characteristics and how information is routed in these networks. Our study shows that collaborative networks exhibit significantly different properties compared with other complex networks. Collaborative networks have truncated power-law node degree distributions and other organizational constraints. Furthermore, the number of steps along which information is routed follows a truncated power-law distribution. Based on these observations, we developed a network model that can generate synthetic collaborative networks subject to certain structure constraints. Moreover, we developed a routing model that emulates task-driven information routing conducted by human beings in a collaborative network. Together, these two models can be used to study the efficiency of information routing for different types of collaborative networks -- a problem that is important in practice yet difficult to solve without the method proposed in this paper.
New objective functions for social collaborative filtering BIBAFull-Text 859-868
  Joseph Noel; Scott Sanner; Khoi-Nguyen Tran; Peter Christen; Lexing Xie; Edwin V. Bonilla; Ehsan Abbasnejad; Nicolas Della Penna
This paper examines the problem of social collaborative filtering (CF) to recommend items of interest to users in a social network setting. Unlike standard CF algorithms using relatively simple user and item features, recommendation in social networks poses the more complex problem of learning user preferences from a rich and complex set of user profile and interaction information. Many existing social CF methods have extended traditional CF matrix factorization, but have overlooked important aspects germane to the social setting. We propose a unified framework for social CF matrix factorization by introducing novel objective functions for training. Our new objective functions have three key features that address main drawbacks of existing approaches: (a) we fully exploit feature-based user similarity, (b) we permit direct learning of user-to-user information diffusion, and (c) we leverage co-preference (dis)agreement between two users to learn restricted areas of common interest. We evaluate these new social CF objectives, comparing them to each other and to a variety of (social) CF baselines, and analyze user behavior on live user trials in a custom-developed Facebook App involving data collected over five months from over 100 App users and their 37,000+ friends.

Information extraction

Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions BIBAFull-Text 869-878
  Kavita Ganesan; ChengXiang Zhai; Evelyne Viegas
This paper presents a new unsupervised approach to generating ultra-concise summaries of opinions. We formulate the problem of generating such a micropinion summary as an optimization problem, where we seek a set of concise and non-redundant phrases that are readable and represent key opinions in text. We measure representativeness based on a modified mutual information function and model readability with an n-gram language model. We propose some heuristic algorithms to efficiently solve this optimization problem. Evaluation results show that our unsupervised approach outperforms other state of the art summarization methods and the generated summaries are informative and readable.
Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce BIBAFull-Text 879-888
  Ke Zhai; Jordan Boyd-Graber; Nima Asadi; Mohamad L. Alkhouja
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the models possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large scale topic modeling package. Mr. LDA out-performs Mahout both in execution speed and held-out likelihood.

Web mining

Trains of thought: generating information maps BIBAFull-Text 899-908
  Dafna Shahaf; Carlos Guestrin; Eric Horvitz
When information is abundant, it becomes increasingly difficult to fit nuggets of knowledge into a single coherent picture. Complex stories spaghetti into branches, side stories, and intertwining narratives. In order to explore these stories, one needs a map to navigate unfamiliar territory. We propose a methodology for creating structured summaries of information, which we call metro maps. Our proposed algorithm generates a concise structured set of documents maximizing coverage of salient pieces of information. Most importantly, metro maps explicitly show the relations among retrieved pieces in a way that captures story development. We first formalize characteristics of good maps and formulate their construction as an optimization problem. Then we provide efficient methods with theoretical guarantees for generating maps. Finally, we integrate user interaction into our framework, allowing users to alter the maps to better reflect their interests. Pilot user studies with a real-world dataset demonstrate that the method is able to produce maps which help users acquire knowledge efficiently.
Learning causality for news events prediction BIBAFull-Text 909-918
  Kira Radinsky; Sagie Davidovich; Shaul Markovitch
The problem we tackle in this work is, given a present news event, to generate a plausible future event that can be caused by the given event. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precise labeled causality examples, we mine 150 years of news articles, and apply semantic natural language modeling techniques to titles containing certain predefined causality patterns. For generalization, the model uses a vast amount of world knowledge ontologies mined from LinkedData, containing 200 datasets with approximately 20 billion relations. Empirical evaluation on real news articles shows that our Pundit algorithm reaches a human-level performance.
Your two weeks of fame and your grandmother's BIBAFull-Text 919-928
  James Cook; Atish Das Sarma; Alex Fabrikant; Andrew Tomkins
Did celebrity last longer in 1929, 1992 or 2009? We investigate the phenomenon of fame by mining a collection of news articles that spans the twentieth century, and also perform a side study on a collection of blog posts from the last 10 years. By analyzing mentions of personal names, we measure each person's time in the spotlight, and watch the distribution change from a century ago to a year ago. We expected to find a trend of decreasing durations of fame as news cycles accelerated and attention spans became shorter. Instead, we find a remarkable consistency through most of the period we study. Through a century of rapid technological and societal change, through the appearance of Twitter, communication satellites and the Internet, we do not observe a significant change in typical duration of celebrity. We also study the most famous of the famous, and find different results depending on our method for measuring duration of fame. With a method that may be thought of as measuring a spike of attention around a single narrow news story, we see the same result as before: stories last as long now as they did in 1930. A second method, which may be thought of as measuring the duration of public interest in a person, indicates that famous people's presence in the news is becoming longer rather than shorter, an effect most likely driven by the wider distribution and higher volume of media in modern times. Similar studies have been done with much shorter timescales specifically in the context of information spreading on Twitter and similar social networking site. However, to the best of our knowledge, this is the first massive scale study of this nature that spans over a century of archived data, thereby allowing us to track changes across decades.

Data and content management 2

A unified approach to learning task-specific bit vector representations for fast nearest neighbor search BIBAFull-Text 929-938
  Vinod Nair; Dhruv Mahajan; Sundararajan Sellamanickam
Fast nearest neighbor search is necessary for a variety of large scale web applications such as information retrieval, nearest neighbor classification and nearest neighbor regression. Recently a number of machine learning algorithms have been proposed for representing the data to be searched as (short) bit vectors and then using hashing to do rapid search. These algorithms have been limited in their applicability in that they are suited for only one type of task -- e.g. Spectral Hashing learns bit vector representations for retrieval, but not say, classification. In this paper we present a unified approach to learning bit vector representations for many applications that use nearest neighbor search. The main contribution is a single learning algorithm that can be customized to learn a bit vector representation suited for the task at hand. This broadens the usefulness of bit vector representations to tasks beyond just conventional retrieval. We propose a learning-to-rank formulation to learn the bit vector representation of the data. LambdaRank algorithm is used for learning a function that computes a task-specific bit vector from an input data vector. Our approach outperforms state-of-the-art nearest neighbor methods on a number of real world text and image classification and retrieval datasets. It is scalable and learns a 32-bit representation on 1.46 million training cases in two days.
Lightweight automatic face annotation in media pages BIBAFull-Text 939-948
  Dmitri Perelman; Edward Bortnikov; Ronny Lempel; Roman Sandler
Labeling human faces in images contained in Web media stories enables enriching the user experience offered by media sites. We propose a lightweight framework for automatic image annotation that exploits named entities mentioned in the article to significantly boost the accuracy of face recognition. While previous works in the area labor to train comprehensive offline visual models for a pre-defined universe of candidates, our approach models the people mentioned in a given story on the y, using a standard Web image search engine as an image sampling mechanism. We overcome multiple sources of noise introduced by this ad-hoc process, to build a fast and robust end-to-end system from off-the-shelf error-prone text analysis and machine vision components. In experiments conducted on approximately 900 faces depicted in 500 stories from a major celebrity news website, we were able to correctly label 81.5% of the faces while mislabeling 14.8% of them.
Distributed graph pattern matching BIBAFull-Text 949-958
  Shuai Ma; Yang Cao; Jinpeng Huai; Tianyu Wo
Graph simulation has been adopted for pattern matching to reduce the complexity and capture the need of novel applications. With the rapid development of the Web and social networks, data is typically distributed over multiple machines. Hence a natural question raised is how to evaluate graph simulation on distributed data. To our knowledge, no such distributed algorithms are in place yet. This paper settles this question by providing evaluation algorithms and optimizations for graph simulation in a distributed setting. (1) We study the impacts of components and data locality on the evaluation of graph simulation. (2) We give an analysis of a large class of distributed algorithms, captured by a message-passing model, for graph simulation. We also identify three complexity measures: visit times, makespan and data shipment, for analyzing the distributed algorithms, and show that these measures are essentially controversial with each other. (3) We propose distributed algorithms and optimization techniques that exploit the properties of graph simulation and the analyses of distributed algorithms. (4) We experimentally verify the effectiveness and efficiency of these algorithms, using both real-life and synthetic data.

Web engineering 2

Towards network-aware service composition in the cloud BIBAFull-Text 959-968
  Adrian Klein; Fuyuki Ishikawa; Shinichi Honiden
Service-Oriented Computing (SOC) enables the composition of loosely coupled services provided with varying Quality of Service (QoS) levels. Selecting a (near-)optimal set of services for a composition in terms of QoS is crucial when many functionally equivalent services are available. With the advent of Cloud Computing, both the number of such services and their distribution across the network are rising rapidly, increasing the impact of the network on the QoS of such compositions. Despite this, current approaches do not differentiate between the QoS of services themselves and the QoS of the network. Therefore, the computed latency differs substantially from the actual latency, resulting in suboptimal QoS for service compositions in the cloud. Thus, we propose a network-aware approach that handles the QoS of services and the QoS of the network independently. First, we build a network model in order to estimate the network latency between arbitrary services and potential users. Our selection algorithm then leverages this model to find compositions that will result in a low latency given an employed execution policy. In our evaluation, we show that our approach efficiently computes compositions with much lower latency than current approaches.
Towards robust service compositions in the context of functionally diverse services BIBAFull-Text 969-978
  Florian Wagner; Benjamin Kloepper; Fuyuki Ishikawa; Shinichi Honiden
Web service composition provides a means of customized and flexible integration of service functionalities. Quality-of-Service (QoS) optimization algorithms select services in order to adapt workflows to the non-functional requirements of the user. With increasing number of services in a workflow, previous approaches fail to achieve a sufficient reliability. Moreover, expensive ad-hoc replanning is required to deal with service failures. The major problem with such sequential application of planning and replanning is that it ignores the potential costs during the initial planning and they consequently are hidden from the decision maker. Our basic idea to overcome this substantial problem is to compute a QoS optimized selection of service clusters that includes a sufficient number of backup services for each service employed. To support the human decision maker in the service selection task, our approach considers the possible repair costs directly in the initial composition. On the basis of a multi-objective approach and using a suitable service selection interface, the decision maker can select compositions in line with his/her personal risk preferences.
CloudGenius: decision support for web server cloud migration BIBAFull-Text 979-988
  Michael Menzel; Rajiv Ranjan
Cloud computing is the latest computing paradigm that delivers hardware and software resources as virtualized services in which users are free from the burden of worrying about the low-level system administration details. Migrating Web applications to Cloud services and integrating Cloud services into existing computing infrastructures is non-trivial. It leads to new challenges that often require innovation of paradigms and practices at all levels: technical, cultural, legal, regulatory, and social. The key problem in mapping Web applications to virtualized Cloud services is selecting the best and compatible mix of software images (e.g., Web server image) and infrastructure services to ensure that Quality of Service (QoS) targets of an application are achieved. The fact that, when selecting Cloud services, engineers must consider heterogeneous sets of criteria and complex dependencies between infrastructure services and software images, which are impossible to resolve manually, is a critical issue. To overcome these challenges, we present a framework (called CloudGenius) which automates the decision-making process based on a model and factors specifically for Web server migration to the Cloud. CloudGenius leverages a well known multi-criteria decision making technique, called Analytic Hierarchy Process, to automate the selection process based on a model, factors, and QoS parameters related to an application. An example application demonstrates the applicability of the theoretical CloudGenius approach. Moreover, we present an implementation of CloudGenius that has been validated through experiments.

Crowdsourcing

Max algorithms in crowdsourcing environments BIBAFull-Text 989-998
  Petros Venetis; Hector Garcia-Molina; Kerui Huang; Neoklis Polyzotis
Our work investigates the problem of retrieving the maximum item from a set in crowdsourcing environments. We first develop parameterized families of max algorithms, that take as input a set of items and output an item from the set that is believed to be the maximum. Such max algorithms could, for instance, select the best Facebook profile that matches a given person or the best photo that describes a given restaurant. Then, we propose strategies that select appropriate max algorithm parameters. Our framework supports various human error and cost models and we consider many of them for our experiments. We evaluate under many metrics, both analytically and via simulations, the tradeoff between three quantities: (1) quality, (2) monetary cost, and (3) execution time. Also, we provide insights on the effectiveness of the strategies in selecting appropriate max algorithm parameters and guidelines for choosing max algorithms and strategies for each application.
Crowdsourcing with endogenous entry BIBAFull-Text 999-1008
  Arpita Ghosh; Preston McAfee
We investigate the design of mechanisms to incentivize high quality outcomes in crowdsourcing environments with strategic agents, when entry is an endogenous, strategic choice. Modeling endogenous entry in crowdsourcing markets is important because there is a nonzero cost to making a contribution of any quality which can be avoided by not participating, and indeed many sites based on crowdsourced content do not have adequate participation. We use a mechanism with monotone, rank-based, rewards in a model where agents strategically make participation and quality choices to capture a wide variety of crowdsourcing environments, ranging from conventional crowdsourcing contests with monetary rewards such as TopCoder, to crowdsourced content as in online Q&A forums.
   We begin by explicitly constructing the unique mixed-strategy equilibrium for such monotone rank-order mechanisms, and use the participation probability and distribution of qualities from this construction to address the question of designing incentives for two kinds of rewards that arise in the context of crowdsourcing. We first show that for attention rewards that arise in the crowdsourced content setting, the entire equilibrium distribution and therefore every increasing statistic including the maximum and average quality (accounting for participation), improves when the rewards for every rank but the last are as high as possible. In particular, when the cost of producing the lowest possible quality content is low, the optimal mechanism displays all but the poorest contribution. We next investigate how to allocate rewards in settings where there is a fixed total reward that can be arbitrarily distributed amongst participants, as in crowdsourcing contests. Unlike models with exogenous entry, here the expected number of participants can be increased by subsidizing entry, which could potentially improve the expected value of the best contribution. However, we show that subsidizing entry does not improve the expected quality of the best contribution, although it may improve the expected quality of the average contribution. In fact, we show that free entry is dominated by taxing entry -- making all entrants pay a small fee, which is rebated to the winner along with whatever rewards were already assigned, can improve the quality of the best contribution over a winner-take-all contest with no taxes.
Answering search queries with CrowdSearcher BIBAFull-Text 1009-1018
  Alessandro Bozzon; Marco Brambilla; Stefano Ceri
Web users are increasingly relying on social interaction to complete and validate the results of their search activities. While search systems are superior machines to get world-wide information, the opinions collected within friends and expert/local communities can ultimately determine our decisions: human curiosity and creativity is often capable of going much beyond the capabilities of search systems in scouting "interesting" results, or suggesting new, unexpected search directions. Such personalized interaction occurs in most times aside of the search systems and processes, possibly instrumented and mediated by a social network; when such interaction is completed and users resort to the use of search systems, they do it through new queries, loosely related to the previous search or to the social interaction. In this paper we propose CrowdSearcher, a novel search paradigm that embodies crowds as first-class sources for the information seeking process. CrowdSearcher aims at filling the gap between generalized search systems, which operate upon world-wide information -- including facts and recommendations as crawled and indexed by computerized systems -- with social systems, capable of interacting with real people, in real time, to capture their opinions, suggestions, emotions. The technical contribution of this paper is the discussion of a model and architecture for integrating computerized search with human interaction, by showing how search systems can drive and encapsulate social systems. In particular we show how social platforms, such as Facebook, LinkedIn and Twitter, can be used for crowdsourcing search-related tasks; we demonstrate our approach with several prototypes and we report on our experiment upon real user communities.

Social networks

Vertex collocation profiles: subgraph counting for link analysis and prediction BIBAFull-Text 1019-1028
  Ryan N. Lichtenwalter; Nitesh V. Chawla
We introduce the concept of a vertex collocation profile (VCP) for the purpose of topological link analysis and prediction. VCPs provide nearly complete information about the surrounding local structure of embedded vertex pairs. The VCP approach offers a new tool for domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other context. The same resolution that gives VCP its analytical power also enables it to perform well when used in supervised models to discriminate potential new links. We first develop the theory, mathematics, and algorithms underlying VCPs. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across several different network families. We conclude with timing results that introduce the comparative performance of several existing algorithms and the practicability of VCP computations on large networks.
Framework and algorithms for network bucket testing BIBAFull-Text 1029-1036
  Liran Katzir; Edo Liberty; Oren Somekh
Bucket testing, also known as split testing, A/B testing, or 0/1 testing, is a widely used method for evaluating users' satisfaction with new features, products, or services. In order not to expose the whole user base to the new service, the mean user satisfaction rate is estimated by exposing the service only to a few uniformly chosen random users. In a recent work, Backstrom and Kleinberg, defined the notion of network bucket testing for social services. In this context, users' interactions are only valid for measurement if some minimal number of their friends are also given the service. The goal is to estimate the mean user satisfaction rate while providing the service to the least number of users. This constraint makes uniform sampling, which is optimal for the traditional case, grossly inefficient. In this paper we introduce a simple general framework for designing and evaluating sampling techniques for network bucket testing. The framework is constructed in a way that sampling algorithms are only required to generate sets of users to which the service should be provided. Given an algorithm, the framework produces an unbiased user satisfaction rate estimator and a corresponding variance bound for any network and any user satisfaction function. Furthermore, we present several simple sampling algorithms that are evaluated using both synthetic and real social networks. Our experiments corroborate the theoretical results and demonstrate the effectiveness of the proposed framework and algorithms.
Winner takes all: competing viruses or ideas on fair-play networks BIBAFull-Text 1037-1046
  B. Aditya Prakash; Alex Beutel; Roni Rosenfeld; Christos Faloutsos
Given two competing products (or memes, or viruses etc.) spreading over a given network, can we predict what will happen at the end, that is, which product will 'win', in terms of highest market share? One may naively expect that the better product (stronger virus) will just have a larger footprint, proportional to the quality ratio of the products (or strength ratio of the viruses). However, we prove the surprising result that, under realistic conditions, for any graph topology, the stronger virus completely wipes-out the weaker one, thus not merely 'winning' but 'taking it all'. In addition to the proofs, we also demonstrate our result with simulations over diverse, real graph topologies, including the social-contact graph of the city of Portland OR (about 31 million edges and 1 million nodes) and internet AS router graphs. Finally, we also provide real data about competing products from Google-Insights, like Facebook-Myspace, and we show again that they agree with our analysis.

User interfaces and human factors

A dual-mode user interface for accessing 3D content on the world wide web BIBAFull-Text 1047-1056
  Jacek Jankowski; Stefan Decker
The Web evolved from a text-based system to the current rich and interactive medium that supports images, 2D graphics, audio and video. The major media type that is still missing is 3D graphics. Although various approaches have been proposed (most notably VRML/X3D), they have not been widely adopted. One reason for the limited acceptance is the lack of 3D interaction techniques that are optimal for the hypertext-based Web interface. We present a novel strategy for accessing integrated information spaces, where hypertext and 3D graphics data are simultaneously available and linked. We introduce a user interface that has two modes between which a user can switch anytime: the driven by simple hypertext-based interactions "don't-make-me-think" mode, where a 3D scene is embedded in hypertext and the more immersive 3D "take-me-to-the-Wonderland" mode, which immerses the hypertextual annotations into the 3D scene. A user study is presented, which characterizes the user interface in terms of its efficiency and usability.
Exploiting single-user web applications for shared editing: a generic transformation approach BIBAFull-Text 1057-1066
  Matthias Heinrich; Franz Lehmann; Thomas Springer; Martin Gaedke
In the light of the Web 2.0 movement, web-based collaboration tools such as Google Docs have become mainstream and in the meantime serve millions of users. Apart from established collaborative web applications, numerous web editors lack multi-user support even though they are suitable for collaborative work. Enhancing these single-user editors with shared editing capabilities is a costly endeavor since the implementation of a collaboration infrastructure (accommodating conflict resolution, document synchronization, etc.) is required. In this paper, we present a generic transformation approach capable of converting single-user web editors into multi-user editors. Since our approach only requires the configuration of a generic collaboration infrastructure (GCI), the effort to inject shared editing support is significantly reduced in contrast to conventional implementation approaches neglecting reuse. We also report on experimental results of a user study showing that converted editors meet user requirements with respect to software and collaboration qualities. Moreover, we define the characteristics that editors must adhere to in order to leverage the GCI.
"It's simply integral to what I do": enquiries into how the web is weaved into everyday life BIBAFull-Text 1067-1076
  Siân E. Lindley; Sam Meek; Abigail Sellen; Richard Harper
This paper presents findings from a field study of 24 individuals who kept diaries of their web use, across device and location, for a period of four days. Our focus was on how the web was used for non-work purposes, with a view to understanding how this is intertwined with everyday life. While our initial aim was to update existing frameworks of 'web activities', such as those described by Sellen et al. [25] and Kellar et al. [14], our data lead us to suggest that the notion of 'web activity' is only partially useful for an analytic understanding of what it is that people do when they go online. Instead, our analysis leads us to present five modes of web use, which can be used to frame and enrich interpretations of 'activity'. These are respite, orienting, opportunistic use, purposeful use and lean-back internet. We then consider two properties of the web that enable it to be tailored to these different modes, persistence and temporality, and close by suggesting ways of drawing upon these qualities in order to inform design.