HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 020304-104-205-105-2060708091011-111-212-112-213-113-214-114-215-115-2

Proceedings of the 2010 International Conference on the World Wide Web

Fullname:Proceedings of the 19th International Conference on World Wide Web
Editors:Michael Rappa; Paul Jones; Juliana Freire; Soumen Chakrabarti
Location:Raleigh, North Carolina, USA
Dates:2010-Apr-26 to 2010-Apr-30
Standard No:ISBN 978-1-60558-799-8; ACM DL: Table of Contents hcibib: WWW10
Links:Conference Home Page
  1. WWW 2010-04-26 Volume 1
    1. Full papers
    2. WWW posters
    3. WWW 2010 demos
    4. Panels
    5. Tutorials
    6. Developers' track

WWW 2010-04-26 Volume 1

Full papers

Towards rich query interpretation: walking back and forth for mining query templates BIBAKFull-Text 1-10
  Ganesh Agarwal; Govind Kabra; Kevin Chen-Chuan Chang
We propose to mine structured query templates from search logs, for enabling rich query interpretation that recognizes both query intents and associated attributes. We formalize the notion of template as a sequence of keywords and domain attributes, and our objective is to discover templates with high precision and recall for matching queries in a domain of interest. Our solution bootstraps from small seed input knowledge to discover relevant query templates, by harnessing the wealth of information available in search logs. We model this information in a tri-partite QueST network of queries, sites, and templates. We propose a probabilistic inferencing framework based on the dual metrics of precision and recall -- and we show that the dual inferencing correspond respectively to the random walks in backward and forward directions. We deployed and tested our algorithm over a real-world search log of 15 million queries. The algorithm achieved accuracy of as high as 90% (on F-measure), with little seed knowledge and even with incomplete domain schema.
Keywords: query attributes, query intents, query templates, search log mining
Selecting skyline services for QoS-based web service composition BIBAKFull-Text 11-20
  Mohammad Alrifai; Dimitrios Skoutas; Thomas Risse
Web service composition enables seamless and dynamic integration of business applications on the web. The performance of the composed application is determined by the performance of the involved web services. Therefore, non-functional, quality of service aspects are crucial for selecting the web services to take part in the composition. Identifying the best candidate web services from a set of functionally-equivalent services is a multi-criteria decision making problem. The selected services should optimize the overall QoS of the composed application, while satisfying all the constraints specified by the client on individual QoS parameters. In this paper, we propose an approach based on the notion of skyline to effectively and efficiently select services for composition, reducing the number of candidate services to be considered. We also discuss how a provider can improve its service to become more competitive and increase its potential of being included in composite applications. We evaluate our approach experimentally using both real and synthetically generated datasets.
Keywords: optimization, QoS, service composition, web services
Money, glory and cheap talk: analyzing strategic behavior of contestants in simultaneous crowdsourcing contests on TopCoder.com BIBAKFull-Text 21-30
  Nikolay Archak
Crowdsourcing is a new Web phenomenon, in which a firm takes a function once performed in-house and outsources it to a crowd, usually in the form of an open contest.
   Designing efficient crowdsourcing mechanisms is not possible without deep understanding of incentives and strategic choices of all participants.
   This paper presents an empirical analysis of determinants of individual performance in multiple simultaneous crowdsourcing contests using a unique dataset for the world's largest competitive software development portal: TopCoder.com. Special attention is given to studying the effects of the reputation system currently used by TopCoder.com on behavior of contestants. We find that individual specific traits together with the project payment and the number of project requirements are significant predictors of the final project quality. Furthermore, we find significant evidence of strategic behavior of contestants. High rated contestants face tougher competition from their opponents in the competition phase of the contest. In order to soften the competition, they move first in the registration phase of the contest, signing up early for particular projects. Although registration in TopCoder contests is non-binding, it deters entry of opponents in the same contest; our lower bound estimate shows that this strategy generates significant surplus gain to high rated contestants. We conjecture that the reputation + cheap talk mechanism employed by TopCoder has a positive effect on allocative efficiency of simultaneous all-pay contests and should be considered for adoption in other crowdsourcing platforms.
Keywords: all-pay auction, cheap talk, contest, crowdsourcing, electronic markets, entry deterrence, reputation
Mining advertiser-specific user behavior using adfactors BIBAKFull-Text 31-40
  Nikolay Archak; Vahab S. Mirrokni; S. Muthukrishnan
Consider an online ad campaign run by an advertiser. The ad serving companies that handle such campaigns record users' behavior that leads to impressions of campaign ads, as well as users' responses to such impressions. This is summarized and reported to the advertisers to help them evaluate the performance of their campaigns and make better budget allocation decisions.
   The most popular reporting statistics are the click-through rate and the conversion rate. While these are indicative of the effectiveness of an ad campaign, the advertisers often seek to understand more sophisticated long-term effects of their ads on the brand awareness and the user behavior that leads to the conversion, thus creating a need for the reporting measures that can capture both the duration and the frequency of the pathways to user conversions.
   In this paper, we propose an alternative data mining framework for analyzing user-level advertising data. In the aggregation step, we compress individual user histories into a graph structure, called the adgraph, representing local correlations between ad events. For the reporting step, we introduce several scoring rules, called the adfactors (AF), that can capture global role of ads and ad paths in the adgraph, in particular, the structural correlation between an ad impression and the user conversion. We present scalable local algorithms for computing the adfactors; all algorithms were implemented using the MapReduce programming model and the Pregel framework.
   Using an anonymous user-level dataset of sponsored search campaigns for eight different advertisers, we evaluate our framework with different adgraphs and adfactors in terms of their statistical fit to the data, and show its value for mining the long-term behavioral patterns in the advertising data.
Keywords: ad auctions, clickthrough rate, conversion rate, online advertising, pagerank, sponsored search, user behavior models
Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data BIBAKFull-Text 41-50
  Medha Atre; Vineet Chaoji; Mohammed J. Zaki; James A. Hendler
The Semantic Web community, until now, has used traditional database systems for the storage and querying of RDF data. The SPARQL query language also closely follows SQL syntax. As a natural consequence, most of the SPARQL query processing techniques are based on database query processing and optimization techniques. For SPARQL join query optimization, previous works like RDF-3X and Hexastore have proposed to use 6-way indexes on the RDF data. Although these indexes speed up merge-joins by orders of magnitude, for complex join queries generating large intermediate join results, the scalability of the query processor still remains a challenge.
   In this paper, we introduce (i) BitMat -- a compressed bit-matrix structure for storing huge RDF graphs, and (ii) a novel, light-weight SPARQL join query processing method that employs an initial pruning technique, followed by a variable-binding-matching algorithm on BitMats to produce the final results. Our query processing method does not build intermediate join tables and works directly on the compressed data. We have demonstrated our method against RDF graphs of up to 1.33 billion triples -- the largest among results published until now (single-node, non-parallel systems), and have compared our method with the state-of-the-art RDF stores -- RDF-3X and MonetDB. Our results show that the competing methods are most effective with highly selective queries. On the other hand, BitMat can deliver 2-3 orders of magnitude better performance on complex, low-selectivity queries over massive data.
Keywords: data compression, query algorithm, RDF store
A comparison of visual and textual page previews in judging the helpfulness of web pages BIBAKFull-Text 51-60
  Anne Aula; Rehan M. Khan; Zhiwei Guan; Paul Fontes; Peter Hong
We investigated the efficacy of visual and textual web page previews in predicting the helpfulness of web pages related to a specific topic. We ran two studies in the usability lab and collected data through an online survey. Participants (total of 245) were asked to rate the expected helpfulness of a web page based on a preview (four different thumbnail variations: a textual web page summary, a thumbnail/title/URL combination, a title/URL combination). In the lab studies, the same participants also rated the helpfulness of the actual web pages themselves. In the online study, the web page ratings were collected from a separate group of participants. Our results show that thumbnails add information about the relevance of web pages that is not available in the textual summaries of web pages (title, snippet & URL). However, showing only thumbnails, with no textual information, results in poorer performance than showing only textual summaries. The prediction inaccuracy caused by textual vs. visual previews was different: textual previews tended to make users overestimate the helpfulness of web pages, whereas thumbnails made users underestimate the helpfulness of web pages in most cases. In our study, the best performance was obtained by combining sufficiently large thumbnails (at least 200x200 pixels) with page titles and URLs -- and it was better to make users focus primarily on the thumbnail by placing the title and URL below the thumbnail. Our studies highlighted four key aspects that affect the performance of previews: the visual/textual mode of the previews, the zoom level and size of the thumbnail, as well as the positioning of key information elements.
Keywords: snippets, thumbnails, user study, web page previews
Find me if you can: improving geographical prediction with social and spatial proximity BIBAKFull-Text 61-70
  Lars Backstrom; Eric Sun; Cameron Marlow
Geography and social relationships are inextricably intertwined; the people we interact with on a daily basis almost always live near us. As people spend more time online, data regarding these two dimensions -- geography and social relationships -- are becoming increasingly precise, allowing us to build reliable models to describe their interaction. These models have important implications in the design of location-based services, security intrusion detection, and social media supporting local communities.
   Using user-supplied address data and the network of associations between members of the Facebook social network, we can directly observe and measure the relationship between geography and friendship. Using these measurements, we introduce an algorithm that predicts the location of an individual from a sparse set of located users with performance that exceeds IP-based geolocation. This algorithm is efficient and scalable, and could be run on a network containing hundreds of millions of users.
Keywords: geolocation, propagation, social networks
AdHeat: an influence-based diffusion model for propagating hints to match ads BIBAKFull-Text 71-80
  Hongji Bao; Edward Y. Chang
In this paper, we present AdHeat, a social ad model considering user influence in addition to relevance for matching ads. Traditionally, ad placement employs the relevance model. Such a model matches ads with Web page content, user interests, or both. We have observed, however, on social networks that the relevance model suffers from two shortcomings. First, influential users (users who contribute opinions) seldom click ads that are highly relevant to their expertise. Second, because influential users' contents and activities are attractive to other users, hint words summarizing their expertise and activities may be widely preferred. Therefore, we propose AdHeat, which diffuses hint words of influential users to others and then matches ads for each user with aggregated hints. We performed experiments on a large online Q&A community with half a million users. The experimental results show that AdHeat outperforms the relevance model on CTR (click through rate) by significant margins.
Keywords: behavior targeting, contextual advertising, heat diffusion, influence propagation, online advertising, social network analysis
Dynamic and graphical web page breakpoints BIBAKFull-Text 81-90
  John J. Barton; Jan Odvarko
Breakpoints are perhaps the quintessential feature of a de-bugger: they allow a developer to stop time and study the program state. Breakpoints are typically specified by selecting a line of source code. For large, complex, web pages with multiple developers, the relevant source line for a given user interface problem may not be known to the developer. In this paper we describe the implementation of breakpoints in dynamically created source, and on error messages, network events, DOMmutation, DOMobject property changes, and CSS style rule updates. Adding these domain-specific breakpoints to a general-purpose debugger for Javascript allows the developer to initiate the debugging process via Web page abstractions rather than lower level source code views. The breakpoints are implemented in the open source Fire-bug project, version 1.5, for the Firefox Web browser.
Keywords: breakpoints, css, debugging, dynamic, firebug, html, javascript, web
Regular expressions considered harmful in client-side XSS filters BIBAKFull-Text 91-100
  Daniel Bates; Adam Barth; Collin Jackson
Cross-site scripting flaws have now surpassed buffer overflows as the world's most common publicly-reported security vulnerability. In recent years, browser vendors and researchers have tried to develop client-side filters to mitigate these attacks. We analyze the best existing filters and find them to be either unacceptably slow or easily circumvented. Worse, some of these filters could introduce vulnerabilities into sites that were previously bug-free. We propose a new filter design that achieves both high performance and high precision by blocking scripts after HTML parsing but before execution. Compared to previous approaches, our approach is faster, protects against more vulnerabilities, and is harder for attackers to abuse. We have contributed an implementation of our filter design to the WebKit open source rendering engine, and the filter is now enabled by default in the Google Chrome browser.
Keywords: XSS, browser, cross-site scripting, filter, web
The anatomy of an ad: structured indexing and retrieval for sponsored search BIBAKFull-Text 101-110
  Michael Bendersky; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler
The core task of sponsored search is to retrieve relevant ads for the user's query. Ads can be retrieved either by exact match, when their bid term is identical to the query, or by advanced match, which indexes ads as documents and is similar to standard information retrieval (IR). Recently, there has been a great deal of research into developing advanced match ranking algorithms. However, no previous research has addressed the ad indexing problem. Unlike most traditional search problems, the ad corpus is defined hierarchically in terms of advertiser accounts, campaigns, and ad groups, which further consist of creatives and bid terms. This hierarchical structure makes indexing highly non-trivial, as naively indexing all possible "displayable" ads leads to a prohibitively large and ineffective index. We show that ad retrieval using such an index is not only slow, but its precision is suboptimal as well. We investigate various strategies for compact, hierarchy-aware indexing of sponsored search ads through adaptation of standard IR indexing techniques. We also propose a new ad retrieval method that yields more relevant ads by exploiting the structured nature of the ad corpus. Experiments carried out over a large ad test collection from a commercial search engine show that our proposed methods are highly effective and efficient compared to more standard indexing and retrieval approaches.
Keywords: sponsored search, structured retrieval
Classification-enhanced ranking BIBAKFull-Text 111-120
  Paul N. Bennett; Krysta Svore; Susan T. Dumais
Many have speculated that classifying web pages can improve a search engine's ranking of results. Intuitively results should be more relevant when they match the class of a query. We present a simple framework for classification-enhanced ranking that uses clicks in combination with the classification of web pages to derive a class distribution for the query. We then go on to define a variety of features that capture the match between the class distributions of a web page and a query, the ambiguity of a query, and the coverage of a retrieved result relative to a query's set of classes. Experimental results demonstrate that a ranker learned with these features significantly improves ranking over a competitive baseline. Furthermore, our methodology is agnostic with respect to the classification space and can be used to derive query classes for a variety of different taxonomies.
Keywords: learning to rank, query classification
Sync kit: a persistent client-side database caching toolkit for data intensive websites BIBAKFull-Text 121-130
  Edward Benson; Adam Marcus; David Karger; Samuel Madden
We introduce a client-server toolkit called Sync Kit that demonstrates how client-side database storage can improve the performance of data intensive websites. Sync Kit is designed to make use of the embedded relational database defined in the upcoming HTML5 standard to offload some data storage and processing from a web server onto the web browsers to which it serves content. Our toolkit provides various strategies for synchronizing relational database tables between the browser and the web server, along with a client-side template library so that portions web applications may be executed client-side. Unlike prior work in this area, Sync Kit persists both templates and data in the browser across web sessions, increasing the number of concurrent connections a server can handle by up to a factor of four versus that of a traditional server-only web stack and a factor of three versus a recent template caching approach.
Keywords: cache, client-side, data base, database, performance, web
Ranking specialization for web search: a divide-and-conquer approach by using topical RankSVM BIBAKFull-Text 131-140
  Jiang Bian; Xin Li; Fan Li; Zhaohui Zheng; Hongyuan Zha
Many ranking algorithms applying machine learning techniques have been proposed in informational retrieval and Web search. However, most of existing approaches do not explicitly take into account the fact that queries vary significantly in terms of ranking and entail different treatments regarding the ranking models. In this paper, we apply a divide-and-conquer framework for ranking specialization, i.e. learning multiple ranking models by addressing query difference. We first generate query representation by aggregating ranking features through pseudo feedbacks, and employ unsupervised clustering methods to identify a set of ranking-sensitive query topics based on training queries. To learn multiple ranking models for respective ranking-sensitive query topics, we define a global loss function by combining the ranking risks of all query topics, and we propose a unified SVM-based learning process to minimize the global loss. Moreover, we employ an ensemble approach to generate the ranking result for each test query by applying a set of ranking models of the most appropriate query topics. We conduct experiments using a benchmark dataset for learning ranking functions as well as a dataset from a commercial search engine. Experimental results show that our proposed approach can significantly improve the ranking performance over existing single-model approaches as well as straightforward local ranking approaches, and the automatically identified ranking-sensitive topics are more useful for enhancing ranking performance than pre-defined query categorization.
Keywords: ranking specialization for web search, ranking-sensitive query topic, topical RankSVM
Automated performance assessment for service-oriented middleware: a case study on BPEL engines BIBAKFull-Text 141-150
  Domenico Bianculli; Walter Binder; Mauro Luigi Drago
Middleware for Web service compositions, such as BPEL engines, provides the execution environment for services as well as additional functionalities, such as monitoring and self-tuning. Given its role in service provisioning, it is very important to assess the performance of middleware in the context of a SOA. This paper presents SOABench, a framework for the automatic generation and execution of testbeds for benchmarking middleware for composite Web services and for assessing the performance of existing SOA infrastructures. SOABench defines a testbed model characterized by the composite services to execute, the workload to generate, the deployment configuration to use, the performance metrics to gather, the data analyses to perform on them, and the reports to produce. We have validated SOABench by benchmarking the performance of different BPEL engines.
Keywords: experiment automation, middleware, performance assessment, testbed generation, web service composition
Relational duality: unsupervised extraction of semantic relations between entities on the web BIBAKFull-Text 151-160
  Danushka Tarupathi Bollegala; Yutaka Matsuo; Mitsuru Ishizuka
Extracting semantic relations among entities is an important first step in various tasks in Web mining and natural language processing such as information extraction, relation detection, and social network mining. A relation can be expressed extensionally by stating all the instances of that relation or intensionally by defining all the paraphrases of that relation. For example, consider the ACQUISITION relation between two companies. An extensional definition of ACQUISITION contains all pairs of companies in which one company is acquired by another (e.g. (YouTube, Google) or (Powerset, Microsoft)). On the other hand we can intensionally define ACQUISITION as the relation described by lexical patterns such as X is acquired by Y, or Y purchased X, where X and Y denote two companies. We use this dual representation of semantic relations to propose a novel sequential co-clustering algorithm that can extract numerous relations efficiently from unlabeled data. We provide an efficient heuristic to find the parameters of the proposed coclustering algorithm. Using the clusters produced by the algorithm, we train an L1 regularized logistic regression model to identify the representative patterns that describe the relation expressed by each cluster. We evaluate the proposed method in three different tasks: measuring relational similarity between entity pairs, open information extraction (Open IE), and classifying relations in a social network system. Experiments conducted using a benchmark dataset show that the proposed method improves existing relational similarity measures. Moreover, the proposed method significantly outperforms the current state-of-the-art Open IE systems in terms of both precision and recall. The proposed method correctly classifies 53 relation types in an online social network containing 470; 671 nodes and 35; 652; 475 edges, thereby demonstrating its efficacy in real-world relation detection tasks.
Keywords: relation extraction, relational duality, relational similarity, web mining
Liquid query: multi-domain exploratory search on the web BIBAKFull-Text 161-170
  Alessandro Bozzon; Marco Brambilla; Stefano Ceri; Piero Fraternali
In this paper we propose the Liquid Query paradigm, to support users in finding responses to multi-domain queries through exploratory information seeking across structured information sources (Web documents, deep Web data, and personal data repositories), wrapped by means of a uniform notion of search service. Liquid Query aims at filling the gap between general-purpose search engines, which are unable to find information spanning multiple topics, and domain-specific search systems, which cannot go beyond their domain limits. The Liquid Query interface consists of interaction primitives that let users pose questions and explore results spanning over multiple sources incrementally, thus getting closer and closer to the sought information. We demonstrate our approach with a prototype built upon the YQL (Yahoo! Query Language) framework.
Keywords: exploratory search, federated search engine, multi-domain search, search computing, search engine, search service, web
Graph-based concept identification and disambiguation for enterprise search BIBAKFull-Text 171-180
  Falk Brauer; Michael Huber; Gregor Hackenbroich; Ulf Leser; Felix Naumann; Wojciech M. Barczynski
Enterprise Search (ES) is different from traditional IR due to a number of reasons, among which the high level of ambiguity of terms in queries and documents and existence of graph-structured enterprise data (ontologies) that describe the concepts of interest and their relationships to each other, are the most important ones.
   Our method identifies concepts from the enterprise ontology in the query and corpus. We propose a ranking scheme for ontology sub-graphs on top of approximately matched token q-grams. The ranking leverages the graph-structure of the ontology to incorporate not explicitly mentioned concepts. It improves previous solutions by using a fine-grained ranking function that is specifically designed to cope with high levels of ambiguity. This method is able to capture much more of the semantics of queries and documents than previous techniques. We prove this claim by an evaluation of our method in three real-life scenarios from two different domains, and found it to consistently be superior both in terms of precision and recall.
Keywords: enterprise search, graph-based disambiguation, information extraction
A refreshing perspective of search engine caching BIBAKFull-Text 181-190
  Berkant Barla Cambazoglu; Flavio P. Junqueira; Vassilis Plachouras; Scott Banachowski; Baoqiu Cui; Swee Lim; Bill Bridge
Commercial Web search engines have to process user queries over huge Web indexes under tight latency constraints. In practice, to achieve low latency, large result caches are employed and a portion of the query traffic is served using previously computed results. Moreover, search engines need to update their indexes frequently to incorporate changes to the Web. After every index update, however, the content of cache entries may become stale, thus decreasing the freshness of served results. In this work, we first argue that the real problem in today's caching for large-scale search engines is not eviction policies, but the ability to cope with changes to the index, i.e., cache freshness. We then introduce a novel algorithm that uses a time-to-live value to set cache entries to expire and selectively refreshes cached results by issuing refresh queries to back-end search clusters. The algorithm prioritizes the entries to refresh according to a heuristic that combines the frequency of access with the age of an entry in the cache. In addition, for setting the rate at which refresh queries are issued, we present a mechanism that takes into account idle cycles of back-end servers. Evaluation using a real workload shows that our algorithm can achieve hit rate improvements as well as reduction in average hit ages. An implementation of this algorithm is currently in production use at Yahoo!.
Keywords: caching, index updates, refresh, web search engines
Automated object persistence for JavaScript BIBAKFull-Text 191-200
  Brett Cannon; Eric Wohlstadter
Traditionally web applications have required an internet connection in order to work with data. Browsers have lacked any mechanisms to allow web applications to operate offline with a set of data to provide constant access to applications. Recently, through browser plug-ins such as Google Gears, browsers have gained the ability to persist data for offline use. However, until now it's been difficult for a web developer using these plug-ins to manage persisting data both locally for offline use and in the internet cloud due to: synchronization requirements, managing throughput and latency to the cloud, and making it work within the confines of a standards-compliant web browser. Historically in non-browser environments, programming language environments have offered automated object persistence to shield the developer from these complexities. In our research we have created a framework which introduces automated persistence of data objects for JavaScript utilizing the internet. Unlike traditional object persistence solutions, ours relies only on existing or forthcoming internet standards and does not rely upon specific runtime mechanisms such as OS or interpreter/compiler support. A new design was required in order to be suitable to the internet's unique characteristics of varying connection quality and a browser's specific restrictions. We validate our approach using benchmarks which show that our framework can handle thousands of data objects automatically, reducing the amount of work needed by developers to support offline Web applications.
Keywords: JavaScript, html5, json, object persistence, web storage
A generalized framework of exploring category information for question retrieval in community question answer archives BIBAKFull-Text 201-210
  Xin Cao; Gao Cong; Bin Cui; Christian S. Jensen
Community Question Answering (CQA) has emerged as a popular type of service where users ask and answer questions and access historical question-answer pairs. CQA archives contain very large volumes of questions organized into a hierarchy of categories. As an essential function of CQA services, question retrieval in a CQA archive aims to retrieve historical question-answer pairs that are relevant to a query question. In this paper, we present a new approach to exploiting category information of questions for improving the performance of question retrieval, and we apply the approach to existing question retrieval models, including a state-of-the-art question retrieval model. Experiments conducted on real CQA data demonstrate that the proposed techniques are capable of outperforming a variety of baseline methods significantly.
Keywords: categorization, question retrieval, question-answering services
The paths more taken: matching DOM trees to search logs for accurate webpage clustering BIBAKFull-Text 211-220
  Deepayan Chakrabarti; Rupesh Mehta
An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are "important," such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance.
   We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of the entire website, whereby search data for popular pages can be used to infer good paths even for other pages that receive little or no search traffic. The webpages on the website are then clustered using these "key" paths. Our algorithm only requires information on search queries, and the webpages clicked in response to them; there is no need for human input, and it does not need to be told which portion of a webpage the user found interesting. The resulting clusterings achieve an adjusted RAND score of over 0.9 on half of the websites (a score of 1 indicating a perfect clustering), and 59% better scores on average than competing algorithms. Besides leading to refined clusterings, these key paths can be useful in the wrapper induction process itself, as shown by the high degree of match between the key paths and the manually identified paths used in existing wrappers for these sites (90% average precision).
Keywords: clustering, search logs, wrapper induction
Actively predicting diverse search intent from user browsing behaviors BIBAKFull-Text 221-230
  Zhicong Cheng; Bin Gao; Tie-Yan Liu
This paper is concerned with actively predicting search intent from user browsing behavior data. In recent years, great attention has been paid to predicting user search intent. However, the prediction was mostly passive because it was performed only after users submitted their queries to search engines. It is not considered why users issued these queries, and what triggered their information needs. According to our study, many information needs of users were actually triggered by what they have browsed. That is, after reading a page, if a user found something interesting or unclear, he/she might have the intent to obtain further information and accordingly formulate a search query. Actively predicting such search intent can benefit both search engines and their users. In this paper, we propose a series of technologies to fulfill this task. First, we extract all the queries that users issued after reading a given page from user browsing behavior data. Second, we learn a model to effectively rank these queries according to their likelihoods of being triggered by the page. Third, since search intents can be quite diverse even if triggered by the same page, we propose an optimization algorithm to diversify the ranked list of queries obtained in the second step, and then suggest the list to users. We have tested our approach on large-scale user browsing behavior data obtained from a commercial search engine. The experimental results have shown that our approach can predict meaningful queries for a given page, and the search performance for these queries can be significantly improved by using the triggering page as contextual information.
Keywords: contextual information, diversification, log mining, search intent, search-trigger
Max-cover in map-reduce BIBAKFull-Text 231-240
  Flavio Chierichetti; Ravi Kumar; Andrew Tomkins
The NP-hard Max-k-cover problem requires selecting k sets from a collection so as to maximize the size of the union. This classic problem occurs commonly in many settings in web search and advertising. For moderately-sized instances, a greedy algorithm gives an approximation of (1-1/e). However, the greedy algorithm requires updating scores of arbitrary elements after each step, and hence becomes intractable for large datasets.
   We give the first max cover algorithm designed for today's large-scale commodity clusters. Our algorithm has provably almost the same approximation as greedy, but runs much faster. Furthermore, it can be easily expressed in the MapReduce programming paradigm, and requires only polylogarithmically many passes over the data. Our experiments on five large problem instances show that our algorithm is practical and can achieve good speedups compared to the sequential greedy algorithm.
Keywords: greedy algorithm, map-reduce, maximum cover
Stochastic models for tabbed browsing BIBAKFull-Text 241-250
  Flavio Chierichetti; Ravi Kumar; Andrew Tomkins
We present a model of tabbed browsing that represents a hybrid between a Markov process capturing the graph of hyperlinks, and a branching process capturing the birth and death of tabs. We present a mathematical criterion to characterize whether the process has a steady state independent of initial conditions, and we show how to characterize the limiting behavior in both cases. We perform a series of experiments to compare our tabbed browsing model with pagerank, and show that tabbed browsing is able to explain 15-25% of the deviation between actual measured browsing behavior and the behavior predicted by the simple pagerank model. We find this to be a surprising result, as the tabbed browsing model does not make use of any notion of site popularity, but simply captures deviations in user likelihood to open and close tabs from a particular node in the graph.
Keywords: branching process, convergence, random walks, stationary distribution, tabbed browsing
Using landing pages for sponsored search ad selection BIBAKFull-Text 251-260
  Yejin Choi; Marcus Fontoura; Evgeniy Gabrilovich; Vanja Josifovski; Mauricio Mediano; Bo Pang
We explore the use of the landing page content in sponsored search ad selection. Specifically, we compare the use of the ad's intrinsic content to augmenting the ad with the whole, or parts, of the landing page. We explore two types of extractive summarization techniques to select useful regions from the landing pages: out-of-context and in-context methods. Out-of-context methods select salient regions from the landing page by analyzing the content alone, without taking into account the ad associated with the landing page. In-context methods use the ad context (including its title, creative, and bid phrases) to help identify regions of the landing page that should be used by the ad selection engine. In addition, we introduce a simple yet effective unsupervised algorithm to enrich the ad context to further improve the ad selection. Experimental evaluation confirms that the use of landing pages can significantly improve the quality of ad selection. We also find that our extractive summarization techniques reduce the size of landing pages substantially, while retaining or even improving the performance of ad retrieval over the method that utilize the entire landing page.
Keywords: compositional semantics, extractive summarization, landing pages, sponsored search
The "Map Trap"?: an evaluation of map versus text-based interfaces for location-based mobile search services BIBAKFull-Text 261-270
  Karen Church; Joachim Neumann; Mauro Cherubini; Nuria Oliver
As the mobile Internet continues to grow, there is an increasing need to provide users with effective search and information access services. In order to build more effective mobile search services, we must first understand the impact that various interface choices have on mobile users. For example, the majority of mobile location-based search services are built on top of a map visualization, but is this intuitive design-decision the optimal interface choice from a human centric perspective? In order to tackle this fundamental design question, we have developed two proactive mobile search interfaces (one map-based and the other text-based) that utilize key mobile contexts to improve the search and information discovery experience of mobile users. In this paper, we present the results of an exploratory field study of these two interfaces -- involving 34 users over a 1 month period -- where we focus in particular on the impact that the type of user interface (e.g. map vs text) has on the search and information discovery experience of mobile users. We highlight the main usage results -- including that maps are not the interface of choice for certain information access tasks -- and outline key implications for the design of next generation mobile search services.
Keywords: location-based services, map-based interfaces, mobile interfaces, mobile internet, mobile search, mobile web, text-based interfaces, user study
Malicious interface design: exploiting the user BIBAKFull-Text 271-280
  Gregory Conti; Edward Sobiesk
In an ideal world, interface design is the art and science of helping users accomplish tasks in a timely, efficient, and pleasurable manner. This paper studies the inverse situation, the vast emergence of deliberately constructed malicious interfaces that violate design best practices in order to accomplish goals counter to those of the user. This has become a commonplace occurrence both on and off the desktop, particularly on the web. A primary objective of this paper is to formally define this problem, including construction of a taxonomy of malicious interface techniques and a preliminary analysis of their impact on users. Findings are presented that gauge the self-reported tolerance and expectation levels of users with regard to malicious interfaces as well as the effectiveness and ease of use of existing countermeasures. A second objective of this paper is to increase awareness, dialogue, and research in a domain that we consider largely unexplored but critical to future usability of the WWW. Our results were accomplished through significant compilation of malicious interface techniques based on review of thousands of web sites and by conducting three surveys. Ultimately, this paper concludes that malicious interfaces are a ubiquitous problem that demands intervention by the security and human computer interaction communities in order to reduce the negative impact on the global user population.
Keywords: adversarial interface design, design principles, evil interfaces, malicious interfaces, web usability guidelines
Detection and analysis of drive-by-download attacks and malicious JavaScript code BIBAKFull-Text 281-290
  Marco Cova; Christopher Kruegel; Giovanni Vigna
JavaScript is a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications. However, JavaScript code is also used to carry out attacks against the user's browser and its extensions. These attacks usually result in the download of additional malware that takes complete control of the victim's platform, and are, therefore, called "drive-by downloads." Unfortunately, the dynamic nature of the JavaScript language and its tight integration with the browser make it difficult to detect and block malicious JavaScript code.
   This paper presents a novel approach to the detection and analysis of malicious JavaScript code. Our approach combines anomaly detection with emulation to automatically identify malicious JavaScript code and to support its analysis. We developed a system that uses a number of features and machine-learning techniques to establish the characteristics of normal JavaScript code. Then, during detection, the system is able to identify anomalous JavaScript code by emulating its behavior and comparing it to the established profiles. In addition to identifying malicious code, the system is able to support the analysis of obfuscated code and to generate detection signatures for signature-based systems. The system has been made publicly available and has been used by thousands of analysts.
Keywords: anomaly detection, drive-by-download attacks, web client exploits
Competing for users' attention: on the interplay between organic and sponsored search results BIBAKFull-Text 291-300
  Cristian Danescu-Niculescu-Mizil; Andrei Z. Broder; Evgeniy Gabrilovich; Vanja Josifovski; Bo Pang
Queries on major Web search engines produce complex result pages, primarily composed of two types of information: organic results, that is, short descriptions and links to relevant Web pages, and sponsored search results, the small textual advertisements often displayed above or to the right of the organic results. Strategies for optimizing each type of result in isolation and the consequent user reaction have been extensively studied; however, the interplay between these two complementary sources of information has been ignored, a situation we aim to change. Our findings indicate that their perceived relative usefulness (as evidenced by user clicks) depends on the nature of the query. Specifically, we found that, when both sources focus on the same intent, for navigational queries there is a clear competition between ads and organic results, while for non-navigational queries this competition turns into synergy.
   We also investigate the relationship between the perceived usefulness of the ads and their textual similarity to the organic results, and propose a model that formalizes this relationship. To this end, we introduce the notion of responsive ads, which directly address the user's information need, and incidental ads, which are only tangentially related to that need. Our findings support the hypothesis that in the case of navigational queries, which are usually fully satisfied by the top organic result, incidental ads are perceived as more valuable than responsive ads, which are likely to be duplicative. On the other hand, in the case of non-navigational queries, incidental ads are perceived as less beneficial, possibly because they diverge too far from the actual user need.
   We hope that our findings and further research in this area will allow search engines to tune ad selection for an increased synergy between organic and sponsored results, leading to both higher user satisfaction and better monetization.
Keywords: ad selection, incidental ads, navigational queries, responsive ads, sponsored search, textual similarity, usefulness of ads
Inferring relevant social networks from interpersonal communication BIBAKFull-Text 301-310
  Munmun De Choudhury; Winter A. Mason; Jake M. Hofman; Duncan J. Watts
Researchers increasingly use electronic communication data to construct and study large social networks, effectively inferring unobserved ties (e.g. i is connected to j) from observed communication events (e.g. i emails j). Often overlooked, however, is the impact of tie definition on the corresponding network, and in turn the relevance of the inferred network to the research question of interest. Here we study the problem of network inference and relevance for two email data sets of different size and origin. In each case, we generate a family of networks parameterized by a threshold condition on the frequency of emails exchanged between pairs of individuals. After demonstrating that different choices of the threshold correspond to dramatically different network structures, we then formulate the relevance of these networks in terms of a series of prediction tasks that depend on various network features. In general, we find: a) that prediction accuracy is maximized over a non-trivial range of thresholds corresponding to 5-10 reciprocated emails per year; b) that for any prediction task, choosing the optimal value of the threshold yields a sizable (~30%) boost in accuracy over naive choices; and c) that the optimal threshold value appears to be (somewhat surprisingly) consistent across data sets and prediction tasks. We emphasize the practical utility in defining ties via their relevance to the prediction task(s) at hand and discuss implications of our empirical results.
Keywords: communication networks, email, learning, network structure, network thresholds, social network analysis, social networks, ties
Scalable techniques for document identifier assignment ininverted indexes BIBAKFull-Text 311-320
  Shuai Ding; Josh Attenberg; Torsten Suel
Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast end effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size.
   In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
Keywords: documentID reassignment, index compression
Do you want to take notes?: identifying research missions in Yahoo! search pad BIBAKFull-Text 321-330
  Debora Donato; Francesco Bonchi; Tom Chi; Yoelle Maarek
Addressing user's information needs has been one of the main goals of Web search engines since their early days. In some cases, users cannot see their needs immediately answered by search results, simply because these needs are too complex and involve multiple aspects that are not covered by a single Web or search results page. This typically happens when users investigate a certain topic in domains such as education, travel or health, which often require collecting facts and information from many pages. We refer to this type of activities as "research missions". These research missions account for 10% of users' sessions and more than 25% of all query volume, as verified by a manual analysis that was conducted by Yahoo! editors.
   We demonstrate in this paper that such missions can be automatically identified on-the-fly, as the user interacts with the search engine, through careful runtime analysis of query flows and query sessions.
   The on-the-fly automatic identification of research missions has been implemented in Search Pad, a novel Yahoo! application that was launched in 2009, and that we present in this paper. Search Pad helps users keeping trace of results they have consulted. Its novelty however is that unlike previous notes taking products, it is automatically triggered only when the system decides, with a fair level of confidence, that the user is undertaking a research mission and thus is in the right context for gathering notes. Beyond the Search Pad specific application, we believe that changing the level of granularity of query modeling, from an isolated query to a list of queries pertaining to the same research missions, so as to better reflect a certain type of information needs, can be beneficial in a number of other Web search applications. Session-awareness is growing and it is likely to play, in the near future, a fundamental role in many on-line tasks: this paper presents a first step on this path.
Keywords: persistent search, query logs, sessions, task-oriented search
Time is of the essence: improving recency ranking using Twitter data BIBAKFull-Text 331-340
  Anlei Dong; Ruiqiang Zhang; Pranam Kolari; Jing Bai; Fernando Diaz; Yi Chang; Zhaohui Zheng; Hongyuan Zha
Realtime web search refers to the retrieval of very fresh content which is in high demand. An effective portal web search engine must support a variety of search needs, including realtime web search. However, supporting realtime web search introduces two challenges not encountered in non-realtime web search: quickly crawling relevant content and ranking documents with impoverished link and click information. In this paper, we advocate the use of realtime micro-blogging data for addressing both of these problems. We propose a method to use the micro-blogging data stream to detect fresh URLs. We also use micro-blogging data to compute novel and effective features for ranking fresh URLs. We demonstrate these methods improve effective of the portal web search engine for realtime web search.
Keywords: Twitter, recency modeling, recency ranking
Highlighting disputed claims on the web BIBAKFull-Text 341-350
  Rob Ennals; Beth Trushkowsky; John Mark Agosta
We describe Dispute Finder, a browser extension that alerts a user when information they read online is disputed by a source that they might trust. Dispute Finder examines the text on the page that the user is browsing and highlights any phrases that resemble known disputed claims. If a user clicks on a highlighted phrase then Dispute Finder shows them a list of articles that support other points of view.
   Dispute Finder builds a database of known disputed claims by crawling web sites that already maintain lists of disputed claims, and by allowing users to enter claims that they believe are disputed. Dispute Finder identifies snippets that make known disputed claims by running a simple textual entailment algorithm inside the browser extension, referring to a cached local copy of the claim database.
   In this paper, we explain the design of Dispute Finder, and the trade-offs between the various design decisions that we explored.
Keywords: annotation, argumentation, cscw, sensemaking, web
Privacy wizards for social networking sites BIBAKFull-Text 351-360
  Lujun Fang; Kristen LeFevre
Privacy is an enormous problem in online social networking sites. While sites such as Facebook allow users fine-grained control over who can see their profiles, it is difficult for average users to specify this kind of detailed policy.
   In this paper, we propose a template for the design of a social networking privacy wizard. The intuition for the design comes from the observation that real users conceive their privacy preferences (which friends should be able to see which information) based on an implicit set of rules. Thus, with a limited amount of user input, it is usually possible to build a machine learning model that concisely describes a particular user's preferences, and then use this model to configure the user's privacy settings automatically.
   As an instance of this general framework, we have built a wizard based on an active learning paradigm called uncertainty sampling. The wizard iteratively asks the user to assign privacy "labels" to selected ("informative") friends, and it uses this input to construct a classifier, which can in turn be used to automatically assign privileges to the rest of the user's (unlabeled) friends.
   To evaluate our approach, we collected detailed privacy preference data from 45 real Facebook users. Our study revealed two important things. First, real users tend to conceive their privacy preferences in terms of communities, which can easily be extracted from a social network graph using existing techniques. Second, our active learning wizard, using communities as features, is able to recommend high-accuracy privacy settings using less user input than existing policy-specification tools.
Keywords: active learning, social network privacy, usability
A novel traffic analysis for identifying search fields in the long tail of web sites BIBAKFull-Text 361-370
  George Forman; Evan Kirshenbaum; Shyamsundar Rajaram
Using a clickstream sample of 2 billion URLs from many thousand volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. In order to do this, we need to be able to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in 'http://www.google.com/search?hl=en&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, this does not conform to traditional text classification or to multi-class problem formulations. The problem also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution.
   We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint -- which we believe has broader applicability -- can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.
Keywords: active learning, clickstream analysis, machine learning classification, web data mining
Expressive auctions for externalities in online advertising BIBAKFull-Text 371-380
  Arpita Ghosh; Amin Sayedi
When online ads are shown together, they compete for user attention and conversions, imposing negative externalities on each other. While the competition for user attention in sponsored search can be captured via models of clickthrough rates, the post-click competition for conversions cannot: since the value-per-click of an advertiser is proportional to the conversion probability conditional on a click, which depends on the other ads displayed, the private value of an advertiser is no longer one-dimensional, and the GSP mechanism is not adequately expressive. We study the design of expressive GSP-like mechanisms for the simplest form that an advertiser's private value can have in the presence of such externalities -- an advertiser's value depends on exclusivity, i.e., whether her ad is shown exclusively, or along with other ads.
   Our auctions take as input two-dimensional (per-click) bids for exclusive and nonexclusive display, and have two types of outcomes: either a single ad is displayed exclusively, or multiple ads are simultaneously shown. We design two expressive auctions that are both extensions of GSP- the first auction, GSP2D, is designed with the property that the allocation and pricing are identical to GSP when multiple ads are shown; the second auction, NP2D, is designed to be a next price auction. We show that both auctions have high efficiency and revenue in all reasonable equilibria; further, the NP2D auction is guaranteed to always have an equilibrium with revenue at least as much as the current GSP mechanism. However, we find that unlike with one-dimensional valuations, the GSP-like auctions for these richer valuations do not always preserve efficiency and revenue with respect to the VCG mechanism.
Keywords: auction design, expressive auctions, externalities, online advertising
Tracking the random surfer: empirically measured teleportation parameters in PageRank BIBAKFull-Text 381-390
  David F. Gleich; Paul G. Constantine; Abraham D. Flaxman; Asela Gunawardana
PageRank computes the importance of each node in a directed graph under a random surfer model governed by a teleportation parameter. Commonly denoted alpha, this parameter models the probability of following an edge inside the graph or, when the graph comes from a network of web pages and links, clicking a link on a web page. We empirically measure the teleportation parameter based on browser toolbar logs and a click trail analysis. For a particular user or machine, such analysis produces a value of alpha. We find that these values nicely fit a Beta distribution with mean edge-following probability between 0.3 and 0.7, depending on the site. Using these distributions, we compute PageRank scores where PageRank is computed with respect to a distribution as the teleportation parameter, rather than a constant teleportation parameter. These new metrics are evaluated on the graph of pages in Wikipedia.
Keywords: PageRank, Wikipedia, click trail analysis, empirical click probability, teleportation parameter, toolbar data
Document recommendation in social tagging services BIBAKFull-Text 391-400
  Ziyu Guan; Can Wang; Jiajun Bu; Chun Chen; Kun Yang; Deng Cai; Xiaofei He
Social tagging services allow users to annotate various online resources with freely chosen keywords (tags). They not only facilitate the users in finding and organizing online resources, but also provide meaningful collaborative semantic data which can potentially be exploited by recommender systems. Traditional studies on recommender systems focused on user rating data, while recently social tagging data is becoming more and more prevalent. How to perform resource recommendation based on tagging data is an emerging research topic. In this paper we consider the problem of document (e.g. Web pages, research papers) recommendation using purely tagging data. That is, we only have data containing users, tags, documents and the relationships among them. We propose a novel graph-based representation learning algorithm for this purpose. The users, tags and documents are represented in the same semantic space in which two related objects are close to each other. For a given user, we recommend those documents that are sufficiently close to him/her. Experimental results on two data sets crawled from Del.icio.us and CiteULike show that our algorithm can generate promising recommendations and outperforms traditional recommendation algorithms.
Keywords: recommender systems, social tagging
Equip tourists with knowledge mined from travelogues BIBAKFull-Text 401-410
  Qiang Hao; Rui Cai; Changhu Wang; Rong Xiao; Jiang-Ming Yang; Yanwei Pang; Lei Zhang
With the prosperity of tourism and Web 2.0 technologies, more and more people have willingness to share their travel experiences on the Web (e.g., weblogs, forums, or Web 2.0 communities). These so-called travelogues contain rich information, particularly including location-representative knowledge such as attractions (e.g., Golden Gate Bridge), styles (e.g., beach, history), and activities (e.g., diving, surfing). The location-representative information in travelogues can greatly facilitate other tourists' trip planning, if it can be correctly extracted and summarized. However, since most travelogues are unstructured and contain much noise, it is difficult for common users to utilize such knowledge effectively. In this paper, to mine location-representative knowledge from a large collection of travelogues, we propose a probabilistic topic model, named as Location-Topic model. This model has the advantages of (1) differentiability between two kinds of topics, i.e., local topics which characterize locations and global topics which represent other common themes shared by various locations, and (2) representation of locations in the local topic space to encode both location-representative knowledge and similarities between locations. Some novel applications are developed based on the proposed model, including (1) destination recommendation for on flexible queries, (2) characteristic summarization for a given destination with representative tags and snippets, and (3) identification of informative parts of a travelogue and enriching such highlights with related images. Based on a large collection of travelogues, the proposed framework is evaluated using both objective and subjective evaluation methods and shows promising results.
Keywords: probabilistic topic model, recommendation, travelogue mining
Data summaries for on-demand queries over linked data BIBAKFull-Text 411-420
  Andreas Harth; Katja Hose; Marcel Karnstedt; Axel Polleres; Kai-Uwe Sattler; Jürgen Umbrich
Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. However, this time-consuming pre-processing phase however leverages the benefits of Linked Data -- where structured data is accessible live and up-to-date at distributed Web resources that may change constantly -- only to a limited degree, as query results can never be current. An ideal query answering system for Linked Data should return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors evaluating queries directly on the live sources require knowledge of the contents of data sources. In this paper, we develop and evaluate an approximate index structure summarising graph-structured content of sources adhering to Linked Data principles, provide an algorithm for answering conjunctive queries over Linked Data on the Web exploiting the source summary, and evaluate the system using synthetically generated queries. The experimental results show that our lightweight index structure enables complete and up-to-date query results over Linked Data, while keeping the overhead for querying low and providing a satisfying source ranking at no additional cost.
Keywords: index structures, linked data, rdf querying
Context-aware citation recommendation BIBAKFull-Text 421-430
  Qi He; Jian Pei; Daniel Kifer; Prasenjit Mitra; Lee Giles
When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want to make some citations? In this paper, we present our initiative of building a context-aware citation recommendation system. High quality citation recommendation is challenging: not only should the citations recommended be relevant to the paper under composition, but also should match the local contexts of the places citations are made. Moreover, it is far from trivial to model how the topic of the whole paper and the contexts of the citation places should affect the selection and ranking of citations. To tackle the problem, we develop a context-aware approach. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively. Moreover, it can recommend a set of citations for a paper with high quality. We implement a prototype system in CiteSeerX. An extensive empirical evaluation in the CiteSeerX digital library against many baselines demonstrates the effectiveness and the scalability of our approach.
Keywords: bibliometrics, context, Gleason's theorem, recommender systems
The anatomy of a large-scale social search engine BIBAKFull-Text 431-440
  Damon Horowitz; Sepandar D. Kamvar
We present Aardvark, a social search engine. With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user's extended social network most likely to be able to answer that question. As compared to a traditional web search engine, where the challenge lies in finding the right document to satisfy a user's information need, the challenge in a social search engine like Aardvark lies in finding the right person to satisfy a user's information need. Further, while trust in a traditional search engine is based on authority, in a social search engine like Aardvark, trust is based on intimacy. We describe how these considerations inform the architecture, algorithms, and user interface of Aardvark, and how they are reflected in the behavior of Aardvark users.
Keywords: social search
Multi-modality in one-class classification BIBAKFull-Text 441-450
  Matthijs Hovelynck; Boris Chidlovskii
We propose a method for improving classification performance in a one-class setting by combining classifiers of different modalities. We apply the method to the problem of distinguishing responsive documents in a corpus of e-mails, like Enron Corpus. We extract the social network of actors which is implicit in a large body of electronic communication and turn it into valuable features for classifying the exchanged documents. Working in a one-class setting we follow a semi-supervised approach based on the Mapping Convergence framework. We propose an alternative interpretation, that allows for broader applicability when positive and negative items are not naturally separable. We propose an extension to the one-class evaluation framework in truly one-case cases when only some positive training examples are available. We extent the one-class setting to the co-training principle that enables us to take advantage of multiple views on the data. We report evaluation results of this extension on three different corpora including Enron Corpus.
Keywords: co-training, Enron, one-class classification, responsiveness
Exploring web scale language models for search query processing BIBAKFull-Text 451-460
  Jian Huang; Jianfeng Gao; Jiangbo Miao; Xiaolong Li; Kuansan Wang; Fritz Behr; C. Lee Giles
It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the extent of the language differences has been lacking. In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information theoretical analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation to the observations in past work.
   We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation. By controlling the size and the order of different language models, we find that the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains for query bracketing for instance, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have marked accuracy advantage over smaller ones.
Keywords: language models, search query processing, very large-scale experiments, web n-gram
A scalable machine-learning approach for semi-structured named entity recognition BIBAKFull-Text 461-470
  Utku Irmak; Reiner Kraft
Named entity recognition studies the problem of locating and classifying parts of free text into a set of predefined categories. Although extensive research has focused on the detection of person, location and organization entities, there are many other entities of interest, including phone numbers, dates, times and currencies (to name a few examples). We refer to these types of entities as "semi-structured named entities", since they usually follow certain syntactic formats according to some conventions, although their structure is typically not well-defined. Regular expression solutions require significant amount of manual effort and supervised machine learning approaches rely on large sets of labeled training data. Therefore, these approaches do not scale when we need to support many semi-structured entity types in many languages and regions.
   In this paper, we study this problem and propose a novel three-level bootstrapping framework for the detection of semi-structured entities. We describe the proposed techniques for phone, date and time entities, and perform extensive evaluations on English, German, Polish, Swedish and Turkish documents. Despite the minimal input from the user, our approach can achieve 95% precision and 84% recall for phone entities, and 94% precision and 81% recall for date and time entities, on average. We also discuss implementation details and report run time performance results, which show significant improvements over regular expression based solutions.
Keywords: boostrapping algorithm, ner, weakly-supervised learning
Autonomous resource provisioning for multi-service web applications BIBAKFull-Text 471-480
  Dejun Jiang; Guillaume Pierre; Chi-Hung Chi
Dynamic resource provisioning aims at maintaining the end-to-end response time of a web application within a pre-defined SLA. Although the topic has been well studied for monolithic applications, provisioning resources for applications composed of multiple services remains a challenge. When the SLA is violated, one must decide which service(s) should be reprovisioned for optimal effect. We propose to assign an SLA only to the front-end service. Other services are not given any particular response time objectives. Services are autonomously responsible for their own provisioning operations and collaboratively negotiate performance objectives with each other to decide the provisioning service(s). We demonstrate through extensive experiments that our system can add/remove/shift both servers and caches within an entire multi-service application under varying workloads to meet the SLA target and improve resource utilization.
Keywords: multi-service application, resource provisioning
Topic initiator detection on the world wide web BIBAKFull-Text 481-490
  Xin Jin; Scott Spangler; Rui Ma; Jiawei Han
In this paper we introduce a new Web mining and search technique -- Topic Initiator Detection (TID) on the Web. Given a topic query on the Internet and the resulting collection of time-stamped web documents which contain the query keywords, the task of TID is to automatically return which web document (or its author) initiated the topic or was the first to discuss about the topic.
   To deal with the TID problem, we design a system framework and propose algorithm InitRank (Initiator Ranking) to rank the web documents by their possibility to be the topic initiator. We first extract features from the web documents and design several topic initiator indicators. Then, we propose a TCL graph which integrates the Time, Content and Link information and design an optimization framework over the graph to compute InitRank. Experiments show that compared with baseline methods, such as direct time sorting, well-known link based ranking algorithms PageRank and HITS, InitRank achieves the best overall performance with high effectiveness and robustness. In case studies, we successfully detected (1) the first web document related to a famous rumor of an Australia product banned in USA and (2) the pre-release of IBM and Google Cloud Computing collaboration before the official announcement.
Keywords: information retrieval, ranking, topic initiator, web mining
Smart caching for web browsers BIBAKFull-Text 491-500
  Kaimin Zhang; Lu Wang; Aimin Pan; Bin Benjamin Zhu
This paper presents smart caching schemes for Web browsers. For modern Web applications, the style formatting and layout calculation often account for substantial amounts of the local computation in order to render a Web page. In this paper, we propose two caching schemes to reduce the computation of style formatting and layout calculation, named smart style caching and layout caching, respectively. The stable style data and layout data for DOM (Document Object Model) elements are recorded to construct the caches when a Web page is browsed. The cached data is checked in the granularity of DOM elements and applied directly if the identified DOM element is not changed in the sequent visits to the same page.
   We have implemented a prototype of the proposed caching schemes based on the Webkit layout engine. The experimental results with Web pages from the Top 25 Web sites show that, with the smart style caching scheme enabled, the time consumed for style formatting is reduced by 64% on average; with both the smart style caching scheme and layout caching scheme enabled, the time consumed for layout calculation are reduced by 61% on average, and the overall performance improvement is about 46%.
Keywords: browser, caching, cascade style sheet, css, javascript, web
Large-scale bot detection for search engines BIBAKFull-Text 501-510
  Hongwen Kang; Kuansan Wang; David Soukal; Fritz Behr; Zijian Zheng
In this paper, we propose a semi-supervised learning approach for classifying program (bot) generated web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous amount of search data pose to traditional approaches that rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem in multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels, though directly using these training data is problematic because the data thus sampled are biased. To tackle this problem, we further develop a semi-supervised learning algorithm to take advantage of the unlabeled data to improve the classification performance. These two proposed algorithms can be seamlessly combined and very cost efficient to scale the training process. In our experiment, the proposed approach showed significant (i.e. 2:1) improvement compared to the traditional supervised approach.
Keywords: bot detection, captcha, click logs, query logs, search engine, semi-supervised learning
LCA-based selection for XML document collections BIBAKFull-Text 511-520
  Georgia Koloniari; Evaggelia Pitoura
In this paper, we address the problem of database selection for XML document collections, that is, given a set of collections and a user query, how to rank the collections based on their goodness to the query. Goodness is determined by the relevance of the documents in the collection to the query. We consider keyword queries and support Lowest Common Ancestor (LCA) semantics for defining query results, where the relevance of each document to a query is determined by properties of the LCA of those nodes in the XML document that contain the query keywords. To avoid evaluating queries against each document in a collection, we propose maintaining in a preprocessing phase, information about the LCAs of all pairs of keywords in a document and use it to approximate the properties of the LCA-based results of a query. To improve storage and processing efficiency, we use appropriate summaries of the LCA information based on Bloom filters. We address both a boolean and a weighted version of the database selection problem. Our experimental results show that our approach incurs low errors in the estimation of the goodness of a collection and provides rankings that are very close to the actual ones.
Keywords: database selection, lowest common ancestor, xml
Stop thinking, start tagging: tag semantics emerge from collaborative verbosity BIBAKFull-Text 521-530
  Christian Körner; Dominik Benz; Andreas Hotho; Markus Strohmaier; Gerd Stumme
Recent research provides evidence for the presence of emergent semantics in collaborative tagging systems. While several methods have been proposed, little is known about the factors that influence the evolution of semantic structures in these systems. A natural hypothesis is that the quality of the emergent semantics depends on the pragmatics of tagging: Users with certain usage patterns might contribute more to the resulting semantics than others. In this work, we propose several measures which enable a pragmatic differentiation of taggers by their degree of contribution to emerging semantic structures. We distinguish between categorizers, who typically use a small set of tags as a replacement for hierarchical classification schemes, and describers, who are annotating resources with a wealth of freely associated, descriptive keywords. To study our hypothesis, we apply semantic similarity measures to 64 different partitions of a real-world and large-scale folksonomy containing different ratios of categorizers and describers. Our results not only show that "verbose" taggers are most useful for the emergence of tag semantics, but also that a subset containing only 40% of the most 'verbose' taggers can produce results that match and even outperform the semantic precision obtained from the whole dataset. Moreover, the results suggest that there exists a causal link between the pragmatics of tagging and resulting emergent semantics. This work is relevant for designers and analysts of tagging systems interested (i) in fostering the semantic development of their platforms, (ii) in identifying users introducing "semantic noise", and (iii) in learning ontologies.
Keywords: folksonomies, pragmatics, semantics, tagging, user characteristics
Mind the data skew: distributed inferencing by speeddating in elastic regions BIBAKFull-Text 531-540
  Spyros Kotoulas; Eyal Oren; Frank van Harmelen
Semantic Web data exhibits very skewed frequency distributions among terms. Efficient large-scale distributed reasoning methods should maintain load-balance in the face of such highly skewed distribution of input data. We show that term-based partitioning, used by most distributed reasoning approaches, has limited scalability due to load-balancing problems.
   We address this problem with a method for data distribution based on clustering in elastic regions. Instead of assigning data to fixed peers, data flows semi-randomly in the network. Data items "speed-date" while being temporarily collocated in the same peer. We introduce a bias in the routing to allow semantically clustered neighborhoods to emerge. Our approach is self-organising, efficient and does not require any central coordination.
   We have implemented this method on the MaRVIN platform and have performed experiments on large real-world datasets, using a cluster of up to 64 nodes. We compute the RDFS closure over different datasets and show that our clustering algorithm drastically reduces computation time, calculating the RDFS closure of 200 million triples in 7.2 minutes.
Keywords: distributed, peer-to-peer, reasoning, self-organisation
Towards natural question guided search BIBAKFull-Text 541-550
  Alexander Kotov; ChengXiang Zhai
Web search is generally motivated by an information need. Since asking well-formulated questions is the fastest and the most natural way to obtain information for human beings, almost all queries posed to search engines correspond to some underlying questions, which reflect the user's information need. Accurate determination of these questions may substantially improve the quality of search results and usability of search interfaces. In this paper, we propose a new framework for question-guided search, in which a retrieval system would automatically generate potentially interesting questions to users based on the search results of a query. Since the answers to such questions are known to exist in the search results, these questions can potentially guide users directly to the answers that they are looking for, eliminating the need to scan the documents in the result list. Moreover, in case of imprecise or ambiguous queries, automatically generated questions can naturally engage users into a feedback cycle to refine their information need and guide them towards their search goals. Implementation of the proposed strategy raises new challenges in content indexing, question generation, ranking and feedback. We propose new methods to address these challenges and evaluated them with a prototype system on a subset of Wikipedia. Evaluation results show the promise of this new question-guided search strategy.
Keywords: interactive search, query processing, search interface
Fine-grained privilege separation for web applications BIBAKFull-Text 551-560
  Akshay Krishnamurthy; Adrian Mettler; David Wagner
We present a programming model for building web applications with security properties that can be confidently verified during a security review. In our model, applications are divided into isolated, privilege-separated components, enabling rich security policies to be enforced in a way that can be checked by reviewers. In our model, the web framework enforces privilege separation and isolation of web applications by requiring the use of an object-capability language and providing interfaces that expose limited, explicitly-specified privileges to application components. This approach restricts what each component of the application can do and quarantines buggy or compromised code. It also provides a way to more safely integrate third-party, less-trusted code into a web application. We have implemented a prototype of this model based upon the Java Servlet framework and used it to build a webmail application. Our experience with this example suggests that the approach is viable and helpful at establishing reviewable application-specific security properties.
Keywords: object-capabilities, privilege separation, web applications
A characterization of online browsing behavior BIBAKFull-Text 561-570
  Ravi Kumar; Andrew Tomkins
In this paper, we undertake a large-scale study of online user behavior based on search and toolbar logs. We propose a new CCS taxonomy of pageviews consisting of Content (news, portals, games, verticals, multimedia), Communication (email, social networking, forums, blogs, chat), and Search (Web search, item search, multimedia search). We show that roughly half of all pageviews online are content, one-third are communications, and the remaining one-sixth are search. We then give further breakdowns to characterize the pageviews within each high-level category.
   We then study the extent to which pages of certain types are revisited by the same user over time, and the mechanisms by which users move from page to page, within and across hosts, and within and across page types. We consider robust schemes for assigning responsibility for a pageview to ancestors along the chain of referrals. We show that mail, news, and social networking pageviews are insular in nature, appearing primarily in homogeneous sessions of one type. Search pageviews, on the other hand, appear on the path to a disproportionate number of pageviews, but cannot be viewed as the principal mechanism by which those pageviews were reached.
   Finally, we study the burstiness of pageviews associated with a URL, and show that by and large, online browsing behavior is not significantly affected by "breaking" material with non-uniform visit frequency.
Keywords: browsing, pageviews, toolbar analysis
Generalized distances between rankings BIBAKFull-Text 571-580
  Ravi Kumar; Sergei Vassilvitskii
Spearman's footrule and Kendall's tau are two well established distances between rankings. They, however, fail to take into account concepts crucial to evaluating a result set in information retrieval: element relevance and positional information. That is, changing the rank of a highly-relevant document should result in a higher penalty than changing the rank of an irrelevant document; a similar logic holds for the top versus the bottom of the result ordering. In this work, we extend both of these metrics to those with position and element weights, and show that a variant of the Diaconis-Graham inequality still holds -- the generalized two measures remain within a constant factor of each other for all permutations.
   We continue by extending the element weights into a distance metric between elements. For example, in search evaluation, swapping the order of two nearly duplicate results should result in little penalty, even if these two are highly relevant and appear at the top of the list. We extend the distance measures to this more general case and show that they remain within a constant factor of each other.
   We conclude by conducting simple experiments on web search data with the proposed measures. Our experiments show that the weighted generalizations are more robust and consistent with each other than their unweighted counter-parts.
Keywords: kendall's tau, metrics, permutation distances, spearman footrule
Redundancy detection in service-oriented systems BIBAKFull-Text 581-590
  Peep Küngas; Marlon Dumas
This paper addresses the problem of identifying redundant data in large-scale service-oriented information systems. Specifically, the paper puts forward an automated method to pinpoint potentially redundant data attributes from a given collection of semantically-annotated Web service interfaces. The key idea is to construct a service network to represent all input and output dependencies between data attributes and operations captured in the service interfaces, and to apply centrality measures from network theory in order to quantify the degree to which an attribute belongs to a given subsystem. The proposed method was tested on a federated governmental information system consisting of 58 independently-maintained information systems providing altogether about 1000 service operations described in WSDL. The accuracy of the method is evaluated in terms of precision and recall.
Keywords: data model redundancy, federated information systems, metrics, semantic web services, web services
What is Twitter, a social network or a news media? BIBAKFull-Text 591-600
  Haewoon Kwak; Changhyun Lee; Hosung Park; Sue Moon
Twitter, a microblogging service less than three years old, commands more than 41 million users as of July 2009 and is growing fast. Twitter users tweet about any topic within the 140-character limit and follow others to receive their tweets. The goal of this paper is to study the topological characteristics of Twitter and its power as a new medium of information sharing.
   We have crawled the entire Twitter site and obtained 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks [28]. In order to identify influentials on Twitter, we have ranked users by the number of followers and by PageRank and found two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap in influence inferred from the number of followers and that from the popularity of one's tweets. We have analyzed the tweets of top trending topics and reported on their temporal behavior and user participation. We have classified the trending topics based on the active period and the tweets and show that the majority (over 85%) of topics are headline news or persistent news in nature. A closer look at retweets reveals that any retweeted tweet is to reach an average of 1,000 users no matter what the number of followers is of the original tweet. Once retweeted, a tweet gets retweeted almost instantly on next hops, signifying fast diffusion of information after the 1st retweet.
   To the best of our knowledge this work is the first quantitative study on the entire Twittersphere and information diffusion on it.
Keywords: Twitter, degree of separation, homophily, influential, information diffusion, online social network, pagerank, reciprocity, retweet
Randomization tests for distinguishing social influence and homophily effects BIBAKFull-Text 601-610
  Timothy La Fond; Jennifer Neville
Relational autocorrelation is ubiquitous in relational domains. This observed correlation between class labels of linked instances in a network (e.g., two friends are more likely to share political beliefs than two randomly selected people) can be due to the effects of two different social processes. If social influence effects are present, instances are likely to change their attributes to conform to their neighbor values. If homophily effects are present, instances are likely to link to other individuals with similar attribute values. Both these effects will result in autocorrelated attribute values. When analyzing static relational networks it is impossible to determine how much of the observed correlation is due each of these factors. However, the recent surge of interest in social networks has increased the availability of dynamic network data. In this paper, we present a randomization technique for temporal network data where the attributes and links change over time. Given data from two time steps, we measure the gain in correlation and assess whether a significant portion of this gain is due to influence and/or homophily. We demonstrate the efficacy of our method on semi-synthetic data and then apply the method to a real-world social networks dataset, showing the impact of both influence and homophily effects.
Keywords: homophily, randomization, social influence, social networks
A pattern tree-based approach to learning URL normalization rules BIBAKFull-Text 611-620
  Tao Lei; Rui Cai; Jiang-Ming Yang; Yan Ke; Xiaodong Fan; Lei Zhang
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules. Nowadays URL normalization has attracted significant attention as it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and the offline (e.g. index compression) parts of a search engine. To deal with a large scale of websites, automatic approaches are highly desired to learn rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules by iteratively inducing local duplicate pairs to more general forms, and inevitably suffer from noisy training data and are practically inefficient. Given a training set of URLs partitioned into duplicate clusters for a targeted website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. With the pattern tree, the statistical information from all the training samples is leveraged to make the learning process more robust and reliable. The learning process is also accelerated as rules are directly summarized based on pattern tree nodes. In addition, from an engineering perspective, the pattern tree helps select deployable rules by removing conflicts and redundancies. An evaluation on more than 70 million duplicate URLs from 200 websites showed that the proposed approach achieves very promising performance, in terms of both de-duping effectiveness and computational efficiency.
Keywords: url deduplication, url normalization, url pattern
Using a model of social dynamics to predict popularity of news BIBAKFull-Text 621-630
  Kristina Lerman; Tad Hogg
Popularity of content in social media is unequally distributed, with some items receiving a disproportionate share of attention from users. Predicting which newly-submitted items will become popular is critically important for both companies that host social media sites and their users. Accurate and timely prediction would enable the companies to maximize revenue through differential pricing for access to content or ad placement. Prediction would also give consumers an important tool for filtering the ever-growing amount of content. Predicting popularity of content in social media, however, is challenging due to the complex interactions among content quality, how the social media site chooses to highlight content, and influence among users. While these factors make it difficult to predict popularity a priori, we show that stochastic models of user behavior on these sites allows predicting popularity based on early user reactions to new content. By incorporating aspects of the web site design, such models improve on predictions based on simply extrapolating from the early votes. We validate this claim on the social news portal Digg using a previously-developed model of social voting based on the Digg user interface.
Keywords: popularity, prediction, social dynamics, social media, social voting
Empirical comparison of algorithms for network community detection BIBAKFull-Text 631-640
  Jure Leskovec; Kevin J. Lang; Michael Mahoney
Detecting clusters or communities in large real-world graphs such as large social or information networks is a problem of considerable interest. In practice, one typically chooses an objective function that captures the intuition of a network cluster as set of nodes with better internal connectivity than external connectivity, and then one applies approximation algorithms or heuristics to extract sets of nodes that are related to the objective function and that "look like" good communities for the application of interest.
   In this paper, we explore a range of network community detection methods in order to compare them and to understand their relative performance and the systematic biases in the clusters they identify. We evaluate several common objective functions that are used to formalize the notion of a network community, and we examine several different classes of approximation algorithms that aim to optimize such objective functions. In addition, rather than simply fixing an objective and asking for an approximation to the best cluster of any size, we consider a size-resolved version of the optimization problem. Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior.
Keywords: community structure, conductance, flow-based methods, graph partitioning, spectral methods
Predicting positive and negative links in online social networks BIBAKFull-Text 641-650
  Jure Leskovec; Daniel Huttenlocher; Jon Kleinberg
We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arise in a variety of online settings; we study datasets from Epinions, Slashdot and Wikipedia. We find that the signs of links in the underlying social networks can be predicted with high accuracy, using models that generalize across this diverse range of sites. These models provide insight into some of the fundamental principles that drive the formation of signed links in networks, shedding light on theories of balance and status from social psychology; they also suggest social computing applications by which the attitude of one user toward another can be estimated from evidence provided by their relationships with other members of the surrounding social network.
Keywords: distrust, negative edges, positive edges, signed networks, status theory, structural balance, trust
Facetedpedia: dynamic generation of query-dependent faceted interfaces for wikipedia BIBAKFull-Text 651-660
  Chengkai Li; Ning Yan; Senjuti B. Roy; Lekhendro Lisham; Gautam Das
This paper proposes Facetedpedia, a faceted retrieval system for information discovery and exploration in Wikipedia. Given the set of Wikipedia articles resulting from a keyword query, Facetedpedia generates a faceted interface for navigating the result articles. Compared with other faceted retrieval systems, Facetedpedia is fully automatic and dynamic in both facet generation and hierarchy construction, and the facets are based on the rich semantic information from Wikipedia. The essence of our approach is to build upon the collaborative vocabulary in Wikipedia, more specifically the intensive internal structures (hyperlinks) and folksonomy (category system). Given the sheer size and complexity of this corpus, the space of possible choices of faceted interfaces is prohibitively large. We propose metrics for ranking individual facet hierarchies by user's navigational cost, and metrics for ranking interfaces (each with k facets) by both their average pairwise similarities and average navigational costs. We thus develop faceted interface discovery algorithms that optimize the ranking metrics. Our experimental evaluation and user study verify the effectiveness of the system.
Keywords: data exploration, faceted search, wikipedia
A contextual-bandit approach to personalized news article recommendation BIBAKFull-Text 661-670
  Lihong Li; Wei Chu; John Langford; Robert E. Schapire
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation.
   In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
   The contributions of this work are three-fold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce.
Keywords: contextual bandit, exploration/exploitation dilemma, personalization, recommender systems, web service
b-Bit minwise hashing BIBAKFull-Text 671-680
  Ping Li; Christian König
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc.
   By only storing b bits of each hashed value (e.g., b=1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance >0.5.
Keywords: hashing, similarity estimation
Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce BIBAKFull-Text 681-690
  Chao Liu; Hung-chih Yang; Jinliang Fan; Li-Wei He; Yi-Min Wang
The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.
Keywords: distributed computing, dyadic data, mapreduce, nonnegative matrix factorization
Exploiting social context for review quality prediction BIBAKFull-Text 691-700
  Yue Lu; Panayiotis Tsaparas; Alexandros Ntoulas; Livia Polanyi
Online reviews in which users publish detailed commentary about their experiences and opinions with products, services, or events are extremely valuable to users who rely on them to make informed decisions. However, reviews vary greatly in quality and are constantly increasing in number, therefore, automatic assessment of review helpfulness is of growing importance. Previous work has addressed the problem by treating a review as a stand-alone document, extracting features from the review text, and learning a function based on these features for predicting the review quality. In this work, we exploit contextual information about authors' identities and social networks for improving review quality prediction. We propose a generic framework for incorporating social context information by adding regularization constraints to the text-based predictor. Our approach can effectively use the social context information available for large quantities of unlabeled reviews. It also has the advantage that the resulting predictor is usable even when social context is unavailable. We validate our framework within a real commerce portal and experimentally demonstrate that using social context information can help improve the accuracy of review quality prediction especially when the available training data is sparse.
Keywords: graph regularization, review helpfulness, review quality, social network
Sampling community structure BIBAKFull-Text 701-710
  Arun S. Maiya; Tanya Y. Berger-Wolf
We propose a novel method, based on concepts from expander graphs, to sample communities in networks. We show that our sampling method, unlike previous techniques, produces subgraphs representative of community structure in the original network. These generated subgraphs may be viewed as stratified samples in that they consist of members from most or all communities in the network. Using samples produced by our method, we show that the problem of community detection may be recast into a case of statistical relational learning. We empirically evaluate our approach against several real-world datasets and demonstrate that our sampling method can effectively be used to infer and approximate community affiliation in the larger network.
Keywords: clustering, community detection, complex networks, graphs, sampling, social networks
Fast and parallel webpage layout BIBAKFull-Text 711-720
  Leo A. Meyerovich; Rastislav Bodik
The web browser is a CPU-intensive program. Especially on mobile devices, webpages load too slowly, expending significant time in processing a document's appearance. Due to power constraints, most hardware-driven speedups will come in the form of parallel architectures. This is also true of mobile devices such as phones and e-books. In this paper, we introduce new algorithms for CSS selector matching, layout solving, and font rendering, which represent key components for a fast layout engine. Evaluation on popular sites shows speedups as high as 80x. We also formulate the layout problem with attribute grammars, enabling us to not only parallelize our algorithm but prove that it computes in O(log) time and without reflow.
Keywords: attribute grammar, box model, css, font, html, layout, mobile, multicore, selector
Object views: fine-grained sharing in browsers BIBAKFull-Text 721-730
  Leo A. Meyerovich; Adrienne Porter Felt; Mark S. Miller
Browsers do not currently support the secure sharing of JavaScript objects between principals. We present this problem as the need for object views, which are consistent and controllable versions of objects. Multiple views can be made for the same object and customized for the recipients. We implement object views with a JavaScript library that wraps shared objects and interposes on all access attempts. The security challenge is to fully mediate access to objects shared through a view and prevent privilege escalation. We discuss how object views can be deployed in two settings: same-origin sharing with rewriting-based JavaScript isolation systems like Google Caja, and inter-origin sharing between browser frames over a message-passing channel.
   To facilitate simple document sharing, we build a policy system for declaratively defining policies for document object views. Notably, our document policy system makes it possible to hide elements without breaking document structure invariants. Developers can control the fine-grained behavior of object views with an aspect system that accepts programmatic policies.
Keywords: aspect, browser, ecmascript, javascript, membrane, reference monitor, remote object, same origin policy, web security, wrapper
Protocol-aware matching of web service interfaces for adapter development BIBAKFull-Text 731-740
  Hamid Reza Motahari Nezhad; Guang Yuan Xu; Boualem Benatallah
With the rapid growth in the number of online Web services, the problem of service adaptation has received significant attention. In matching and adaptation, the functional description of services including interface and data as well as behavioral descriptions are important. Existing work on matching and adaptation focuses only on one aspect.
   In this paper, we present a semi-automated matching approach that considers both service descriptions. We introduce two protocol-aware service interface matching algorithms, i.e. depth-based interface matching and iterative reference-based interface matching. These algorithms refine the results of interface matching by incorporating the ordering constraints imposed by business protocol definitions on service operations. We have implemented a prototype and performed experiments using the specification of synthetic and real-world Web services. Experiments show that the proposed approaches lead to a significant improvement in the quality of matching between services.
Keywords: adaptation, business process, web services
Volunteer computing: a model of the factors determining contribution to community-based scientific research BIBAKFull-Text 741-750
  Oded Nov; David Anderson; Ofer Arazy
Volunteer computing is a powerful way to harness distributed resources to perform large-scale tasks, similarly to other types of community-based initiatives. Volunteer computing is based on two pillars: the first is computational -- allocating and managing large computing tasks; the second is participative -- making large numbers of individuals volunteer their computer resources to a project. While the computational aspects of volunteer computing received much research attention, the participative aspect remains largely unexplored. In this study we aim to address this gap: by drawing on social psychology and online communities research, we develop and test a three-dimensional model of the factors determining volunteer computing users' contribution. We investigate one of the largest volunteer computing projects -- SETI@home -- by linking survey data about contributors' motivations to their activity logs. Our findings highlight the differences between volunteer computing and other forms of community-based projects, and reveal the intricate relationship between individual motivations, social affiliation, tenure in the project, and resource contribution. Implications for research and practice are discussed.
Keywords: boinc, citizen science, crowdsourcing, motivations, online communities, seti@home, volunteer computing
Cross-domain sentiment classification via spectral feature alignment BIBAKFull-Text 751-760
  Sinno Jialin Pan; Xiaochuan Ni; Jian-Tao Sun; Qiang Yang; Zheng Chen
Sentiment classification aims to automatically predict sentiment polarity (e.g., positive or negative) of users publishing sentiment data (e.g., reviews, blogs). Although traditional classification algorithms can be used to train sentiment classifiers from manually labeled text data, the labeling work can be time-consuming and expensive. Meanwhile, users often use some different words when they express sentiment in different domains. If we directly apply a classifier trained in one domain to other domains, the performance will be very low due to the differences between these domains. In this work, we develop a general solution to sentiment classification when we do not have any labels in a target domain but have some labeled data in a different domain, regarded as source domain. In this cross-domain sentiment classification setting, to bridge the gap between the domains, we propose a spectral feature alignment (SFA) algorithm to align domain-specific words from different domains into unified clusters, with the help of domain-independent words as a bridge. In this way, the clusters can be used to reduce the gap between domain-specific words of the two domains, which can be used to train sentiment classifiers in the target domain accurately. Compared to previous approaches, SFA can discover a robust representation for cross-domain data by fully exploiting the relationship between the domain-specific and domain-independent words via simultaneously co-clustering them in a common latent space. We perform extensive experiments on two real world datasets, and demonstrate that SFA significantly outperforms previous approaches to cross-domain sentiment classification.
Keywords: domain adaptation, feature alignment, opinion mining, sentiment classification, transfer learning
DSNotify: handling broken links in the web of data BIBAKFull-Text 761-770
  Niko P. Popitsch; Bernhard Haslhofer
The Web of Data has emerged as a way of exposing structured linked data on the Web. It builds on the central building blocks of the Web (URIs, HTTP) and benefits from its simplicity and wide-spread adoption. It does, however, also inherit the unresolved issues such as the broken link problem. Broken links constitute a major challenge for actors consuming Linked Data as they require them to deal with reduced accessibility of data. We believe that the broken link problem is a major threat to the whole Web of Data idea and that both Linked Data consumers and providers will require solutions that deal with this problem. Since no general solutions for fixing such links in the Web of Data have emerged, we make three contributions into this direction: first, we provide a concise definition of the broken link problem and a comprehensive analysis of existing approaches. Second, we present DSNotify, a generic framework able to assist human and machine actors in fixing broken links. It uses heuristic feature comparison and employs a time-interval-based blocking technique for the underlying instance matching problem. Third, we derived benchmark datasets from knowledge bases such as DBpedia and evaluated the effectiveness of our approach with respect to the broken link problem. Our results show the feasibility of a time-interval-based blocking approach for systems that aim at detecting and fixing broken links in the Web of Data.
Keywords: blocking, broken links, instance matching, link integrity, linked data
Ad-hoc object retrieval in the web of data BIBAKFull-Text 771-780
  Jeffrey Pound; Peter Mika; Hugo Zaragoza
Semantic Search refers to a loose set of concepts, challenges and techniques having to do with harnessing the information of the growing Web of Data (WoD) for Web search. Here we propose a formal model of one specific semantic search task: ad-hoc object retrieval. We show that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines. We connect this task to the traditional ad-hoc document retrieval and discuss appropriate evaluation metrics. Finally, we carry out a realistic evaluation of this task in the context of a Web search application.
Keywords: evaluation, object retrieval, semantic search
Diversifying web search results BIBAKFull-Text 781-790
  Davood Rafiei; Krishna Bharat; Anand Shukla
Result diversity is a topic of great importance as more facets of queries are discovered and users expect to find their desired facets in the first page of the results. However, the underlying questions of how 'diversity' interplays with 'quality' and when preference should be given to one or both are not well-understood. In this work, we model the problem as expectation maximization and study the challenges of estimating the model parameters and reaching an equilibrium. One model parameter, for example, is correlations between pages which we estimate using textual contents of pages and click data (when available). We conduct experiments on diversifying randomly selected queries from a query log and the queries chosen from the disambiguation topics of Wikipedia. Our algorithm improves upon Google in terms of the diversity of random queries, retrieving 14% to 38% more aspects of queries in top 5, while maintaining a precision very close to Google. On a more selective set of queries that are expected to benefit from diversification, our algorithm improves upon Google in terms of precision and diversity of the results, and significantly outperforms another baseline system for result diversification.
Keywords: query ambiguity, result diversity, search diversity
A large-scale active learning system for topical categorization on the web BIBAKFull-Text 791-800
  Suju Rajan; Dragomir Yankov; Scott J. Gaffney; Adwait Ratnaparkhi
Many web applications such as ad matching systems, vertical search engines, and page categorization systems require the identification of a particular type or class of pages on the Web. The sheer number and diversity of the pages on the Web, however, makes the problem of obtaining a good sample of the class of interest hard. In this paper, we describe a successfully deployed end-to-end system that starts from a biased training sample and makes use of several state-of-the-art machine learning algorithms working in tandem, including a powerful active learning component, in order to achieve a good classification system. The system is evaluated on traffic from a real-world ad-matching platform and is shown to achieve high categorization effectiveness with a significant reduction in editorial effort and labeling time.
Keywords: active learning, classification with imbalanced datasets, svms, web scale performance evaluation
Distributing private data in challenged network environments BIBAKFull-Text 801-810
  Azarias Reda; Brian Noble; Yidnekachew Haile
Developing countries face significant challenges in network access, making even simple network tasks unpleasant. Many standard techniques -- caching and predictive prefetching -- help somewhat, but provide little or no assistance for personal data that is needed only by a single user. Sulula addresses this problem by leveraging the near-ubiquity of cellular phones able to send and receive simple SMS messages. Rather than visit a kiosk and fetch data on demand -- a tiresome process at best -- users request a future visit. If capacity exists, the kiosk can schedule secure retrieval of that user's data, saving time and more efficiently utilizing the kiosk's limited connectivity. When the user arrives at a provisioned kiosk, she need only obtain the session key on-demand, and thereafter has instant access. In addition, Sulula allows users to schedule data uploads. Experimental results show significant gains for the end user, saving tens of minutes of time for a typical email/news reading session. We also describe a small, ongoing deployment in-country for proof-of-concept, lessons learned from that experience, and provide a discussion on pricing and marketplace issues that remain to be addressed to make the system viable for developing-world access.
Keywords: band-width, caching, developing regions, Ethiopia, latency, limited connectivity, personal data, prefetching, sms, www access
Factorizing personalized Markov chains for next-basket recommendation BIBAKFull-Text 811-820
  Steffen Rendle; Christoph Freudenthaler; Lars Schmidt-Thieme
Recommender systems are an important component of many websites. Two of the most popular approaches are based on matrix factorization (MF) and Markov chains (MC). MF methods learn the general taste of a user by factorizing the matrix over observed user-item preferences. On the other hand, MC methods model sequential behavior by learning a transition graph over items that is used to predict the next action based on the recent actions of a user. In this paper, we present a method bringing both approaches together. Our method is based on personalized transition graphs over underlying Markov chains. That means for each user an own transition matrix is learned -- thus in total the method uses a transition cube. As the observations for estimating the transitions are usually very limited, our method factorizes the transition cube with a pairwise interaction model which is a special case of the Tucker Decomposition. We show that our factorized personalized MC (FPMC) model subsumes both a common Markov chain and the normal matrix factorization model. For learning the model parameters, we introduce an adaption of the Bayesian Personalized Ranking (BPR) framework for sequential basket data. Empirically, we show that our FPMC model outperforms both the common matrix factorization and the unpersonalized MC model both learned with and without factorization.
Keywords: basket recommendation, Markov chain, matrix factorization
Sketcha: a captcha based on line drawings of 3D models BIBAKFull-Text 821-830
  Steven A. Ross; J. Alex Halderman; Adam Finkelstein
This paper introduces a captcha based on upright orientation of line drawings rendered from 3D models. The models are selected from a large database, and images are rendered from random viewpoints, affording many different drawings from a single 3D model. The captcha presents the user with a set of images, and the user must choose an upright orientation for each image. This task generally requires understanding of the semantic content of the image, which is believed to be difficult for automatic algorithms. We describe a process called covert filtering whereby the image database can be continually refreshed with drawings that are known to have a high success rate for humans, by inserting randomly into the captcha new images to be evaluated. Our analysis shows that covert filtering can ensure that captchas are likely to be solvable by humans while deterring attackers who wish to learn a portion of the database. We performed several user studies that evaluate how effectively people can solve the captcha. Comparing these results to an attack based on machine learning, we find that humans possess a substantial performance advantage over computers.
Keywords: 3D models, captcha, drawings, security
Unlocking the semantics of multimedia presentations in the web with the multimedia metadata ontology BIBAKFull-Text 831-840
  Carsten Saathoff; Ansgar Scherp
The semantics of rich multimedia presentations in the web such as SMIL, SVG, and Flash cannot or only to a very limited extend be understood by search engines today. This hampers the retrieval of such presentations and makes their archival and management a difficult task. Existing metadata models and metadata standards are either conceptually too narrow, focus on a specific media type only, cannot be used and combined together, or are not practically applicable for the semantic description of rich multimedia presentations.
   In this paper, we propose the Multimedia Metadata Ontology (M3O) for annotating rich, structured multimedia presentations. The M3O provides a generic modeling framework for representing sophisticated multimedia metadata. It allows for integrating the features provided by the existing metadata models and metadata standards. Our approach bases on Semantic Web technologies and can be easily integrated with multimedia formats such as the W3C standards SMIL and SVG. With the M3O, we unlock the semantics of rich multimedia presentations in the web by making the semantics machine-readable and machine-understandable. The M3O is used with our SemanticMM4U framework for the multi-channel generation of semantically-rich multimedia presentations.
Keywords: multimedia metadata, rich multimedia presentations, semantic annotation
Clustering query refinements by user intent BIBAKFull-Text 841-850
  Eldar Sadikov; Jayant Madhavan; Lu Wang; Alon Halevy
We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information relevant to the original user query. Our algorithm clusters refinements based on their likely underlying user intents by combining document click and session co-occurrence information. At its core, our algorithm operates by performing multiple random walks on a Markov graph that approximates user search behavior. A user study performed on top search engine queries shows that our clusters are rated better than corresponding clusters computed using approaches that use only document click or only sessions co-occurrence information.
Keywords: clustering, query refinements, random walks
Earthquake shakes Twitter users: real-time event detection by social sensors BIBAKFull-Text 851-860
  Takeshi Sakaki; Makoto Okazaki; Yutaka Matsuo
Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
Keywords: Twitter, earthquake, event detection, location estimation, social sensor
Measurement-calibrated graph models for social network experiments BIBAKFull-Text 861-870
  Alessandra Sala; Lili Cao; Christo Wilson; Robert Zablit; Haitao Zheng; Ben Y. Zhao
Access to realistic, complex graph datasets is critical to research on social networking systems and applications. Simulations on graph data provide critical evaluation of new systems and applications ranging from community detection to spam filtering and social web search. Due to the high time and resource costs of gathering real graph datasets through direct measurements, researchers are anonymizing and sharing a small number of valuable datasets with the community. However, performing experiments using shared real datasets faces three key disadvantages: concerns that graphs can be de-anonymized to reveal private information, increasing costs of distributing large datasets, and that a small number of available social graphs limits the statistical confidence in the results.
   The use of measurement-calibrated graph models is an attractive alternative to sharing datasets. Researchers can "fit" a graph model to a real social graph, extract a set of model parameters, and use them to generate multiple synthetic graphs statistically similar to the original graph. While numerous graph models have been proposed, it is unclear if they can produce synthetic graphs that accurately match the properties of the original graphs. In this paper, we explore the feasibility of measurement-calibrated synthetic graphs using six popular graph models and a variety of real social graphs gathered from the Facebook social network ranging from 30,000 to 3 million edges. We find that two models consistently produce synthetic graphs with common graph metric values similar to those of the original graphs. However, only one produces high fidelity results in our application-level benchmarks. While this shows that graph models can produce realistic synthetic graphs, it also highlights the fact that current graph metrics remain incomplete, and some applications expose graph properties that do not map to existing metrics.
Keywords: graph models, social networking
Monitoring algorithms for negative feedback systems BIBAKFull-Text 871-880
  Mark Sandler; S. Muthukrishnan
There are many online systems where millions of users post original content such as videos, reviews of items such as products, services and businesses, etc. While there are general rules for good behavior or even formal Terms of Service, there are still users who post content that is not suitable. Increasingly, online systems rely on other users who view the posted content to provide feedback.
   We study online systems where users report negative feedback, i.e., report abuse; these systems are quite distinct from much studied, traditional reputation systems that focus on eliciting popularity of content by various voting methods. The central problem that we study here is how to monitor the quality of negative feedback, that is, detect negative feedback which is incorrect, or perhaps even malicious. Systems address this problem by testing flags manually, which is an expensive operation. As a result, there is a tradeoff between the number of manual tests and the number of errors defined as the number of incorrect flags the monitoring system misses.
   Our contributions are as follows:
  • We initiate a systematic study of negative feedbacks systems. Our framework
       is general enough to be applicable for a variety of systems. In this
       framework, the number of errors the system admits is bounded over the worst
       case of adversarial users while simultaneously the system performs only
       small amount of manual testing for multitude of standard users who might
       still err while reporting.
  • Our main contribution is a randomized monitoring algorithm that we call
       Adaptive Probabilistic Testing (APT), that is simple to implement and has
       guarantees on expected number of errors. Even for adversarial users, the
       total expected error is bounded by "N over N flags for a given e < 0.
       Simultaneously, the number of tests performed by the algorithm is within a
       constant factor of the best possible algorithm for standard users.
  • Finally, we present empirical study of our algorithm that shows its
       performance on both synthetic data and real data accumulated from a variety
       of negative feedback systems at Google. Our study indicates that the
       algorithm performs better than the analysis above shows.
    Keywords: negative feedback systems, probabilistic analysis, user reputation
  • Exploiting query reformulations for web search result diversification BIBAKFull-Text 881-890
      Rodrygo L. T. Santos; Craig Macdonald; Iadh Ounis
    When a Web user's underlying information need is not clearly specified from the initial query, an effective approach is to diversify the results retrieved for this query. In this paper, we introduce a novel probabilistic framework for Web search result diversification, which explicitly accounts for the various aspects associated to an underspecified query. In particular, we diversify a document ranking by estimating how well a given document satisfies each uncovered aspect and the extent to which different aspects are satisfied by the ranking as a whole. We thoroughly evaluate our framework in the context of the diversity task of the TREC 2009 Web track. Moreover, we exploit query reformulations provided by three major Web search engines (WSEs) as a means to uncover different query aspects. The results attest the effectiveness of our framework when compared to state-of-the-art diversification approaches in the literature. Additionally, by simulating an upper-bound query reformulation mechanism from official TREC data, we draw useful insights regarding the effectiveness of the query reformulations generated by the different WSEs in promoting diversity.
    Keywords: diversity, relevance, web search
    How useful are your comments?: analyzing and predicting youtube comments and comment ratings BIBAKFull-Text 891-900
      Stefan Siersdorfer; Sergiu Chelaru; Wolfgang Nejdl; Jose San Pedro
    An analysis of the social video sharing platform YouTube reveals a high amount of community feedback through comments for published videos as well as through meta ratings for these comments. In this paper, we present an in-depth study of commenting and comment rating behavior on a sample of more than 6 million comments on 67,000 YouTube videos for which we analyzed dependencies between comments, views, comment ratings and topic categories. In addition, we studied the influence of sentiment expressed in comments on the ratings for these comments using the SentiWordNet thesaurus, a lexical WordNet-based resource containing sentiment annotations. Finally, to predict community acceptance for comments not yet rated, we built different classifiers for the estimation of ratings for these comments. The results of our large-scale evaluations are promising and indicate that community feedback on already rated comments can help to filter new unrated comments or suggest particularly useful but still unrated comments.
    Keywords: YouTube, comment ratings, community feedback
    Optimal rare query suggestion with implicit user feedback BIBAKFull-Text 901-910
      Yang Song; Li-wei He
    Query suggestion has been an effective approach to help users narrow down to the information they need. However, most of existing studies focused on only popular/head queries. Since rare queries possess much less information (e.g., clicks) than popular queries in the query logs, it is much more difficult to efficiently suggest relevant queries to a rare query. In this paper, we propose an optimal rare query suggestion framework by leveraging implicit feedbacks from users in the query logs. Our model resembles the principle of pseudo-relevance feedback which assumes that top-returned results by search engines are relevant. However, we argue that the clicked URLs and skipped URLs contain different levels of information and thus should be treated differently. Hence, our framework optimally combines both the click and skip information from users and uses a random walk model to optimize the query correlation. Our model specifically optimizes two parameters: (1) the restarting (jumping) rate of random walk, and (2) the combination ratio of click and skip information. Unlike the Rocchio algorithm, our learning process does not involve the content of the URLs but simply leverages the click and skip counts in the query-URL bipartite graphs. Consequently, our model is capable of scaling up to the need of commercial search engines. Experimental results on one-month query logs from a large commercial search engine with over 40 million rare queries demonstrate the superiority of our framework, with statistical significance, over the traditional random walk models and pseudo-relevance feedback models.
    Keywords: pseudo-relevance feedback, query suggestion, random walk
    What are the most eye-catching and ear-catching features in the video?: implications for video summarization BIBAKFull-Text 911-920
      Yaxiao Song; Gary Marchionini; Chi Young Oh
    Video summarization is a mechanism for generating short summaries of the video to help people quickly make sense of the content of the video before downloading or seeking more detailed information. To produce reliable automatic video summarization algorithms, it is essential to first understand how human beings create video summaries with manual efforts. This paper focuses on a corpus of instructional documentary video, and seeks to improve automatic video summaries by understanding what features in the video catch the eyes and ears of human assessors, and using these findings to inform automatic summarization algorithms. The paper contributes a thorough and valuable methodology for performing automatic video summarization, and the methodology can be extended to inform summarization of other video corpuses.
    Keywords: audio salience, video summarization, visual salience
    Reining in the web with content security policy BIBAKFull-Text 921-930
      Sid Stamm; Brandon Sterne; Gervase Markham
    The last three years have seen a dramatic increase in both awareness and exploitation of Web Application Vulnerabilities. 2008 and 2009 saw dozens of high-profile attacks against websites using Cross Site Scripting (XSS) and Cross Site Request Forgery (CSRF) for the purposes of information stealing, website defacement, malware planting, clickjacking, etc. While an ideal solution may be to develop web applications free from any exploitable vulnerabilities, real world security is usually provided in layers.
       We present content restrictions, and a content restrictions enforcement scheme called Content Security Policy (CSP), which intends to be one such layer. Content restrictions allow site designers or server administrators to specify how content interacts on their web sites-a security mechanism desperately needed by the untamed Web. These content restrictions rules are activated and enforced by supporting web browsers when a policy is provided for a site via HTTP, and we show how a system such as CSP can be effective to lock down sites and provide an early alert system for vulnerabilities on a web site. Our scheme is also easily deployed, which is made evident by our prototype implementation in Firefox and on the Mozilla Add-Ons web site.
    Keywords: content restrictions, http, security policy, web security
    Visualizing differences in web search algorithms using the expected weighted Hoeffding distance BIBAKFull-Text 931-940
      Mingxuan Sun; Guy Lebanon; Kevyn Collins-Thompson
    We introduce a new dissimilarity function for ranked lists, the expected weighted Hoeffding distance, that has several advantages over current dissimilarity measures for ranked search results. First, it is easily customized for users who pay varying degrees of attention to websites at different ranks. Second, unlike existing measures such as generalized Kendall's tau, it is based on a true metric, preserving meaningful embeddings when visualization techniques like multi-dimensional scaling are applied. Third, our measure can effectively handle partial or missing rank information while retaining a probabilistic interpretation. Finally, the measure can be made computationally tractable and we give a highly efficient algorithm for computing it. We then apply our new metric with multi-dimensional scaling to visualize and explore relationships between the result sets from different search engines, showing how the weighted Hoeffding distance can distinguish important differences in search engine behavior that are not apparent with other rank-distance metrics. Such visualizations are highly effective at summarizing and analyzing insights on which search engines to use, what search strategies users can employ, and how search results evolve over time. We demonstrate our techniques using a collection of popular search engines, a representative set of queries, and frequently used query manipulation methods.
    Keywords: expected weighted Hoeffding distance, ranking, search algorithm dissimilarity
    Alhambra: a system for creating, enforcing, and testing browser security policies BIBAKFull-Text 941-950
      Shuo Tang; Chris Grier; Onur Aciicmez; Samuel T. King
    Alhambra is a browser-based system designed to enforce and test web browser security policies. At the core of Alhambra is a policy-enhanced browser supporting fine-grain security policies that restrict web page contents and execution. Alhambra requires no server-side modifications or additions to the web application. Policies can restrict the construction of the document as well as the execution of JavaScript using access control rules and a taint-tracking engine. Using the Alhambra browser, we present two security policies that we have built using our architecture, both designed to prevent cross-site scripting. The first policy uses a taint-tracking engine to prevent cross-site scripting attacks that exploit bugs in the client-side of the web applications. The second one uses browsing history to create policies that restrict the contents of documents and prevent the inclusion of malicious content.
       Using Alhambra we analyze the impact of policies on the compatibility of web pages. To test compatibility, Alhambra supports revisiting user-generated browsing sessions and comparing multiple security policies in parallel to quickly and automatically evaluate security policies. To compare security policies for identical pages we have also developed useful comparison metrics that quantify differences between identical pages executed with different security policies. Not only do we show that our policies are effective with minimal compatibility cost, we also demonstrate that Alhambra can enforce strong security policies and provide quantitative evaluation of the differences introduced by security policies.
    Keywords: cross-site scripting, web browser, web security
    Atomate it! end-user context-sensitive automation using heterogeneous information sources on the web BIBAKFull-Text 951-960
      Max Van Kleek; Brennan Moore; David R. Karger; Paul André; m.c. schraefel
    The transition of personal information management (PIM) tools off the desktop to the Web presents an opportunity to augment these tools with capabilities provided by the wealth of real-time information readily available. In this paper, we describe a next-generation personal information assistance engine that lets end-users delegate to it various simple context- and activity-reactive tasks and reminders. Our system, Atomate, treats RSS/ATOM feeds from social networking and life-tracking sites as sensor streams, integrating information from such feeds into a simple unified RDF world model representing people, places and things and their timevarying states and activities. Combined with other information sources on the web, including the user's online calendar, web-based e-mail client, news feeds and messaging services, Atomate can be made to automatically carry out a variety of simple tasks for the user, ranging from context-aware filtering and messaging, to sharing and social coordination actions. Atomate's open architecture and world model easily accommodate new information sources and actions via the addition of feeds and web services. To make routine use of the system easy for non-programmers, Atomate provides a constrained-input natural language interface (CNLI) for behavior specification, and a direct-manipulation interface for inspecting and updating its world model.
    Keywords: context-aware computing, end-user programming, reactive automation, web mash-ups
    Faceted exploration of image search results BIBAKFull-Text 961-970
      Roelof van Zwol; Börkur Sigurbjornsson; Ramu Adapala; Lluis Garcia Pueyo; Abhinav Katiyar; Kaushal Kurapati; Mridul Muralidharan; Sudar Muthu; Vanessa Murdock; Polly Ng; Anand Ramani; Anuj Sahai; Sriram Thiru Sathish; Hari Vasudev; Upendra Vuyyuru
    This paper describes MediaFaces, a system that enables faceted exploration of media collections. The system processes semi-structured information sources to extract objects and facets, e.g. the relationships between two objects. Next, we rank the facets based on a statistical analysis of image search query logs, and the tagging behaviour of users annotating photos in Flickr. For a given object of interest, we can then retrieve the top-k most relevant facets and present them to the user. The system is currently deployed in production by Yahoo!'s image search engine1. We present the system architecture, its main components, and the application of the system as part of the image search experience.
    Keywords: faceted browsing, image search, object extraction, ranking
    CETR: content extraction via tag ratios BIBAKFull-Text 971-980
      Tim Weninger; William H. Hsu; Jiawei Han
    We present Content Extraction via Tag Ratios (CETR) -- a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
    Keywords: content extraction, tag ratio, world wide web
    Modeling relationship strength in online social networks BIBAKFull-Text 981-990
      Rongjing Xiang; Jennifer Neville; Monica Rogati
    Previous work analyzing social networks has mainly focused on binary friendship relations. However, in online social networks the low cost of link formation can lead to networks with heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together). In this case, the binary friendship indicator provides only a coarse representation of relationship information. In this work, we develop an unsupervised model to estimate relationship strength from interaction activity (e.g., communication, tagging) and user similarity. More specifically, we formulate a link-based latent variable model, along with a coordinate ascent optimization procedure for the inference. We evaluate our approach on real-world data from Facebook and LinkedIn, showing that the estimated link weights result in higher autocorrelation and lead to improved classification accuracy.
    Keywords: homophily, randomization, social influence, social networks
    Automatic extraction of clickable structured web contents for name entity queries BIBAKFull-Text 991-1000
      Xiaoxin Yin; Wenzhao Tan; Xiao Li; Yi-Chin Tu
    Today the major web search engines answer queries by showing ten result snippets, which need to be inspected by users for identifying relevant results. In this paper we investigate how to extract structured information from the web, in order to directly answer queries by showing the contents being searched for. We treat users' search trails (i.e., post-search browsing behaviors) as implicit labels on the relevance between web contents and user queries. Based on such labels we use information extraction approach to build wrappers and extract structured information. An important observation is that many web sites contain pages for name entities of certain categories (e.g., AOL Music contains a page for each musician), and these pages have the same format. This makes it possible to build wrappers from a small amount of implicit labels, and use them to extract structured information from many web pages for different name entities. We propose STRUCLICK, a fully automated system for extracting structured information for queries containing name entities of certain categories. It can identify important web sites from web search logs, build wrappers from users' search trails, filter out bad wrappers built from random user clicks, and combine structured information from different web sites for each query. Comparing with existing approaches on information extraction, STRUCLICK can assign semantics to extracted data without any human labeling or supervision. We perform comprehensive experiments, which show STRUCLICK achieves high accuracy and good scalability.
    Keywords: information extraction, web search
    Building taxonomy of web search intents for name entity queries BIBAKFull-Text 1001-1010
      Xiaoxin Yin; Sarthak Shah
    A significant portion of web search queries are name entity queries. The major search engines have been exploring various ways to provide better user experiences for name entity queries, such as showing "search tasks" (Bing search) and showing direct answers (Yahoo!, Kosmix). In order to provide the search tasks or direct answers that can satisfy most popular user intents, we need to capture these intents, together with relationships between them.
       In this paper we propose an approach for building a hierarchical taxonomy of the generic search intents for a class of name entities (e.g., musicians or cities). The proposed approach can find phrases representing generic intents from user queries, and organize these phrases into a tree, so that phrases indicating equivalent or similar meanings are on the same node, and the parent-child relationships of tree nodes represent the relationships between search intents and their sub-intents. Three different methods are proposed for tree building, which are based on directed maximum spanning tree, hierarchical agglomerative clustering, and pachinko allocation model. Our approaches are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia. With the evaluation by human judges (via Mechanical Turk), it is shown that our approaches can build trees of phrases that capture the relationships between important search intents.
    Keywords: query clustering, web search intent
    Beyond position bias: examining result attractiveness as a source of presentation bias in clickthrough data BIBAKFull-Text 1011-1018
      Yisong Yue; Rajan Patel; Hein Roehrig
    Leveraging clickthrough data has become a popular approach for evaluating and optimizing information retrieval systems. Although data is plentiful, one must take care when interpreting clicks, since user behavior can be affected by various sources of presentation bias. While the issue of position bias in clickthrough data has been the topic of much study, other presentation bias effects have received comparatively little attention. For instance, since users must decide whether to click on a result based on its summary (e.g., the title, URL and abstract), one might expect clicks to favor "more attractive" results. In this paper, we examine result summary attractiveness as a potential source of presentation bias. This study distinguishes itself from prior work by aiming to detect systematic biases in click behavior due to attractive summaries inflating perceived relevance. Our experiments conducted on the Google web search engine show substantial evidence of presentation bias in clicks towards results with more attractive titles.
    Keywords: click logs, implicit feedback, presentation bias
    Statistical models of music-listening sessions in social media BIBAKFull-Text 1019-1028
      Elena Zheleva; John Guiver; Eduarda Mendes Rodrigues; Natasa Milić-Frayling
    User experience in social media involves rich interactions with the media content and other participants in the community. In order to support such communities, it is important to understand the factors that drive the users' engagement. In this paper we show how to define statistical models of different complexity to describe patterns of song listening in an online music community. First, we adapt the LDA model to capture music taste from listening activities across users and identify both the groups of songs associated with the specific taste and the groups of listeners who share the same taste. Second, we define a graphical model that takes into account listening sessions and captures the listening moods of users in the community. Our session model leads to groups of songs and groups of listeners with similar behavior across listening sessions and enables faster inference when compared to the LDA model. Our experiments with the data from an online media site demonstrate that the session model is better in terms of the perplexity compared to two other models: the LDA-based taste model that does not incorporate cross-session information and a baseline model that does not use latent groupings of songs.
    Keywords: collaborative filtering, graphical models, mood, music, recommendations, sessions, social media, taste
    Collaborative location and activity recommendations with GPS history data BIBAKFull-Text 1029-1038
      Vincent W. Zheng; Yu Zheng; Xing Xie; Qiang Yang
    With the increasing popularity of location-based services, such as tour guide and location-based social network, we now have accumulated many location data on the Web. In this paper, we show that, by using the location data based on GPS and users' comments at various locations, we can discover interesting locations and possible activities that can be performed there for recommendations. Our research is highlighted in the following location-related queries in our daily life: 1) if we want to do something such as sightseeing or food-hunting in a large city such as Beijing, where should we go? 2) If we have already visited some places such as the Bird's Nest building in Beijing's Olympic park, what else can we do there? By using our system, for the first question, we can recommend her to visit a list of interesting locations such as Tiananmen Square, Bird's Nest, etc. For the second question, if the user visits Bird's Nest, we can recommend her to not only do sightseeing but also to experience its outdoor exercise facilities or try some nice food nearby. To achieve this goal, we first model the users' location and activity histories that we take as input. We then mine knowledge, such as the location features and activity-activity correlations from the geographical databases and the Web, to gather additional inputs. Finally, we apply a collective matrix factorization method to mine interesting locations and activities, and use them to recommend to the users where they can visit if they want to perform some specific activities and what they can do if they visit some specific places. We empirically evaluated our system using a large GPS dataset collected by 162 users over a period of 2.5 years in the real-world. We extensively evaluated our system and showed that our system can outperform several state-of-the-art baselines.
    Keywords: collaborative filtering, location and activity recommendations
    Measurement and analysis of an online content voting network: a case study of Digg BIBAKFull-Text 1039-1048
      Yingwu Zhu
    In online content voting networks, aggregate user activities (e.g., submitting and rating content) make high-quality content thrive through the unprecedented scale, high dynamics and divergent quality of user generated content (UGC). To better understand the nature and impact of online content voting networks, we have analyzed Digg, a popular online social news aggregator and rating website. Based on a large amount of data collected, we provide an in-depth study of Digg. We study structural properties of Digg social network, revealing some strikingly distinct properties such as low link symmetry and the power-law distribution of node outdegree with truncated tails. We explore impact of the social network on user digging activities, and investigate the issues of content promotion, content filtering, vote spam and content censorship, which are inherent to content rating networks. We also provide insight into design of content promotion algorithms and recommendation-assisted content discovery. Overall, we believe that the results presented in this paper are crucial in understanding online content rating networks.
    Keywords: content filtering, content promotion, social networks

    WWW posters

    What you see is what you search: adaptive visual search framework for the web BIBAKFull-Text 1049-1050
      Jae-wook Ahn; Peter Brusilovsky
    Information retrieval is one of the most popular information access methods for overcoming the information overload problem of the Web. However, its interaction model is still utilizing the old text-based ranked lists and static interaction algorithm. In this paper, we introduce our adaptive visualization approach for searching the Web, which we call Adaptive VIBE. It is an extended version of a reference point-based spatial visualization algorithm, and is designed to serve as a user interaction module for a personalized search system. Personalized search can incorporate dynamic user interests and different contexts, improving search results. When it is combined with adaptive visualization, it can encourage users to become involved in the search process more actively by exploring the information space and learning new facts for effective searching. In this paper, we introduce the rationale and functions of our adaptive visualization approach and discuss the approaches' potential to create a better search environment for the Web.
    Keywords: adaptive visualization, exploratory search, personalized search, user modeling, vibe
    RESTler: crawling RESTful services BIBAKFull-Text 1051-1052
      Rosa Alarcón; Erik Wilde
    Service descriptions allow designers to document, understand, and use services, creating new useful and complex services with aggregated business value. Unlike RPC-based services, REST characteristics require a different approach to service description. We present the Resource Linking Language (ReLL) that introduces the concepts of media types, resource types, and link types as first class citizens for a service description. A proof of concept, a crawler called RESTler that crawls RESTful services based on ReLL descriptions, is also presented.
    Keywords: crawling, rest, soa, web services
    Intelligent ad resizing BIBKFull-Text 1053-1054
      Anthony P. Badali; Parham Aarabi; Ron Appel
    Keywords: image processing, image resizing, internet monetization
    SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement BIBAKFull-Text 1055-1056
      Raju Balakrishnan; Subbarao Kambhampati
    We consider the problem of deep web source selection and argue that existing source selection methods are inadequate as they are based on local similarity assessment. Specically, they fail to account for the fact that sources can vary in trustworthiness and individual results can vary in importance. In response, we formulate a global measure to calculate relevance and trustworthiness of a source based on agreement between the answers provided by different sources. Agreement is modeled as a graph with sources at the vertices. On this agreement graph, source quality scores -- namely SourceRank -- are calculated as the stationary visit probability of a weighted random walk. Our experiments on online databases and 675 book sources from Google Base show that SourceRank improves relevance of the results by 25-40% over existing methods and Google Base ranking. SourceRank also reduces linearly with the corruption levels of the sources.
    Keywords: deep web, source selection, source trust, web databases
    Talking about data: sharing richly structured information through blogs and wikis BIBAKFull-Text 1057-1058
      Edward Benson; Adam Marcus; Fabian Howahl; David Karger
    The web has dramatically enhanced people's ability to communicate ideas, knowledge, and opinions. But the authoring tools that most people understand, blogs and wikis, primarily guide users toward authoring text. In this work, we show that substantial gains in expressivity and communication would accrue if people could easily share richly structured information in meaningful visualizations. We then describe several extensions we have created for blogs and wikis that enable users to publish, share, and aggregate such structured information using the same workflows they apply to text. In particular, we aim to preserve those attributes that make blogs and wikis so effective: one-click access to the information, one-click publishing of content, natural authoring interfaces, and the ability to easily copy-and-paste information and visualizations from other sources.
    Keywords: blogs, linked data, visualization, wikis
    Privacy in dynamic social networks BIBAKFull-Text 1059-1060
      Smriti Bhagat; Graham Cormode; Balachander Krishnamurthy; Divesh Srivastava
    Anonymization of social networks before they are published or shared has become an important research question. Recent work on anonymizing social networks has looked at privacy preserving techniques for publishing a single instance of the network. However, social networks evolve and a single instance is inadequate for analyzing the evolution of the social network or for performing any longitudinal data analysis. We study the problem of repeatedly publishing social network data as the network evolves, while preserving privacy of users. Publishing multiple instances of the same network independently has privacy risks, since stitching the information together may allow an adversary to identify users in the networks.
       We propose methods to anonymize a dynamic network such that the privacy of users is preserved when new nodes and edges are added to the published network. These methods make use of link prediction algorithms to model the evolution of the social network. Using this predicted graph to perform group-based anonymization, the loss in privacy caused by new edges can be reduced. We evaluate the privacy loss on publishing multiple social network instances using our methods.
    Keywords: dynamic networks, privacy, re-publication
    Finding algorithms in scientific articles BIBAKFull-Text 1061-1062
      Sumit Bhatia; Prasenjit Mitra; C. Lee Giles
    Algorithms are an integral part of computer science literature. However, none of the current search engines offer specialized algorithm search facility. We describe a vertical search engine that identifies the algorithms present in documents and extracts and indexes the related metadata and textual description of the identified algorithms. This algorithm specific information is then utilized for algorithm ranking in response to user queries. Experimental results show the superiority of our system on other popular search engines.
    Keywords: algorithms, pseudocodes, vertical search engines
    Exploiting information redundancy to wring out structured data from the web BIBAKFull-Text 1063-1064
      Lorenzo Blanco; Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti
    A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.
    Keywords: data extraction, data integration, wrapper generation
    Caching search engine results over incremental indices BIBAKFull-Text 1065-1066
      Roi Blanco; Edward Bortnikov; Flavio Junqueira; Ronny Lempel; Luca Telloli; Hugo Zaragoza
    A Web search engine must update its index periodically to incorporate changes to the Web, and we argue in this work that index updates fundamentally impact the design of search engine result caches. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. To enable efficient invalidation of cached results, we propose a framework for developing invalidation predictors and some concrete predictors. Evaluation using Wikipedia documents and a query log from Yahoo! shows that selective invalidation of cached search results can lower the number of query re-evaluations by as much as 30% compared to a baseline time-to-live scheme, while returning results of similar freshness.
    Keywords: cache, real-time indexing, search engine, search results
    Visual structure-based web page clustering and retrieval BIBAKFull-Text 1067-1068
      Paul Bohunsky; Wolfgang Gatterbauer
    Clustering and retrieval of web pages dominantly relies on analyzing either the content of individual web pages or the link structure between them. Some literature also suggests to use the structure of web pages, notably the structure of its DOM tree. However, little work considers the visual structure of web pages for clustering. In this paper (i) we motivate visual structure-based web page clustering and retrieval for a number of applications, (ii) we formalize a visual box model-based representation of web pages that supports new metrics of visual similarity, and (iii) we report on our current work on evaluating human perception of visual similarity of web pages and applying the learned visual similarity features to web page clustering and retrieval.
    Keywords: similarity measures, web page clustering, web page representation, web page retrieval
    Efficient resource allocation and power saving in multi-tiered systems BIBAKFull-Text 1069-1070
      Andrew Caniff; Lei Lu; Ningfang Mi; Ludmila Cherkasova; Evgenia Smirni
    In this paper, we present Fastrack, a parameter-free algorithm for dynamic resource provisioning that uses simple statistics to promptly distill information about changes in workload burstiness. This information, coupled with the application's end-to-end response times and system bottleneck characteristics, guide resource allocation that shows to be very effective under a broad variety of burstiness profiles and bottleneck scenarios.
    Keywords: burstiness, multi-tiered systems, resource allocation
    RankCompete: simultaneous ranking and clustering of web photos BIBAKFull-Text 1071-1072
      Liangliang Cao; Andrey Del Pozo; Xin Jin; Jiebo Luo; Jiawei Han; Thomas S. Huang
    With the explosive growth of digital cameras and online media, it has become crucial to design efficient methods that help users browse and search large image collections. The recent VisualRank algorithm [4] employs visual similarity to represent the link structure in a graph so that the classic PageRank algorithm can be applied to select the most relevant images. However, measuring visual similarity is difficult when there exist diversified semantics in the image collection, and the results from VisualRank cannot supply good visual summarization with diversity. This paper proposes to rank the images in a structural fashion, which aims to discover the diverse structure embedded in photo collections, and rank the images according to their similarity among local neighborhoods instead of across the entire photo collection. We design a novel algorithm named RankCompete, which generalizes the PageRank algorithm for the task of simultaneous ranking and clustering. The experimental results show that RankCompete outperforms VisualRank and provides an efficient but effective tool for organizing web photos.
    Keywords: image ranking, image summarization, pagerank
    Study language models with specific user goals BIBAKFull-Text 1073-1074
      Rongwei Cen; Yiqun Liu; Min Zhang; Liyun Ru; Shaoping Ma
    Under different language contexts, people choose different terms or phrases to express their feelings and opinions. When a user is writing a paper or chatting with a friend, he/she applies a specific language model corresponding to the underlying goal. This paper presents a log-based study of analyzing the language models with specific goals. We exhibit the statistical information of terms and software programs, propose some methods to estimate the divergence of language models with specific user goals and measure the discrimination of these models. Experimental results show that the language models with different user goals have large divergence and different discrimination. These study conclusions can be applied to understand user needs and improve Human-Computer Interaction (HCI).
    Keywords: language model, log analysis, user goal
    Navigational complexity in web interactions BIBAKFull-Text 1075-1076
      Praphul Chandra; Geetha Manjunath
    As the web grows in size, interfaces & interactions across websites diverge -- for differentiation and arguably for a better user experience. However, this size & diversity is also a cognitive load for the user who has to learn a new user interface for every new website she visits. Several studies have confirmed the importance of well designed websites. In this paper, we propose a method for quantitative evaluation of the navigational complexity of user interactions on the web. Our approach of quantifying interaction complexity exploits the modeling of the web as a graph and uses the information theoretic definition of complexity. It enables us to measure the navigational complexity of web interaction in bits. Our approach is structural in nature and can be applied to both traditional paradigm of web interaction (browsing) and to emerging paradigms of web interaction like web widgets.
    Keywords: complexity, graph theory, hypertext, user interaction, widgets
    Transfer learning for behavioral targeting BIBAKFull-Text 1077-1078
      Tianqi Chen; Jun Yan; Guirong Xue; Zheng Chen
    Recently, Behavioral Targeting (BT) is attracting much attention from both industry and academia due to its rapid growth in online advertising market. Though a basic assumption of BT, which is, the users who share similar Web browsing behaviors will have similar preference over ads, has been empirically verified, we argue that the users' ad click preference and Web browsing behavior are not reflecting the same user intent though they are correlated. In this paper, we propose to formulate BT as a transfer learning problem. We treat the users' preference over ads and Web browsing behaviors as two different user behavioral domains and propose to utilize transfer learning strategy across these two user behavioral domains to segment users for BT ads delivery. We show that some classical BT solutions could be formulated in transfer learning view. As an example, we propose to leverage translated learning, which is a recent proposed transfer learning algorithm, to benefit the BT ads delivery. Experimental results on real ad click data show that, BT user segmentation by the approach of transfer learning can outperform the classical user segmentation strategies for larger than 20% in terms of smoothed ad Click Through Rate (CTR).
    Keywords: behavioral targeting, transfer learning, user segmentation
    Context-oriented web video tag recommendation BIBAKFull-Text 1079-1080
      Zhineng Chen; Juan Cao; YiCheng Song; Junbo Guo; Yongdong Zhang; Jintao Li
    Tag recommendation is a common way to enrich the textual annotation of multimedia contents. However, state-of-the-art recommendation methods are built upon the pair-wised tag relevance, which hardly capture the context of the web video, i.e., when who are doing what at where. In this paper we propose the context-oriented tag recommendation (CtextR) approach, which expands tags for web videos under the context-consistent constraint. Given a web video, CtextR first collects the multi-form WWW resources describing the same event with the video, which produce an informative and consistent context; and then, the tag recommendation is conducted based on the obtained context. Experiments on an 80,031 web video collection show CtextR recommends various relevant tags to web videos. Moreover, the enriched tags improve the performance of web video categorization.
    Keywords: context-oriented, social tagging, tag recommendation, web video
    Web-scale knowledge extraction from semi-structured tables BIBAKFull-Text 1081-1082
      Eric Crestan; Patrick Pantel
    A wealth of knowledge is encoded in the form of tables on the World Wide Web. We propose a classification algorithm and a rich feature set for automatically recognizing layout tables and attribute/value tables. We report the frequencies of these table types over a large analysis of the Web and propose open challenges for extracting from attribute/value tables semantic triples (knowledge). We then describe a solution to a key problem in extracting semantic triples: protagonist detection, i.e., finding the subject of the table that often is not present in the table itself. In 79% of our Web tables, our method finds the correct protagonist in its top three returned candidates.
    Keywords: classification, information extraction, structured data, web tables
    Constructing travel itineraries from tagged geo-temporal breadcrumbs BIBAKFull-Text 1083-1084
      Munmun De Choudhury; Moran Feldman; Sihem Amer-Yahia; Nadav Golbandi; Ronny Lempel; Cong Yu
    Vacation planning is a frequent laborious task which requires skilled interaction with a multitude of resources. This paper develops an end-to-end approach for constructing intra-city travel itineraries automatically by tapping a latent source reflecting geo-temporal breadcrumbs left by millions of tourists. In particular, the popular rich media sharing site, Flickr, allows photos to be stamped by the date and time of when they were taken, and be mapped to Points Of Interest (POIs) by latitude-longitude information as well as semantic metadata (e.g., tags) that describe them.
       Our extensive user study on a "crowd-sourcing" marketplace (Amazon Mechanical Turk), indicates that high quality itineraries can be automatically constructed from Flickr data, when compared against popular professionally generated bus tours.
    Keywords: flickr, geo-tags, mechanical turk, media applications, orienteering problem, rich media, travel itinerary
    How much is your personal recommendation worth? BIBAKFull-Text 1085-1086
      Paul Dütting; Monika Henzinger; Ingmar Weber
    Suppose you buy a new laptop and, simply because you like it so much, you recommend it to friends, encouraging them to purchase it as well. What would be an adequate price for the vendor of the laptop to pay for your recommendation?
       Personal recommendations like this are of considerable commercial interest, but unlike in sponsored search auctions there can be no truthful prices. Despite this "lack of truthfulness" the vendor of the product might still decide to pay you for recommendation e.g. because she wants to (i) provide you with an additional incentive to actually recommend her or to (ii) increase your satisfaction and/or brand loyalty. This leads us to investigate a pricing scheme based on the Shapley value [5] that satisfies certain "axioms of fairness". We find that it is vulnerable to manipulations and show how to overcome these difficulties using the anonymity-proof Shapley value of [4].
    Keywords: pricing mechanisms, recommendations, shapley value
    A unified ontology-based web page model for improving accessibility BIBAKFull-Text 1087-1088
      Ruslan R. Fayzrahmanov; Max C. Göbel; Wolfgang Holzinger; Bernhard Krüpl; Robert Baumgartner
    Fast technological advancements and little compliance with accessibility standards by Web page authors pose serious obstacles to the Web experience of the blind user. We propose a unified Web document model that enables us to create a richer browsing experience and improved navigability for blind users. The model provides an integrated view on all aspects of a Web page and is leveraged to create a multi-axial user interface.
    Keywords: web accessibility
    Query parsing in mobile voice search BIBAKFull-Text 1089-1090
      Junlan Feng
    Mobile voice search is a fast-growing business. It provides users an easier way to search for information using voice from mobile devices. In this paper, we describe a statistical approach to query parsing to assure search effectiveness. The task is to segment speech recognition (ASR) output, including ASR 1-Best and ASR word lattices, into segments and associate each segment with needed concepts in the application. We train the models including concept prior probability, query segment generation probability, and query subject probability from application data such as query log and source database. We apply the learned models on a mobile business search application and demonstrate the robustness of query parsing to ASR errors.
    Keywords: mobile voice search, query parsing
    RDF compression: basic approaches BIBAKFull-Text 1091-1092
      Javier D. Fernández; Claudio Gutierrez; Miguel A. Martínez-Prieto
    This paper studies the compressibility of RDF data sets. We show that big RDF data sets are highly compressible due to the structure of RDF graphs (power law), organization of URIs and RDF syntax verbosity. We present basic approaches to compress RDF data and test them with three well-known, real-world RDF data sets.
    Keywords: compression, rdf, semantic web
    How google analytics and conventional cookie tracking techniques overestimate unique visitors BIBAKFull-Text 1093-1094
      Max I. Fomitchev
    We report the results of the analysis of website traffic logs and argue that both unique IP address and cookies vastly overestimate unique visitors, e.g. by factor of 8 in our studies. Google Analytics 'absolute unique visitors' measure is shown to produce similar 6x overestimation. To address the problem we present a new model for relating unique visitors to IP address or cookies.
    Keywords: cookies, traffic logs, unique visitors, web, web analytics
    Shout out: integrating news and reader comments BIBAKFull-Text 1095-1096
      Lisa M. Gandy; Nathan D. Nichols; Kristian J. Hammond
    A useful approach for enabling computers to automatically create new content is utilizing the text, media, and information already present on the World Wide Web. The newly created content is known as "machine-generated content". For example, a machine-generated content system may create a multimedia news show with two animated anchors presenting a news story; one anchor reads the news story with text taken from an existing news article, and the other anchor regularly interrupts with his or her own opinion about the story. In this paper, we present such a system, and describe in detail its strategy for autonomously extracting and selecting the opinions given by the second anchor.
    Keywords: cosine similarity, emotional valence detection, machine-generated content
    A framework for trust establishment and assessment on the web of data BIBAKFull-Text 1097-1098
      Qi Gao; Geert-Jan Houben
    With the enormous and still growing amount of data and user interaction on the Web, it becomes more and more necessary for data consumers to be able to assess the trustworthiness of data on the Web. In this paper, we present a framework for trust establishment and assessment on the Web of Data. Different from many approaches that build trust metrics within networks of people, we propose a model to represent the trust in concrete pieces of web data for a specific consumer, in which also the context is considered. Further more, we provide three strategies for trust assessment based on the principles of Linked Data, to overcome the shortcomings of the conventional web of documents: i.e. the lack of semantics and interlinking.
    Keywords: linked data, semantic web, trust, web of data
    On the high density of leadership nuclei in endorsement social networks BIBAKFull-Text 1099-1100
      Guillermo Garrido; Francesco Bonchi; Aristides Gionis
    In this paper we study the community structure of endorsement networks, i.e., social networks in which a directed edge u ⇒ v is asserting an action of support from user u to user v. Examples include scenarios in which a user u is favoring a photo, liking a post, or following the microblog of user v.
       Starting from the hypothesis that the footprint of a community in an endorsement network is a bipartite directed clique from a set of followers to a set of leaders, we apply frequent itemset mining techniques to discover such bicliques. Our analysis of real networks discovers that an interesting phenomenon is taking place: the leaders of a community are endorsing each other forming a very dense nucleus.
    Keywords: communities, endorsement social networks
    Measuring the web crawler ethics BIBAKFull-Text 1101-1102
      C. Lee Giles; Yang Sun; Isaac G. Councill
    Web crawlers are highly automated and seldom regulated manually. The diversity of crawler activities often leads to ethical problems such as spam and service attacks. In this research, quantitative models are proposed to measure the web crawler ethics based on their behaviors on web servers. We investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We propose a vector space model to represent crawler behavior and measure the ethics of web crawlers based on the behavior vectors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers' behaviors are ethical. However, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules. We also measure the ethics of big search engine crawlers in terms of return on investment. The results show that Google has a higher score than other search engines for a US website but has a lower score than Baidu for Chinese websites.
    Keywords: ethicality, privacy, robots.txt, web crawler ethics
    SNDocRank: document ranking based on social networks BIBAKFull-Text 1103-1104
      Liang Gou; Hung-Hsuan Chen; Jung-Hyun Kim; Xiaolong (Luke) Zhang; C. Lee Giles
    To improve the search results for socially-connect users, we propose a ranking framework, Social Network Document Rank (SNDocRank). This framework considers both document contents and the similarity between a searcher and document owners in a social network and uses a Multi-level Actor Similarity (MAS) algorithm to efficiently calculate user similarity in a social network. Our experiment results based on YouTube data show that compared with the tf-idf algorithm, the SNDocRank method returns more relevant documents of interest. Our findings suggest that in this framework, a searcher can improve search by joining larger social networks, having more friends, and connecting larger local communities in a social network.
    Keywords: information retrieval, multilevel actor similarity, ranking, social networks
    Exploiting content redundancy for web information extraction BIBAKFull-Text 1105-1106
      Pankaj Gulhane; Rajeev Rastogi; Srinivasan H. Sengamedu; Ashwin Tengli
    We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
    Keywords: content redundancy, information extraction
    Exploring searcher interactions for distinguishing types of commercial intent BIBAKFull-Text 1107-1108
      Qi Guo; Eugene Agichtein
    An improved understanding of the relationship between search intent, result quality, and searcher behavior is crucial for improving the effectiveness of web search. While recent progress in user behavior mining has been largely focused on aggregate server-side click logs, we present a new search behavior model that incorporates fine-grained user interactions with the search results. We show that mining these interactions, such as mouse movements and scrolling, can enable more effective detection of the user's search intent. Potential applications include automatic search evaluation, improving search ranking, result presentation, and search advertising. As a case study, we report results on distinguishing between "research" and "purchase" variants of commercial intent, that show our method to be more effective than the current state-of-the-art.
    Keywords: search behavior modeling, user intent inference
    Deep mashup: a description-based framework for lightweight integration of web contents BIBAKFull-Text 1109-1110
      Hao Han; Junxia Guo; Takehiro Tokuda
    In this paper, we present a description-based mashup framework for lightweight integration of Web contents. Our implementation shows that we can integrate not only the Web services but also the Web applications easily, even the Web contents dynamically generated by client-side scripts.
    Keywords: mashup, web application, web feed, web service
    Estimation of user interest in visited web page BIBAKFull-Text 1111-1112
      Michal Holub; Maria Bielikova
    Nowadays web portals contain large amount of information that is meant for various visitors or groups of visitors. To effectively navigate within the content the website needs to "know" its users in order to provide personalized content to them. We propose a method for automatic estimation of the user's interest in a web page he visits. This estimation is used for the recommendation of web portal pages (through presenting adaptive links) that the user might like. We conducted series of experiments in the domain of our faculty web portal to evaluate proposed approach.
    Keywords: adaptive navigation support, link recommendation, user behaviour, user interest estimation
    Hierarchical feature selection for ranking BIBAKFull-Text 1113-1114
      Guichun Hua; Min Zhang; Yiqun Liu; Shaoping Ma; Liyun Ru
    Ranking is an essential part of information retrieval (IR) tasks such as Web search. Nowadays there are hundreds of features for ranking. So learning to rank (LTR), an interdisciplinary field of IR and machine learning (ML), has attracted increasing attention. Those features used in the IR are not always independent from each other, hence the feature selection, an important issue in ML, should be paid attention to for LTR. However, the state-of-the-art LTR approaches merely analyze the connection among the features from the aspects of feature selection. In this paper, we propose a hierarchical feature selection strategy containing 2 phases for ranking and learn ranking functions. The experimental results show that ranking functions based on the selected feature subset significantly outperform the ones based on all features.
    Keywords: feature selection, learning to rank
    Selectivity estimation for SPARQL graph pattern BIBAKFull-Text 1115-1116
      Hai Huang; Chengfei Liu
    This paper focuses on selectivity estimation for SPARQL graph patterns, which is crucial to RDF query optimization. The previous work takes the join uniformity assumption, which would lead to high inaccurate estimation in the cases where properties in SPARQL graph patterns are correlated. We take into account the dependencies among properties in SPARQL graph patterns and propose a more accurate estimation model. We first focus on two common SPARQL graph patterns (star and chain patterns) and propose to use Bayesian network and chain histogram for estimating the selectivity of them. Then, for an arbitrary composite SPARQL graph pattern, we maximally combines the results of the star and chain patterns we have precomputed. The experiments show that our method outperforms existing approaches in accuracy.
    Keywords: rdf query processing, selectivity estimation
    Yet another paper ranking algorithm advocating recent publications BIBAKFull-Text 1117-1118
      Won-Seok Hwang; Soo-Min Chae; Sang-Wook Kim; Gyun Woo
    In this paper, we propose a new paper ranking algorithm that gives a high rank to papers which is credited by other authoritative papers or published in premier conferences or journals. Also, the proposed algorithm solves a problem that recent papers are rated poorly due to few citations.
    Keywords: citation analysis, impact factor, pagerank, ranking
    Enabling entity-based aggregators for web 2.0 data BIBAKFull-Text 1119-1120
      Ekaterini Ioannou; Claudia Niederée; Yannis Velegrakis
    Selecting and presenting content culled from multiple heterogeneous and physically distributed sources is a challenging task. The exponential growth of the web data in modern times has brought new requirements to such integration systems. Data is not any more produced by content providers alone, but also from regular users through the highly popular Web 2.0 social and semantic web applications. The plethora of the available web content increased its demand by regular users who could not any more wait the development of advanced integration tools. They wanted to be able to build in a short time their own specialized integration applications. Aggregators came to the risk of these users. They allowed them not only to combine distributed content, but also to process it in ways that generate new services available for further consumption.
       To cope with the heterogeneous data, the Linked Data initiative aims at the creation and exploitation of correspondences across data values. In this work, although we share the Linked Data community vision, we advocate that for the modern web, linking at the data value level is not enough. Aggregators should base their integration tasks on the concept of an entity, i.e., identifying whether different pieces of information correspond to the same real world entity, such as an event or a person. We describe our theory, system, and experimental results that illustrate the approach's effectiveness.
    Keywords: entity matching, linked data, semantic web
    Antourage: mining distance-constrained trips from flickr BIBAKFull-Text 1121-1122
      Saral Jain; Stephan Seufert; Srikanta Bedathur
    We study how to automatically extract tourist trips from large volumes of geo-tagged photographs. Working with more than 8 million of these photographs that are publicly available via photo-sharing communities such as Flickr and Panoramio, our goal is to satisfy the needs of a tourist who specifies a starting location (typically a hotel) together with a bounded travel distance and demands a tour that visits the popular sites along the way. Our system, named ANTOURAGE, solves this intractable problem using a novel adaptation of the max-min ant system (MMAS) meta-heuristic. Experiments using GPS metadata crawled from Flickr show that ANTOURAGE can generate high-quality tours.
    Keywords: flickr, geolocation, graph mining, photo collections, trip planning
    Analyzing collective view of future, time-referenced events on the web BIBAKFull-Text 1123-1124
      Adam Jatowt; Hideki Kawai; Kensuke Kanazawa; Katsumi Tanaka; Kazuo Kunieda; Keiji Yamada
    Humans have always desired to guess the future in order to adapt their behavior and maximize chances of success. In this paper, we conduct exploratory analysis of future-related information on the web. We focus on the future-related information which is grounded in time, that is, the information on forthcoming events whose expected occurrence dates are already known. We collect data by crawling search engine index and analyze collective view of future time-referenced events discussed on the web.
    Keywords: collective prediction, future-related information, opinion analysis
    Optimizing two stage bigram language models for IR BIBAKFull-Text 1125-1126
      Sara Javanmardi; Jianfeng Gao; Kuansan Wang
    Although higher order language models (LMs) have shown benefit of capturing word dependencies for Information retrieval (IR), the tuning of the increased number of free parameters remains a formidable engineering challenge. Consequently, in many real world retrieval systems, applying higher order LMs is an exception rather than the rule. In this study, we address the parameter tuning problem using a framework based on a linear ranking model in which different component models are incorporated as features. Using unigram and bigram LMs with 2 stage smoothing as examples, we show that our method leads to a bigram LM that outperforms significantly its unigram counterpart and the well-tuned BM25 model.
    Keywords: bigram LM, parameter tuning, retrieval model
    The utility of tweeted URLs for web search BIBAKFull-Text 1127-1128
      Vasileios Kandylas; Ali Dasdan
    Microblogging as introduced by Twitter is becoming a source of tracking real-time news. Although identifying the highest quality or most useful posts or tweets from Twitter for breaking news is still an open problem, major web search engines seem convinced of the value of such posts and have already started allocating part of their search results pages to them. In this paper, we study a different aspect of the problem for a search engine: instead of the value of the posts, we study the value of the (shortened) URLs referenced in these posts. Our results indicate that unlike frequently bookmarked URLs, which are generally of high quality, frequently tweeted URLs tend to fall in two opposite categories: they are either high in quality, or they are spam. Identifying the quality category of a URL is not trivial, but the combination of characteristics can reveal some trends.
    Keywords: content quality, microblogging, shortened urls, twitter
    Trend detection model BIBAKFull-Text 1129-1130
      Noriaki Kawamae; Ryuichiro Higashinaka
    This paper presents a topic model that detects topic distributions over time. Our proposed model, Trend Detection Model (TDM) introduces a latent trend class variable into each document. The trend class has a probability distribution over topics and a continuous distribution over time. Experiments using our data set show that TDM is useful as a generative model in the analysis of the evolution of trends.
    Keywords: dynamic topic model, latent variable modeling, trend analysis
    Unsupervised query segmentation using click data: preliminary results BIBAKFull-Text 1131-1132
      Julia Kiseleva; Qi Guo; Eugene Agichtein; Daniel Billsus; Wei Chai
    We describe preliminary results of experiments with an unsupervised framework for query segmentation, transforming keyword queries into structured queries. The resulting queries can be used to more accurately search product databases, and potentially improve result presentation and query suggestion. The key to developing an accurate and scalable system for this task is to train a query segmentation or attribute detection system over labeled data, which can be acquired automatically from query and click-through logs. The main contribution of our work is a new method to automatically acquire such training data -- resulting in significantly higher segmentation performance, compared to previously reported methods.
    Keywords: attribute extraction, query segmentation, structured queries
    Detecting epidemic tendency by mining search logs BIBAKFull-Text 1133-1134
      Weize Kong; Yiqun Liu; Shaoping Ma; Liyun Ru
    We consider the problem of detecting epidemic tendency by mining search logs. We propose an algorithm based on click-through information to select epidemic related queries/terms. We adopt linear regression to model epidemic occurrences and frequencies of epidemic related terms (ERTs) in search logs. The results show our algorithm is effective in finding ERTs which obtain a high correlation value with epidemic occurrences. We also find the proposed method performs better when combining different ERTs than using single ERT.
    Keywords: epidemic detection, query, search log mining
    An information retrieval approach to spelling suggestion BIBAKFull-Text 1135-1136
      Sai Krishna; Prasad Pingali; Vasudeva Varma
    In this paper, we present a two-step language-independent spelling suggestion system. In the first step, candidate suggestions are generated using an Information Retrieval (IR) approach. In step two, candidate suggestions are re-ranked using a new string similarity measure that uses the length of the longest common substrings occurring at the beginning and end of the words. We obtained very impressive results by reranking candidate suggestions using the new similarity measure. The accuracy of first suggestion is 92.3%, 90.0% and 83.5% for Dutch, Danish and Bulgarian language datasets respectively.
    Keywords: information retrieval, language independent, spelling suggestion
    Finding influentials based on the temporal order of information adoption in twitter BIBAKFull-Text 1137-1138
      Changhyun Lee; Haewoon Kwak; Hosung Park; Sue Moon
    Twitter offers an explicit mechanism to facilitate information diffusion and has emerged as a new medium for communication. Many approaches to find influentials have been proposed, but they do not consider the temporal order of information adoption. In this work, we propose a novel method to find influentials by considering both the link structure and the temporal order of information adoption in Twitter. Our method finds distinct influentials who are not discovered by other methods.
    Keywords: influentials, information diffusion, ranking, social networks, twitter
    The Social Honeypot Project: protecting online communities from spammers BIBAKFull-Text 1139-1140
      Kyumin Lee; James Caverlee; Steve Webb
    We present the conceptual framework of the Social Honeypot Project for uncovering social spammers who target online communities and initial empirical results from Twitter and MySpace. Two of the key components of the Social Honeypot Project are: (1) The deployment of social honeypots for harvesting deceptive spam profiles from social networking communities; and (2) Statistical analysis of the properties of these spam profiles for creating spam classifiers to actively filter out existing and new spammers.
    Keywords: social honeypots, social media, spam
    Pusic: musicalize microblog messages for summarization and exploration BIBAKFull-Text 1141-1142
      Cheng-Te Li; Hung-Che Lai; Chien-Tung Ho; Chien-Lin Tseng; Shou-De Lin
    Micro-blogging services provide platforms for users to share their feelings and ideas on the go. Designing to produce information stream in almost Micro-blogging services, although are capable of recording rich and diverse senses, still suffer from a drawback of not being able to provide deeper and summarized views. In this paper, we present a novel framework, Pusic, to musicalize micro-blogging messages for terms or users. Pusic can be used to (1) summarize users' messages into certain expression of emotions, (2) explore the emotions and senses and transform them into music, and (3) serve as a presentation of crowd net art. We generate the music from two aspects: emotion and harmony. The former is tackled by emotion detection from messages while the latter is established by rule-based harmonic heuristics according to the detected emotions. Pusic has been announced online for people's experience and further investigation.
    Keywords: emotion, exploration, micro-blogging, music, summarization
    Keyword extraction for social snippets BIBAKFull-Text 1143-1144
      Zhenhui Li; Ding Zhou; Yun-Fang Juan; Jiawei Han
    Today, a huge amount of text is being generated for social purposes on social networking services on the Web. Unlike traditional documents, such text is usually extremely short and tends to be informal. Analysis of such text benefit many applications such as advertising, search, and content filtering. In this work, we study one traditional text mining task on such new form of text, that is extraction of meaningful keywords. We propose several intuitive yet useful features and experiment with various classification models. Evaluation is conducted on Facebook data. Performances of various features and models are reported and compared.
    Keywords: keyword extraction, online advertising, social snippets
    Entity relation discovery from web tables and links BIBAKFull-Text 1145-1146
      Cindy Xide Lin; Bo Zhao; Tim Weninger; Jiawei Han; Bing Liu
    The World-Wide Web consists not only of a huge number of unstructured texts, but also a vast amount of valuable structured data. Web tables [2] are a typical type of structured information that are pervasive on the web, and Web-scale methods that automatically extract web tables have been studied extensively [1]. Many powerful systems (e.g. OCTOPUS [4], Mesa [3]) use extracted web tables as a fundamental component.
       In the database vernacular, a table is defined as a set of tuples which have the same attributes. Similarly, a web table is defined as a set of rows (corresponding to database tuples) which have the same column headers (corresponding to database attributes). Therefore, to extract a web table is to extract a relation on the web. In databases, tables often contain foreign keys which refer to other tables. Therefore, it follows that hyperlinks inside a web table sometimes function as foreign keys to other relations whose tuples are contained in the hyperlink's target pages. In this paper, we explore this idea by asking: can we discover new attributes for web tables by exploring hyperlinks inside web tables?
       This poster proposes a solution that takes a web table as input. Frequent patterns are generated as new candidate relations by following hyperlinks in the web table. The confidence of candidates are evaluated, and trustworthy candidates are selected to become new attributes for the table. Finally, we show the usefulness of our method by performing experiments on a variety of web domains.
    Keywords: entity relation discovery, link, web table
    Identifying featured articles in wikipedia: writing style matters BIBAKFull-Text 1147-1148
      Nedim Lipka; Benno Stein
    Wikipedia provides an information quality assessment model with criteria for human peer reviewers to identify featured articles. For this classification task "Is an article featured or not?" we present a machine learning approach that exploits an article's character trigram distribution. Our approach differs from existing research in that it aims to writing style rather than evaluating meta features like the edit history. The approach is robust, straightforward to implement, and outperforms existing solutions. We underpin these claims by an experiment design where, among others, the domain transferability is analyzed. The achieved performances in terms of the F-measure for featured articles are 0.964 within a single Wikipedia domain and 0.880 in a domain transfer situation.
    Keywords: domain transfer, information quality, wikipedia
    Retagging social images based on visual and semantic consistency BIBAKFull-Text 1149-1150
      Dong Liu; Xian-Sheng Hua; Meng Wang; Hong-Jiang Zhang
    The tags on social media websites such as Flickr are frequently imprecise and incomplete, thus there is still a gap between these tags and the actual content of the images. This paper proposes a social image "retagging" scheme that aims at assigning images with better content descriptors. The refining process is formulated as an optimization framework based on the consistency between "visual similarity" and "semantic similarity" in social images. An effective iterative bound optimization algorithm is applied to learn the optimal tag assignment. In addition, as many tags are intrinsically not closely-related to the visual content of the images, we employ a knowledge-based method to differentiate visual content related from unrelated tags and then constrain the tagging vocabulary of our automatic algorithm within the content related tags. Experimental results on a Flickr image collection demonstrate the effectiveness of this approach.
    Keywords: image tagging; tag refinement; retagging
    Research trails: getting back where you left off BIBAKFull-Text 1151-1152
      Jiahui Liu; Peter Jin Hong; Elin Rønby Pedersen
    In this paper, we present a prototype system that helps users in early-stage web research to create and reestablish context across fragmented work process, without requiring them to explicitly collect and organize the material they visit. The system clusters a user's web history and shows it as research trails. We present two user interaction models with the research trails. The first interaction model is implemented as a standalone application, which presents a hierarchical view of research trails. The second interaction model is integrated with the web browser. It shows the user's research trails as selectable and manipulable visual streams when they open a new tab. Thereby, the NewTab page serves as a springboard in the browser for a user resuming an ongoing task.
    Keywords: browser, early-stage research, research trail
    An efficient random access inverted index for information retrieval BIBAKFull-Text 1153-1154
      Xiaozhu Liu; Zhiyong Peng
    To improve query performance and space efficiency, an efficient random access blocked inverted index (RABI) is proposed. RABI divides an inverted list into blocks and compresses different part of each block with the corresponding encoding method to decrease space consumption. RABI can provide fast addressing and random access functions on the compressed blocked inverted index with the novel hybrid compression method, which can provide both block level and inner block level skipping function and further enhance both space and time efficiencies without inserting any additional auxiliary information. Experimental results show that RABI achieves both high space efficiency and search efficiency, and outperforms the existing approach significantly.
    Keywords: information retrieval, inverted index, random access
    Structure-aware music resizing using lyrics BIBAKFull-Text 1155-1156
      Zhang Liu; Chaokun Wang; Jianmin Wang; Wei Zheng; Shengfei Shi
    World wide web provides plenty of multimedia resources for creating rich media web applications. However, the collected music and other media resources always mismatch in the metric of time length. Existent music resizing approaches suffer from perceptual artifacts which degrade the performance of resized music. In this paper, a novel structure-aware music resizing approach is proposed. Through lyrics analysis, our approach can compress different parts of a music piece in variant compression rates. Experimental results show that the proposed method can effectively generate resized songs with good quality.
    Keywords: music resizing, rich media, time-scale compression
    HTTP database connector (HDBC): RESTful access to relational databases BIBAKFull-Text 1157-1158
      Alexandros Marinos; Erik Wilde; Jiannan Lu
    Relational databases hold a vast quantity of information and making them accessible to the web is an big challenge. There is a need to make these databases accessible with as little difficulty as possible, opening them up to the power and serendipity of the Web. Our work presents a series of patterns that bridge the relational database model with the architecture of the Web along with an implementation of some of them. The aim is for relational databases to be made accessible with no intermediate steps and no extra metadata required. This approach can vastly increase the data available on the web, therefore making the Web itself all the more powerful, while enabling its users to seamlessly perform tasks that previously required bridging multiple domains and paradigms or were not possible.
    Keywords: relational databases, rest, web architecture
    Detecting communities from tripartite networks BIBAKFull-Text 1159-1160
      Tsuyoshi Murata
    Online social media such as delicious and digg are represented as tripartite networks whose vertices are users, tags, and resources. Detecting communities from such tripartite networks is practically important. Modularity is often used as the criteria for evaluating the goodness of network divisions into communities. For tripartite networks, Neubauer defines a tripartite modularity which extends Murata's bipartite modularity. However, Neubauer's tripartite modularity still uses projections and it will lose information that original tripartite networks have. This paper proposes new tripartite modularity for tripartite networks that do not use projections. Experimental results show that better community structures can be detected by optimizing our tripartite modularity.
    Keywords: community, modularity, tripartite network
    INEX+DBPEDIA: a corpus for semantic search evaluation BIBAKFull-Text 1161-1162
      Jose R. Perez-Aguera; Javier Arroyo; Jane Greenberg; Joaquin Perez-Iglesias; Victor Fresno
    This paper presents a new collection based on DBpedia and INEX for evaluating semantic search performance. The proposed corpus is used to calculate the impact of considering document's structure on the retrieval performance of the Lucene and BM25 ranking functions. Results show that BM25 outperforms Lucene in all the considered metrics and that there is room for future improvements, which may be obtained using a hybrid approach combining both semantic technology and information retrieval ranking functions.
    Keywords: evaluation, semantic search
    Protecting data in multi-stakeholder web service system BIBAKFull-Text 1163-1164
      Tan Phan; Jun han; Garth Heward; Steve Versteeg
    Current Web Service security standards have inadequate support for end-to-end protection of data when some receivers of the data are unknown to the sender. This paper presents an approach to aid collaborative partner services in properly protecting each other's data. Our approach allows each partner to derive an adequate protection mechanism with minimum performance overhead for each message it sends based on those of the corresponding messages it receives. We modify the message handling mechanisms of Web Service engines to dynamically gather protection requirements for a given outgoing message by aggregating requirements from original owners of message data.
    Keywords: data security, propagation, protection mechanism
    Constructing folksonomies by integrating structured metadata BIBAKFull-Text 1165-1166
      Anon Plangprasopchok; Kristina Lerman; Lise Getoor
    Aggregating many personal hierarchies into a common taxonomy, also known as a folksonomy, presents several challenges due to its sparseness, ambiguity, noise, and inconsistency. We describe an approach to folksonomy learning based on relational clustering that addresses these challenges by exploiting structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr. We evaluate the learned folksonomy quantitatively by automatically comparing it to a reference taxonomy created by the Open Directory Project. Our empirical results suggest that the proposed approach improves upon the state-of-the-art folksonomy learning method.
    Keywords: collective knowledge, data mining, folksonomies
    Semantic lexicon adaptation for use in query interpretation BIBAKFull-Text 1167-1168
      Ana Maria Popescu; Patrick Pantel; Gilad Mishne
    We describe improvements to the use of semantic lexicons by a state-of-the-art query interpretation system powering a major search engine. We successfully compute concept label importance information for lexicon strings; lexicon augmentation with such information leads to a 6.4% precision increase on affected queries with no query coverage loss. Finally, lexicon filtering based on label importance leads to a 13% precision increase, but at the expense of query coverage.
    Keywords: lexicon adaptation, word sense disambiguation
    Towards comment-based cross-media retrieval BIBAKFull-Text 1169-1170
      Martin Potthast; Benno Stein; Steffen Becker
    This paper investigates whether Web comments can be exploited for cross-media retrieval. Comparing Web items such as texts, images, videos, music, products, or personal profiles can be done at various levels of detail; our focus is on topic similarity. We propose to compare user-supplied comments on Web items in lieu of the commented items themselves. If this approach is feasible, the task of extracting and mapping features between arbitrary pairs of item types can be circumvented, and well-known text retrieval models can be applied instead -- given that comments are available. We report on results of a preliminary, but nonetheless large-scale experiment which shows that, if comments on textual items are compared with comments on video items, topically similar pairs achieve a sufficiently high cross-media similarity.
    Keywords: cross-media retrieval, web comments
    Inferring query intent from reformulations and clicks BIBAKFull-Text 1171-1172
      Filip Radlinski; Martin Szummer; Nick Craswell
    Many researchers have noted that web search queries are often ambiguous or unclear. We present an approach for identifying the popular meanings of queries using web search logs and user click behavior. We show our approach to produce more complete and user-centric intents than expert judges by evaluating on TREC queries. This approach was also used by the TREC 2009 Web Track judges to obtain more representative topic descriptions from real queries.
    Keywords: diversity, intents, subtopics
    Conversion rate based bid adjustment for sponsored search BIBAKFull-Text 1173-1174
      Benjamin Rey; Ashvin Kannan
    Advertisers use Sponsored Search to drive traffic to their site at a conversion rate and cost per conversion that provides value to them. However, very often advertisers bid at a constant price on a bundle of keywords, either for lack of enough data to fully optimize their bids at a keyword level, or indirectly by opting into Advanced Matching (AM) that allows an advertiser to reach a large number of queries while explicitly bidding only on a limited number. Then this single bid price reflects the return the advertiser gets from the full bundle. Under these conditions, the advertiser is competing too aggressively for some keyword auctions and with too low bids for others. In this paper, we propose a solution to improve the fairness of each keyword's bid prices within an AM bundle: adjusting the AM keyword bid by the ratio of its conversion rate to the conversion rate it would have reached had it been an Exact Match (EM). First we describe how we measure advertisers' conversion rates despite the opt-in nature of conversion tracking, and illustrate the need for bid adjustment in the context of AM. Then we present our approach to predict conversion rates in a robust manner. Our model uses a number of features capturing the quality of the match between the ad and the query. Then we describe how we adjust keyword bid prices to reflect their value to the advertiser thereby improving (1) the auction through fewer incorrectly high bids in the auction, (2) advertiser return through more auctions won by high value keywords and less by low value keywords, and (3) user satisfaction through higher conversion rate. Finally, we present experimental results from our live system.
    Keywords: ad auctions, advanced match, bid optimization, conversion rate, cost per action, search engine, sponsored search
    useKit: a step towards the executable web 3.0 BIBAKFull-Text 1175-1176
      Sven Rizzotti; Helmar Burkhart
    A lot of situations from daily life have found their counterpart in the internet. Buying or selling goods, discovering information or making social connections have been very successful in the virtual world and make the Web the central working place.
       Until now the purpose of a Web site has been defined at the time of creation. Extension of the available functionality or transfer of the functionality onto another site has only been possible with considerable additional effort. With useKit, we present a software platform that allows users to add individual selected functionalities to any Web site without installing software.
       useKit offers a new approach towards a collaborative, personalized and individually compiled view of the Web. useKit focuses on personalized applications and services that can be applied to any Web site. In analogy to file system permissions, this can be seen as an instance of the" "executable Web"" or Web 3.0 if Web 1.0 was ""read-only"", and Web 2.0 was "read-write"". User contributed code will morph online applications into omnipresent functional platforms with a single interface.
    Keywords: mashup, personalization, refinement, service composition, web3.0
    Web-scale k-means clustering BIBAKFull-Text 1177-1178
      D. Sculley
    We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
    Keywords: scalability, sparse solutions, unsupervised clustering
    Learning based access control in online social networks BIBAKFull-Text 1179-1180
      Mohamed Shehab; Gorrell Cheek; Hakim Touati; Anna C. Squicciarini; Pau-Chen Cheng
    Online social networking sites are experiencing tremendous user growth with hundreds of millions of active users. As a result, there is a tremendous amount of user profile data online, e.g., name, birthdate, etc. Protecting this data is a challenge. The task of access policy composition is a tedious and confusing effort for the average user having hundreds of friends. We propose an approach that assists users in composing and managing their access control policies. Our approach is based on a supervised learning mechanism that leverages user provided example policy settings as training sets to build classifiers that are the basis for auto-generated policies. Furthermore, we provide mechanisms to enable users to fuse policy decisions that are provided by their friends or others in the social network. These policies then regulate access to user profile objects. We implemented our approach and, through extensive experimentation, prove the accuracy of our proposed mechanisms.
    Keywords: privacy, social networks, supervised learning
    Situation detection and control using spatio-temporal analysis of microblogs BIBAKFull-Text 1181-1182
      Vivek K. Singh; Mingyan Gao; Ramesh Jain
    Large volumes of spatio-temporal-thematic data being created using sites like Twitter and Jaiku, can potentially be combined to detect events, and understand various 'situations' as they are evolving at different spatio-temporal granularity across the world. Taking inspiration from traditional image pixels which represent aggregation of photon energies at a location, we consider aggregation of user interest levels at different geo-locations as social pixels. Combining such pixels spatio-temporally allows for creation of social images and video. Here, we describe how the use of relevant (media processing inspired) situation detection operators upon such 'images', and domain based rules can be used to decide relevant control actions. The ideas are showcased using a Swine flu monitoring application which uses Twitter data.
    Keywords: control, event, microblogs, situation, twitter
    Structural analysis of the emerging event-web BIBAKFull-Text 1183-1184
      Vivek K. Singh; Ramesh Jain
    Events are the fundamental abstractions to study the dynamic world. We believe that the next generation of web (i.e. event-web), will focus on interconnections between events as they occur across space and time [3]. In fact we argue that the real value of large volumes of microblog data being created daily lies in its inherent spatio-temporality, and its correlation with the real-world events. In this context, we studied the structural properties of a corpus of 5,835,237 Twitter microblogs, and found it to exhibit Power laws across space and time, much like those exhibited by events in multiple domains. The properties studied over microblogs on different topics can be applied to study relationships between related events, as well as data organization for event-based, real-time, and location-aware applications.
    Keywords: event-web, microblogs, pareto law, power law, zipfs law
    Tagging and navigability BIBAKFull-Text 1185-1186
      Adish Singla; Ingmar Weber
    We consider the problem of optimal tagging for navigational purposes in one's own collection. What is the best that a forgetful user can hope for in terms of ease of retrieving a labeled object? We prove that the number of tags has to increase logarithmically in the collection size to maintain a manageable result set. Using Flickr data we then show that users do indeed apply more and more tags as their collection grows and that this is not due to a global increase in tagging activity. However, as the additional terms applied are not statistically independent, users of large collections still have to deal with larger and larger result sets, even when more tags are used as search terms. We pose optimal tag suggestion for navigational purposes as an open problem.
    Keywords: data organization, navigability, tagging
    Sampling high-quality clicks from noisy click data BIBAKFull-Text 1187-1188
      Adish Singla; Ryen W. White
    Click data captures many users' document preferences for a query and has been shown to help significantly improve search engine ranking. However, most click data is noisy and of low frequency, with queries associated to documents via only one or a few clicks. This severely limits the usefulness of click data as a ranking signal. Given potentially noisy clicks comprising results with at most one click for a query, how do we extract high-quality clicks that may be useful for ranking? In this poster, we introduce a technique based on query entropy for noise reduction in click data. We study the effect of query entropy and as well as features such as user engagement and the match between the query and the document. Based on query entropy plus other features, we can sample noisy data to 15% of its overall size with 43% query recall and an average increase of 20% in precision for recalled queries.
    Keywords: click data, noise elimination, query entropy, web search ranking
    Estimating the web robot population BIBAKFull-Text 1189-1190
      Yang Sun; C. Lee Giles
    In this research, capture-recapture (CR) models are used to estimate the population of web robots based on web server access logs from different websites. Each robot is considered as an individual randomly surfing the web and each website is considered as a trap that records the visitation of robots. We use maximum likelihood estimator to fit the observation data. Results show that there are 3,860 identifiable robot User-Agent strings and 780,760 IP addresses being used by web robots around the world. We also examine the origination of the named robots by their IP addresses. The results suggest that over 50% of web robot IP addresses are from United States and China.
    Keywords: web robot population
    Automated detection of session fixation vulnerabilities BIBAKFull-Text 1191-1192
      Yusuke Takamatsu; Yuji Kosuga; Kenji Kono
    Session fixation is a technique for obtaining the visitor's session identifier (SID) by forcing the visitor to use the SID supplied by the attacker. The attacker who obtains the victim's SID can masquerade as the visitor. In this paper, we propose a technique to automatically detect session fixation vulnerabilities in web applications. Our technique uses attack simulator that executes a real session fixation attack and check whether it is successful or not. In the experiment, our system successfully detected vulnerabilities in our original test cases and in a real world web application.
    Keywords: session fixation, web application security
    Keyword search over key-value stores BIBAKFull-Text 1193-1194
      Arash Termehchy; Marianne Winslett
    Key-value stores (KVSs) are the most prevalent storage systems for large scale web services. As they do not have the structural complexities of RDBMSs, they are more efficient. In this paper, we introduce the problem of keyword search over KVSs and analyze its differences with keyword search over documents and relational databases. We propose a novel method called Keyword Search over Key-value stores (KSTORE) to solve the problem. Our user study using two real life data sets shows that KSTORE provides better ranking than the extended versions of document and relational database keyword search proposals and poses reasonable overheads.
    Keywords: cloud computing, key-value stores, keyword search
    Scalable discovery of contradictions on the web BIBAKFull-Text 1195-1196
      Mikalai Tsytsarau; Themis Palpanas; Kerstin Denecke
    Our study addresses the problem of large-scale contradiction detection and management, from data extracted from the Web. We describe the first systematic solution to the problem, based on a novel statistical measure for contradictions, which exploits first- and second-order moments of sentiments. Our approach enables the interactive analysis and online identification of contradictions under multiple levels of time granularity. The proposed algorithm can be used to analyze and track opinion evolution over time and to identify interesting trends and patterns. It uses an incrementally updatable data structure to achieve computational efficiency and scalability. Experiments with real datasets show promising time performance and accuracy.
    Keywords: contradiction analysis, opinion mining
    A practical system for harvesting and monitoring hot topics on the web BIBAKFull-Text 1197-1198
      Xiaojun Wan; Jianwu Yang
    This poster briefly describes a practical system named FounderWISE for harvesting and monitoring hot topics on the Web. FounderWISE consists of five components: Web crawler, text classifier, topic detector, topic summarizer and topic analyzer. In this poster we present two key components of topic detector and topic analyzer. The system has been successfully deployed in a few Chinese major government departments.
    Keywords: founderwise, information diffusion, topic detection, topic evolution
    Co-optimization of multiple relevance metrics in web search BIBAKFull-Text 1199-1200
      Dong Wang; Chenguang Zhu; Weizhu Chen; Gang Wang; Zheng Chen
    Several relevance metrics, such as NDCG, precision and pSkip, are proposed to measure search relevance, where different metrics try to characterize search relevance from different perspectives. Yet we empirically find that the direct optimization of one metric cannot always achieve the optimal ranking of another metric. In this paper, we propose two novel relevance optimization approaches, which take different metrics into a global consideration where the objective is to achieve an ideal tradeoff between different metrics. To achieve this objective, we propose to co-optimize multiple relevance metrics and show their effectiveness.
    Keywords: lambdarank, learning to rank, user feedback
    Analyzing content-level properties of the web adversphere BIBAKFull-Text 1201-1202
      Yong Wang; Daniel Burgener; Aleksandar Kuzmanovic; Gabriel Maciá-Fernández
    Advertising has become an integral and inseparable part of the World Wide Web. However, neither public auditing nor monitoring mechanisms still exist in this emerging area. In this paper, we present our initial efforts on building a content-level auditing service for web-based ad networks. Our content-level measurements -- understanding the ad distribution mechanisms and evaluating location-based and behavioral targeting approaches -- bring useful auditing information to all entities involved in the online advertising business. We extensively evaluate Google's, AOL's, and Adblade's ad networks and demonstrate how their different design philosophies dominantly affect their performance at the content level.
    Keywords: behavioral targeting, location-based advertising, web-based ad network
    How accurately can one's interests be inferred from friends BIBAKFull-Text 1203-1204
      Zhen Wen; Ching-Yung Lin
    Search and recommendation systems must effectively model user interests in order to provide personalized results. The proliferation of social software makes social network an increasingly important source for user interest modeling, because of the social influence and correlation among friends. However, there are large variations in people's contribution of social content. Therefore, it is impractical to accurately model interests for all users. As a result, applications need to decide whether to utilize a user interest model based on its accuracy. To address this challenge, we present a study on the accuracy of user interests inferred from three types of social content: social bookmarking, file sharing, and electronic communication, in an organizational social network within a large-scale enterprise. First, we demonstrate that combining different types of social content to infer user interests outperforms methods that use only one type of social content. Second, we present a technique to predict the inference accuracy based on easily observed network characteristics, including user activeness, network in-degree, out-degree, and betweenness centrality.
    Keywords: accuracy, social networks, user modeling
    Learning to evaluate the visual quality of web pages BIBAKFull-Text 1205-1206
      Ou Wu; Yunfei Chen; Bing Li; Weiming Hu
    A beautiful and well-laid out Web page greatly facilitates users' accessing and enhances browsing experiences. We use "visual quality (VQ)" to denote the aesthetics of Web pages. In this paper, a computational aesthetics approach is proposed to learn the evaluation model for the visual quality of Web pages. First, a Web page layout extraction algorithm (V-LBE) is introduced to partition a Web page into major layout blocks. Then, regarding a Web page as a semi-structured image, features known to significantly affect the visual quality of a Web page are extracted to construct a feature vector. The experimental results show the initial success of our approach. Potential applications include Web search and Web design.
    Keywords: aesthetics, learning, semi-structured image, visual quality
    Are search engine users equally reliable? BIBAKFull-Text 1207-1208
      Qianli Xing; Yiqun Liu; Rongwei Cen; Min Zhang; Shaoping Ma; Liyun Ru
    In this paper, we study on the reliability of search engine users using click-through data. We proposed a graph-based approach to evaluate user reliability according to how users click on search result lists. We tried to incorporate this measure of reliability into relevance feedback for improving ranking performances. Experimental results indicate that the proposed approach is both effective and applicable.
    Keywords: relevance feedback, user reliability, web search
    RerankEverything: a reranking interface for browsing search results BIBAKFull-Text 1209-1210
      Takehiro Yamamoto; Satoshi Nakamura; Katsumi Tanaka
    This paper proposes a system called RerankEverything, which enables users to rerank search results in any search service, such as a Web search engine, an e-commerce site, a hotel reservation site and so on. In conventional search services, interactions between users and services are quite limited and complicated. In addition, search functions and interactions to refine search results differ depending on the services. By using RerankEverything, users can interactively explore search results in accordance with their interests by reranking search results from various viewpoints.
    Keywords: reranking, search user interfaces, wrapper generation
    LINKREC: a unified framework for link recommendation with user attributes and graph structure BIBAKFull-Text 1211-1212
      Zhijun Yin; Manish Gupta; Tim Weninger; Jiawei Han
    With the phenomenal success of networking sites (e.g., Facebook, Twitter and LinkedIn), social networks have drawn substantial attention. On online social networking sites, link recommendation is a critical task that not only helps improve user experience but also plays an essential role in network growth. In this paper we propose several link recommendation criteria, based on both user attributes and graph structure. To discover the candidates that satisfy these criteria, link relevance is estimated using a random walk algorithm on an augmented social graph with both attribute and structure information. The global and local influence of the attributes is leveraged in the framework as well. Besides link recommendation, our framework can also rank attributes in a social network. Experiments on DBLP and IMDB data sets demonstrate that our method outperforms state-of-the-art methods based on network structure and node attribute information for link recommendation.
    Keywords: link recommendation, random walk
    A link-based similarity measure for scientific literature BIBAKFull-Text 1213-1214
      Seok-Ho Yoon; Sang-Wook Kim; Sunju Park
    In this paper, we propose a new approach to measure similarities among academic papers based on their references. Our similarity measure uses both in-link and out-link by transforming in-link and out-link into undirected links.
    Keywords: link-based similarity measure, scientific literature
    Social group suggestion from user image collections BIBAKFull-Text 1215-1216
      Jie Yu; Xin Jin; Jiawei Han; Jiebo Luo
    Photo-sharing services have attracted millions of people and helped construct massive social networks on the Web. A popular trend is that users share their image collections within social groups, which greatly promotes the interactions between users and expands their social networks. Existing systems have difficulties in generating satisfactory social group suggestions because the images are classified independently and their relationship in a collection is ignored. In this work, we intend to produce suggestions of suitable photo-sharing groups from a user's personal photo collection by mining images on the Web and leveraging the collection context. Both visual content and textual annotations are integrated to generate initial prediction of the events or topics depicted in the images. A user collection-based label propagation method is proposed to improve the group suggestion by modeling the relationship of images in the same collection as a sparse weighted graph. Experiments on real user images and comparisons with the state-of-the-art techniques demonstrate the effectiveness of the proposed approaches.
    Keywords: group recommendation, label propagation, social image
    A quality-aware model for sales prediction using reviews BIBAKFull-Text 1217-1218
      Xiaohui Yu; Yang Liu; Xiangji Huang; Aijun An
    Writing and publishing reviews online has become an increasingly popular way for people to express opinions and sentiments. Analyzing the large volume of online reviews available can produce useful knowledge that are of interest to vendors and other parties. Prior studies in the literature have shown that online reviews have a significant correlation with the sales of products, and therefore mining the reviews could help predict the sales performance of relevant products. However, those studies fail to consider one important factor that may significantly affect the accuracy of the prediction, i.e., the quality of the reviews. In this paper, we propose a regression model that explicitly takes into account the quality factor, and discusses how this quality information can be predicted when it is not readily available. Experimental results on a movie review dataset confirm the effectiveness of the proposed model.
    Keywords: review mining, review quality mining
    Review recommendation with graphical model and EM algorithm BIBAKFull-Text 1219-1220
      Richong Zhang; Thomas Tran
    Automatically assessing the quality and helpfulness of consumer reviews is more and more desirable with the evolutionary development of online review systems. Existing helpfulness assessment methodologies make use of the positive vote fraction as a benchmark and heuristically find a "best guess" to estimate the helpfulness of review documents. This benchmarking methodology ignores the voter population size and treats the same positive vote fraction as the same helpfulness value. We propose a review recommendation approach that make use of the probability density of the review helpfulness as the benchmark and exploit graphical model and Expectation Maximization (EM) algorithm for the inference of review helpfulness. The experimental results demonstrate that the proposed approach is superior to existing approaches.
    Keywords: helpfulness, online review, recommendation
    Selective recrawling for object-level vertical search BIBAKFull-Text 1221-1222
      Yaqian Zhou; Mengjing Jiang; Qi Zhang; Xuanjing Huang; Lide Wu
    In this paper we propose a novel recrawling method based on navigation patterns called Selective Recrawling. The goal of selective recrawling is to automatically select page collections that have large coverage and little redundancy to a pre-defined vertical domain. It only requires several seed objects and can select a set of URL patterns to cover most objects. The selected set can be used to recrawl the web pages for quite a period of time and renewed periodically. Experiments on local event data show that our method can greatly reduce the downloading of web pages while keep the comparative object coverage.
    Keywords: recrawling, vertical search
    Efficient web pages identification for entity resolution BIBAKFull-Text 1223-1224
      Jia Zhu; Gabriel Fung; Xiaofang Zhou
    Entity resolution (ER) is a problem that arises in many areas. In most of cases, it represents a task that multiple entities from different sources require to be identified if they refer to the same or different objects because there are not unique identifiers associated with them. In this paper, we propose a model using web pages identification to identify entities and merge those entities refer to one object together. We use a classical name disambiguation problem as case study and examine our model on a subset of digital library records as the first stage of our work. The favorable results indicated that our proposed approach is highly effective.
    Keywords: entity resolution, name disambiguation, web pages identification
    Anonymizing user profiles for personalized web search BIBAKFull-Text 1225-1226
      Yun Zhu; Li Xiong; Christopher Verdery
    We study the problem of anonymizing user profiles so that user privacy is sufficiently protected while the anonymized profiles are still effective in enabling personalized web search. We propose a Bayes-optimal privacy notion to bound the prior and posterior probability of associating a user with an individual term in the anonymized user profile set. We also propose a novel bundling technique that clusters user profiles into groups by taking into account the semantic relationships between the terms while satisfying the privacy constraint. We evaluate our approach through a set of preliminary experiments using real data demonstrating its feasibility and effectiveness.
    Keywords: anonymization, personalized search, privacy-preserving data publishing
    Patch-based skin color detection and its application to pornography image filtering BIBAKFull-Text 1227-1228
      Haiqiang Zuo; Weiming Hu; Ou Wu
    Along with the explosive growth of the World Wide Web, an immense industry for the production and consumption of pornography has grown. Though the censorship and legal restraints on pornography are discriminating in different historical, cultural and national contexts, selling pornography to minors is not allowed in most cases. Detecting human skin tone is of utmost importance in pornography image filtering algorithms. In this paper, we propose two patch-based skin color detection algorithms: regular patch and irregular patch skin color detection algorithms. On the basis of skin detection, we extract 31-dimensional features from the input image, and these features are fed into a random forest classifier. Our algorithm has been incorporated into an adult-content filtering infrastructure, and is now in active use for preventing minors from accessing pornographic images via mobile phones.
    Keywords: covariance descriptors, pornography, skin color detection, skin patch, skin tone

    WWW 2010 demos

    Access: news and blog analysis for the social sciences BIBAKFull-Text 1229-1232
      Mikhail Bautin; Charles B. Ward; Akshay Patil; Steven S. Skiena
    The social sciences strive to understand the political, social, and cultural world around us, but have been impaired by limited access to the quantitative data sources enjoyed by the hard sciences. Careful analysis of Web document streams holds enormous potential to solve longstanding problems in a variety of social science disciplines through massive data analysis. This paper introduces the TextMap Access system, which provides ready access to a wealth of interesting statistics on millions of people, places, and things across a number of interesting web corpora. Powered by a flexible and scalable distributed statistics computation framework using Hadoop, continually updated corpora include newspapers, blogs, patent records, legal documents, and scientific abstracts; well over a terabyte of raw text and growing daily. The Lydia Textmap Access system, available through http://www.textmap.com/access, provides instant access for students and scholars through a convenient web user-interface. We describe the architecture of the TextMap Access system, and its impact on current research in political science, sociology, and business/marketing.
    Keywords: blog analysis, Hadoop, news analysis, social sciences
    Hearsay: a new generation context-driven multi-modal assistive web browser BIBAKFull-Text 1233-1236
      Yevgen Borodin; Faisal Ahmed; Muhammad Asiful Islam; Yury Puzis; Valentyn Melnyk; Song Feng; I. V. Ramakrishnan; Glenn Dausch
    This demo will present HearSay, a multi-modal non-visual web browser, which aims to bridge the growing Web Accessibility divide between individuals with visual impairments and their sighted counterparts, and to facilitate full participation of blind individuals in the growing Web-based society.
    Keywords: assistive browser, audio interface, blind users, multi-modal, screen reader, web accessibility
    A client-server architecture for state-dependent dynamic visualizations on the web BIBAKFull-Text 1237-1240
      Daniel Coffman; Danny Soroker; Chandra Narayanaswami; Aaron Zinman
    As sophisticated enterprise applications move to the Web, some advanced user experiences become difficult to migrate due to prohibitively high computation, memory, and bandwidth requirements. State-dependent visualizations of large-scale data sets are particularly difficult since a change in the client's context necessitates a change in the displayed results. This paper describes a Web architecture where clients are served a session-specific image of the data, with this image divided into tiles dynamically generated by the server. This set of tiles is supplemented with a corpus of metadata describing the immediate vicinity of interest; additional metadata is delivered as needed in a progressive fashion in support and anticipation of the user's actions. We discuss how the design of this architecture was motivated by the goal of delivering a highly responsive user experience. As an example of a complete application built upon this architecture, we present OrgMaps, an interactive system for navigating hierarchical data, enabling fluid, low-latency navigation of trees of hundreds of thousands of nodes on standard Web browsers using only HTML and JavaScript.
    Keywords: rich internet applications
    Hierarchical cluster visualization in web mapping systems BIBAKFull-Text 1241-1244
      Jean-Yves Delort
    This paper presents a technique for visualizing large spatial data sets in Web Mapping Systems (WMS). The technique creates a hierarchical clustering tree, which is subsequently used to extract clusters that can be displayed at a given scale without cluttering the map. Voronoi polygons are used as aggregation symbols to represent the clusters. This technique retains hierarchical relationships between data items at different scales. In addition, aggregation symbols do not overlap, and their sizes and the number of points that they cover is controlled by the same parameter. A prototype has been implemented and tested showing the effectiveness of the method for visualizing large data sets in WMS.
    Keywords: cluster footprints, hierarchical clustering, visualization
    Rants: a framework for rank editing and sharing in web search BIBAKFull-Text 1245-1248
      Byron J. Gao; Joey Jan
    With a Wiki-like search interface, users can edit ranks of search results and share the edits with the rest of the world. This is an effective way of personalization, as well as a practice of mass collaboration that allows users to vote for ranking and improve search performance. Currently, there are several ongoing experimentation efforts from the industry, e.g., SearchWiki by Google and U Rank by Microsoft. Beyond that, there is little published research on this new search paradigm. In this paper, we make an effort to establish a framework for rank editing and sharing in the context of web search, where we identify fundamental issues and propose principled solutions. Comparing to existing systems, for rank editing, our framework allows users to specify not only relative, but also absolute preferences. For edit sharing, our framework provides enhanced flexibility, allowing users to select arbitrarily aggregated views. In addition, edits can be shared among similar queries. We present a prototype system Rants, that implements the framework and provides search services through the Google web search API.
    Keywords: edit sharing, mass collaboration, personalization, rank editing, search interface, social search
    QuickSuggest: character prediction on web appliances BIBAKFull-Text 1249-1252
      Ullas Gargi; Rich Gossweiler
    As traditional media and information devices integrate with the web, they must suddenly support a vastly larger database of relevant items. Many devices use remote controls with on-screen keyboards which are not well suited for text entry but are difficult to displace. We introduce a text entry method which significantly improves text entry speed for on-screen keyboards using the same simple Up/Down/Left/Right/Enter interface common to remote controls and gaming devices used to enter text. The paper describes QuickSuggest's novel adaptive user interface, demonstrates quantitative improvements from simulation results on millions of user queries and shows ease of use and efficiency with no learning curve in user experiments.
    Keywords: language models, onscreen keyboards, predictive text input, query suggestions, remote controls, user interface, web appliances
    visKQWL, a visual renderer for a semantic web query language BIBAKFull-Text 1253-1256
      Andreas Hartl; Klara Weiand; François Bry
    KiWi is a semantic Wiki that combines the Wiki philosophy of collaborative content creation with the methods of the Semantic Web in order to enable effective knowledge management.
       Querying a Wiki must be simple enough for beginning users, yet powerful enough to accommodate experienced users.
       To this end, the keyword-based KiWi query language (KWQL) supports queries ranging from simple lists of keywords to expressive rules for selecting and reshaping Wiki (meta-)data.
       In this demo, we showcase visKWQL, a visual interface for the KWQL language aimed at supporting users in the query construction process. visKWQL and its editor are described, and their functionality is illustrated using example queries.
       visKWQL's editor provides guidance throughout the query construction process through hints, warnings and highlighting of syntactic errors.
       The editor enables round-tripping between the twin languages KWQL and visKWQL, meaning that users can switch freely between the textual and visual form when constructing or editing a query. It is implemented using HTML, JavaScript, and CSS, and can thus be used in (almost) any web browser without any additional software.
    Keywords: keyword querying, semantic web, semantic wikis, visual query languages, wiki
    GRAPE: a system for disambiguating and tagging people names in web search BIBAKFull-Text 1257-1260
      Lili Jiang; Wei Shen; Jianyong Wang; Ning An
    Name ambiguity is a big challenge in people information retrieval and has received considerable attention, especially with the increasing volume of Web data in recent years. In this demo, we present a system, GRAPE, which is capable of finding people related information over the Web. The salient features of our system are people name disambiguation and people tag presentation, which effectively distinguish different people entities sharing the same name and uniquely represent each namesake with a cluster of tags, such as occupation, birthdate, and organization.
    Keywords: clustering, name disambiguation, tag presentation
    iRIN: image retrieval in image-rich information networks BIBAKFull-Text 1261-1264
      Xin Jin; Jiebo Luo; Jie Yu; Gang Wang; Dhiraj Joshi; Jiawei Han
    In this demo, we present a system called iRIN designed for performing image retrieval in image-rich information networks. We first introduce MoK-SimRank to significantly improve the speed of SimRank, one of the most popular algorithms for computing node similarity in information networks. Next, we propose an algorithm called SimLearn to (1) extend MoK-SimRank to heterogeneous image-rich information network, and (2) account for both link-based and content-based similarities by seamlessly integrating reinforcement learning with feature learning.
    Keywords: image retrieval, information network, ranking
    Live web search experiments for the rest of us BIBAKFull-Text 1265-1268
      Timothy Jones; David Hawking; Ramesh Sankaranarayana
    There are significant barriers to academic research into user Web search preferences. Academic researchers are unable to manipulate the results shown by a major search engine to users and would have no access to the interaction data collected by the engine. Our initial approach to overcoming this was to ask participants to submit queries to an experimental search engine rather than their usual search tool. Over several different experiments we found that initial user buy-in was high but that people quickly drifted back to their old habits and stopped contributing data. Here, we report our investigation of possible reasons why this occurs. An alternative approach is exemplified by the Lemur browser toolbar, which allows local collection of user interaction data from search engine sessions, but does not allow result pages to be modified. We will demonstrate a new Firefox toolbar that we have developed to support experiments in which search results may be arbitrarily manipulated. Using our toolbar, academics can set up the experiments they want to conduct, while collecting (subject to human experimentation guidelines) queries, clicks and dwell times as well as optional explicit judgments.
    Keywords: browser extensions, implicit measures, web search
    Debugging standard document formats BIBAKFull-Text 1269-1272
      Nabil Layaida; Pierre Geneves
    We present a tool for helping XML schema designers to obtain a high quality level for their specifications.
       The tool allows one to analyze relations between classes of XML documents and formally prove them. For instance, the tool can be used to check forward and backward compatibilities of recommendations. When such a relation does not hold, the tool allows one to identify the reasons and reports detailed counter-examples that exemplify the problem.
       For this purpose, the tool relies on recent advances in logic-based automated theorem proving techniques that allow for efficient reasoning on very large sets of XML documents.
       We believe this tool can be of great value for standardization bodies that define specifications using various XML type definition languages (such as W3C specifications), and are concerned with quality assurance for their normative recommendations.
    Keywords: compatibility, format, schema, xml
    PageSense: style-wise web page advertising BIBAKFull-Text 1273-1276
      Lusong Li; Tao Mei; Xiang Niu; Chong-Wah Ngo
    This paper presents an innovative style-wise advertising platform for web page. Web page "style" mainly refers to visual effects, such as color and layout. Unlike the most popular ad-network such as Google AdSense which needs publishers to change the original structure of their pages and define the position and style of the embedded ads manually, style-wise page advertising aims to automatically deliver style-consistent ads at proper positions within the web page, without breaking the layout of the original page. Our system is motivated from the fact that almost 90% web pages contain blank regions without any content. Given a web page with some blank regions, style-wise page advertising is able to detect the suitable blank areas for advertising, rank the ads according to the semantic relevance and web page style, and embed relevant ads into these nonintrusive areas. Style-wise page advertising represents one of the first attempts towards contextual advertising which enables the publishers to save effort when applying online advertising services.
    Keywords: multimodal relevance, web page advertising, web page style
    Web-based framework for spatiotemporal screen real estate management of interactive public displays BIBAKFull-Text 1277-1280
      Tomas Linden; Tommi Heikkinen; Timo Ojala; Hannu Kukka; Marko Jurmu
    In this paper we present a web-based framework for spatiotemporal screen real estate management of interactive public displays. The framework facilitates dynamic partitioning of the screen real estate into virtual screens assigned for multiple concurrent web applications. The framework is utilized in the implementation of so-called UBI-hotspot, which provides various information services via different interaction modalities including mobile. The framework facilitates seamless integration of third party web applications residing anywhere in the public Internet into the UBI-hotspot, thus catering for a scalable and open architecture. We report the deployment of a network of indoor and outdoor UBI-hotspots at downtown Oulu, Finland. The quantitative data on the usage of the UBI-hotspots implicitly speaks in favor of the practical applicability of the framework.
    Keywords: ubi-hotspot, ubiquitous computing, urban computing
    Structured audio podcasts via web text-to-speech system BIBAKFull-Text 1281-1284
      Giulio Mori; Maria Claudia Buzzi; Marina Buzzi; Barbara Leporini
    Audio podcasting is increasingly present in the educational field and is especially appreciated as an ubiquitous/pervasive tool ("anywhere, anytime, at any pace") for acquiring or expanding knowledge. We designed and implemented a Web-based Text To Speech (TTS) system for automatic generation of a set of structured audio podcasts from a single text document. The system receives a document in input (doc, rtf, or txt), and in output provides a set of audio files that reflect the document's internal structure (one mp3 file for each document section), ready to be downloaded on portable mp3 players. Structured audio files are useful for everyone but are especially appreciated by blind users, who must explore content audially. Fully accessible for the blind, our system offers WAI-ARIA-based Web interfaces for easy navigation and interaction via screen reader and voice synthesizer, and produces a set of accessible audio files for Rockbox mp3 players (mp3 and talk files), allowing blind users to also listen to naturally spoken file names (instead of their spelled-out strings).
       In this demo, we will show how the system works when a user interacts via screen reader and voice synthesizer, showing the interaction with both our Web-based system and with an mp3 player.
    Keywords: audio podcasting, blind, e-learning, mp3 files, TTS, WAI-ARIA
    ourSpaces: linking provenance and social data in a virtual research environment BIBAKFull-Text 1285-1288
      Richard Reid; Edoardo Pignotti; Peter Edwards; Adele Laing
    Web-based Virtual Research Environments (VREs) have been proposed as one way in which e-Science tools can be deployed to support and enhance the research process. We are exploring the use of Linked Data in combination with the Open Provenance Model (OPM) and Social Web concepts to facilitate interactions between people and data in the context of a VRE. In this demo we present the ourSpaces VRE and outline the technologies used to link together provenance, research artefacts, projects, geographical locations and social data in the context of interdisciplinary research.
    Keywords: e-science, linked data, provenance, vre
    Diversifying landmark image search results by learning interested views from community photos BIBAKFull-Text 1289-1292
      Yuheng Ren; Mo Yu; Xin-Jing Wang; Lei Zhang; Wei-Ying Ma
    In this paper, we demonstrate a novel landmark photo search and browsing system: Agate, which ranks landmark image search results considering their relevance, diversity and quality. Agate learns from community photos the most interested aspects and related activities of a landmark, and generates adaptively a Table of Content (TOC) as a summary of the attractions to facilitate the user browsing. Image search results are thus re-ranked with the TOC so as to ensure a quick overview of the attractions of the landmarks. A novel non-parametric TOC generation and set-based ranking algorithm, MoM-DPM Sets, is proposed as the key technology of Agate. Experimental results based on human evaluation show the effectiveness of our model and users' preference for Agate.
    Keywords: landmark image search, set-based ranking, user interest modeling
    New-web search with microblog annotations BIBAKFull-Text 1293-1296
      Tom Rowlands; David Hawking; Ramesh Sankaranarayana
    Web search engines discover indexable documents by recursively 'crawling' from a seed URL. Their rankings take into account link popularity. While this works well, it introduces biases towards older documents. Older documents are more likely to be the target of links, while new documents with few, or no, incoming links are unlikely to rank highly in search results.
       We describe a novel system for 'new-Web' search based on links retrieved from the Twitter micro-blogging service. The Twitter service allows individuals, organisations and governments to rapidly disseminate very short messages to a wide variety of interested parties. When a Twitter message contains a URL, we use the Twitter message as a description of the URL's target. As Twitter is frequently used for discussion of current events, these messages offer useful, up-to-date annotations and instantaneous popularity readings for a small, but timely, portion of the Web.
       Our working system is simple and fast and we believe may offer a significant advantage in revealing new information on the Web that would otherwise be hidden from searchers. Beyond the basic system, we anticipate the Twitter messages may add supplementary terms for a URL, or add weight to existing terms, and that the reputation or authority of each message sender may serve to weight both annotations and query-independent popularity.
    Keywords: demonstration, information retrieval, microblogging, search, twitter, web search
    A framework for querying graph-based business process models BIBAKFull-Text 1297-1300
      Sherif Sakr; Ahmed Awad
    We present a framework for querying and reusing graph-based business process models. The framework is based on a new visual query language for business processes called BPMN-Q. The language addresses processes definitions and extends the standard BPMN visual notations for modeling business processes for its concrete syntax. BPMN-Q is used to query process models by matching a process model graph to a query graph. Moreover, the reusing framework is enhanced with a semantic query expander component. This component provides the users with the flexibility to get not only the perfectly matched process models to their queries but also the models with high similarity. The query engine of the framework is built on top of traditional RDBMS. A novel decomposition based and selectivity-aware relational processing mechanism is employed to achieve an efficient and scalable performance for graph-based BPMN-Q queries.
    Keywords: querying business process -- BPMN -- process models
    Sig.ma: live views on the web of data BIBAKFull-Text 1301-1304
      Giovanni Tummarello; Richard Cyganiak; Michele Catasta; Szymon Danielczyk; Renaud Delbru; Stefan Decker
    We demonstrate Sig.ma, both a service and an end user application to access the Web of Data as an integrated information space.
       Sig.ma uses an holistic approach in which large scale semantic web indexing, logic reasoning, data aggregation heuristics, ad hoc ontology consolidation, external services and responsive user interaction all play together to create rich entity descriptions. These consolidated entity descriptions then form the base for embeddable data mashups, machine oriented services as well as data browsing services. Finally, we discuss Sig.ma's peculiar characteristics and report on lessons learned and ideas it inspires.
    Keywords: aggregated search, RDFa, semantic web, web of data
    Not so creepy crawler: easy crawler generation with standard xml queries BIBAKFull-Text 1305-1308
      Franziska von dem Bussche; Klara Weiand; Benedikt Linse; Tim Furche; François Bry
    Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like last.fm. In these cases, pages are far more uniformly structured than in the general Web and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis.
       In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2 ). What sets nc2 apart, is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers.
       We demonstrate nc2 with two applications: The first extracts data about cities from Wikipedia with a customizable set of attributes for selecting and reporting these cities.
       It illustrates the power of nc2 where data extraction from Wiki-style, fairly homogeneous knowledge sites is required.
       In contrast, the second use case demonstrates how easy nc2 makes even complex analysis tasks on social networking sites, here exemplified by last.fm.
    Keywords: data extraction, web crawler, web query, xml
    MindFinder: image search by interactive sketching and tagging BIBAKFull-Text 1309-1312
      Changhu Wang; Zhiwei Li; Lei Zhang
    In this technical demonstration, we showcase the MindFinder system $-$ a novel image search engine. Different from existing interactive image search engines, most of which only provide image-level relevance feedback, MindFinder enables users to sketch and tag query images at object level. By considering the image database as a huge repository, MindFinder is able to help users present and refine their initial thoughts in their mind, and finally turn thoughts to a beautiful image(s). Multiple actions are enabled for users to flexibly design their queries in a bilateral interactive manner by leveraging the whole image database, including tagging, refining query by dragging and dropping objects from search results, as well as editing objects. After each action, the search results will be updated in real time to provide users up-to-date materials to further formulate the query. By the deliberate but easy design of the query, MindFinder not only tries to enable users to present on the query panel whatever they are imagining, but also returns to users the most similar images to the picture in users' mind. By scaling up the image database to 10 million, MindFinder has the potential to reveal whatever in users' mind, that is where the name MindFinder comes from.
    Keywords: MindFinder, interactive search, sketching, tagging
    FormSys: form-processing web services BIBAKFull-Text 1313-1316
      Ingo M. Weber; Hye-young Paik; Boualem Benatallah; Zifei Gong; Liangliang Zheng; Corren Vorwerk
    In this paper we present FormSys, a Web-based system that service-enables form documents. It offers two main services: filling in forms based on Web services' incoming SOAP messages, and invoking Web services based on filled-in forms. This can be applied to benefit individuals to reduce the number of often repetitive form fields they have to complete manually in many scenarios. It can also help organisations to remove the need for manual data entry by automatically triggering business process implementations based on incoming case data from filled-in forms. While the concept applies to forms of any type of document, our implementation uses Adobe AcroForms due to its universal applicability, availability of a usable API, and end-user appeal. In the demo, we will show the two core functions, namely soap2pdf and pdf2soap, along with use case applications of the services developed from real world scenarios. Essentially, this work demonstrates how PDFs can be used as a channel for interacting with Web services.
    Keywords: documents, processes, service-enabled forms, web services
    Falconer: once SIOC meets semantic search engine BIBAKFull-Text 1317-1320
      Gang Wu; Mengdong Yang; Ke Wu; Guilin Qi; Yuzhong Qu
    Falconer is a semantic Web search engine enhanced SIOC (Semantically-Interlinked Online Communities) application, which is designed to demonstrate the ability of accelerating the creation and reuse process of semantic Web data with easy-to-use user interfaces. In this process, semantic Web search engines feed existing semantic data into the SIOC framework, where new semantic data are composed by the community and indexed again by those search engines. Compared to existing social (semantic) Web applications, Falconer inherently conforms to SIOC specification. It provides semantic search engine based user registration suggestion, friends auto-discovery, and semantic annotation for forum post content. Another distinctive feature is that it enables users to subscribe any resource having a URI as the topic they are interested in. The relationships among users, topics, and posts are further visualized for analyzing the topic trends in the community. As all semantic data are formatted in RDF and RDFa, they can be queried with SPARQL query language.
    Keywords: SIOC, semantic search engine, social semantic web
    Interactive image search by 2D semantic map BIBAKFull-Text 1321-1324
      Hao Xu; Jingdong Wang; Xian-Sheng Hua; Shipeng Li
    In this demo, we present a novel interactive image search system, image search by 2D semantic map. This system enables users to indicate what semantic concepts are expected to appear and even how these concepts are spatially distributed in the desired images. To this end, we design an intuitive interface for users to formulate a query in the form of 2D semantic map, called concept map, by typing textual queries in a blank canvas. In the ranking process, by interpreting each textual concept as a set of representative visual instances, the concept map query is translated into a visual instance map, which is then used for comparison with the images in the database. Besides, in this demo, we also show an image search system with a simplest semantic map, a 2D color map, where the concepts are limited from the colors.
    Keywords: color map, concept map, interactive image search
    eduKEN: a tool for fine-grained video comment collection and analysis BIBAKFull-Text 1325-1328
      Wai Gen Yee; Andrew Yates; Ophir Frieder; Armin Moehrle
    An increasing amount of Web information is in video format. Today's search technology allows videos to be found using graphical features and textual descriptions. However, the information gleaned from video features is coarse, while textual descriptions are often short and fail to capture the precise content of videos. We hypothesize that user comments contain supplemental information that effectively describes the content of a video. This information, once extracted, can be applied to a search engine index to improve video search accuracy.
       A complete comment-based indexing system must encourage users to post comments and be able to analyze comment data. To support research in the relevant areas of interface design, text analysis, and information retrieval, we present eduKEN, a flexible Web-based tool that allows comment collection, analysis, and search. To support the collection of real-world comment data in an educational setting, eduKEN includes an additional user interface designed for classroom use.
    Keywords: annotations, search
    ZoomRDF: semantic fisheye zooming on RDF data BIBAKFull-Text 1329-1332
      Kang Zhang; Haofen Wang; Duc Thanh Tran; Yong Yu
    With the development of Semantic Web in recent years, an increasing amount of semantic data has been created in form of Resource Description Framework (RDF). Current visualization techniques help users quickly understand the underlying RDF data by displaying its structure in an overview. However, detailed information can only be accessed by further navigation. An alternative approach is to display the global context as well as the local details simultaneously in a unified view. This view supports the visualization and navigation on RDF data in an integrated way. In this demonstration, we present ZoomRDF, a framework that: i) adapts a space-optimized visualization algorithm for RDF, which allows more resources to be displayed, thus maximizes the utilization of display space, ii) combines the visualization with a fisheye zooming concept, which assigns more space to some individual nodes while still preserving the overview structure of the data, iii) considers both the importance of resources and the user interaction on them, which offers more display space to those elements the user may be interested in. We implement the framework based on the Gene Ontology and demonstrate that it facilitates tasks like RDF data exploration and editing.
    Keywords: RDF data, information visualization, semantic web
    Optimizing user interaction for web-based mobile tasks BIBAKFull-Text 1333-1336
      Dong Zhou; Ajay Chander; Hiroshi Inamura
    This paper describes the design and prototype implementation of MIntOS (Mobile Interaction Optimization System), a system for improving mobile interaction in web-based activities. MIntOS monitors users' interactions both for gathering interaction history and for the runtime construction of interaction context. A simple approach based on interaction burstiness is used to break interaction sequences into Trails, which approximates user tasks. Such Trails are used to generate rules for online, context-sensitive prediction of future interaction sequences. Predicted user interaction sequences are then optimized to reduce the amount of user input and user wait time using techniques such as interaction short-cuts, automatic text copying and form-filling, as well as page pre-fetching. Such optimized interaction sequences are, at real-time, recommended to the user through UI enhancements in a non-intrusive manner.
    Keywords: interaction trail, trail optimization, user interaction monitoring, web-based mobile tasks


    Search is dead!: long live search BIBAKFull-Text 1337-1338
      Andrei Broder; Elizabeth F. Churchill; Marti Hearst; Barney Pell; Prabhakar Raghavan; Andrew Tomkins
    Back in the heady days of 1999 and WWW8 (Toronto) we held a panel titled "Finding Anything in the Billion Page Web: Are Algorithms the Key?" In retrospect the answer to this question seems laughably obvious -- the search industry has burgeoned on a foundation of algorithms, cloud computing and machine learning. As we move into the second decade of this millennium, we are confronted with a dizzying array of new paradigms for finding content, including social networks and location-based search and advertising. This panel pulls together senior experts from academia and the major search principals to debate whether search will continue to look anything like the 2-keywords-give-10-blue-links paradigm that Google has popularized. What do emerging approaches and paradigms -- natural language search, social search, location-based search -- mean for the future of search in general?
    Keywords: algorithms, location-based search, search, social search
    Video search: are algorithms all we need? BIBAKFull-Text 1339-1340
      Jeff Ubois; Jamie Davidson; Marko Grobelnik; Paul Over; Hans Westerhof
    This panel will debate various approaches to improving video search, and explore how professional cataloguing, crowd sourced metadata, and improvements in search algorithms will evolve over the next ten years. Panelists will explore the needs of large scale video archives, and compare these against the current capabilities of video search.
    Keywords: algorithms, archives, design, search, video
    What the web can't do BIBAKFull-Text 1341-1342
      David A. Shamma; Seth Fitzsimmonds; Joe Gregorio; Adam Hupp; Ramesh Jain; Kevin Marks
    This panel discusses how polling in the HTTPd protocol affects how we are building the next generation of the web and its applications. As other technologies (HTML, Javascript, etc.) move forward, we ask should the web's protocol also evolve or is it sufficient for the web to continue through just GET and POST?
    Keywords: http, rest, streaming, www, xmpp


    Web search engine metrics: (direct metrics to measure user satisfaction) BIBAKFull-Text 1343-1344
      Ali Dasdan; Kostas Tsioutsiouliklis; Emre Velipasaoglu
    Search engines are important resources for finding information on the Web. They are also important for publishers and advertisers to present their content to users. Thus, user satisfaction is key and must be quantified. In this tutorial, we give a practical review of web search metrics from a user satisfaction point of view. We cover metrics for relevance, comprehensiveness, coverage, diversity, discovery freshness, content freshness, and presentation. We will also describe how these metrics can be mapped to proxy metrics for the stages of a generic search engine pipeline. The practitioners can apply these metrics readily and the researchers can get motivation for new problems to work on, especially in formalizing and refining metrics.
    Keywords: comprehensiveness metrics, content freshness metrics, coverage metrics, discovery freshness metrics, diversity metrics, presentation metrics, relevance metrics, web metrics, web search
    Enterprise and desktop search BIBAKFull-Text 1345-1346
      Pavel Dmitriev; Pavel Serdyukov; Sergey Chernov
    With the growing amount of information on users' desktops and increasing scale and complexity of intranets, Enterprise and Desktop Search are becoming two increasingly important Information Retrieval applications. While the challenges arising there are not completely different from those that the web community has faced for years, advanced web search solutions are often unable to address them properly. In this tutorial we give a research prospective on distinctive features of both Enterprise and Desktop Search, explain typical search scenarios, and review existing ranking techniques and algorithms.
    Keywords: desktop search, enterprise search, expert search, exploratory search, user feedback
    How to consume linked data on the web: tutorial description BIBAKFull-Text 1347-1348
      Olaf Hartig; Juan Sequeda; Jamie Taylor; Patrick Sinclair
    In the past two years, the amount of data published in RDF and following the Linked Data principles has increased dramatically. Everyday people are publishing datasets as Linked Data. However, applications that consume Linked Data are not mainstream yet. To overcome this issue, we present a beginners tutorial on consuming Linked Data. We will discuss existing techniques how users can currently consume Linked Data and use it in their current applications.
    Keywords: application, consuming, linked data, semantic web, web of data, web of linked data
    Optimal marketing and pricing over social networks BIBAKFull-Text 1349-1350
      Nicole Immorlica; Vahab S. Mirrokni
    We discuss the use of social networks in implementing viral marketing strategies. In the first part of this tutorial, we study influence maximization or how the structure of the social network affects the spread of behaviors and technologies. In the second part, we then consider how one might monopolize these natural processes to generate revenue in a revenue maximization setting.
    Keywords: influence, revenue maximization, social networks
    Web search/browse log mining: challenges, methods, and applications BIBAKFull-Text 1351-1352
      Daxin Jiang; Jian Pei; Hang Li
    Huge amounts of search and browse log data has been accumulated in various search engines. Such massive search/browse log data, on the one hand, provides great opportunities to mine the wisdom of crowds and improve Web search as well as online advertisement. On the other hand, designing effective and efficient algorithms and tools to clean, model, and process large scale log data presents great challenges.
       In this tutorial, we give a systematic survey on the applications, challenges, fundamental principles and state-of-the-art methods of mining large scale search and browse log data. We start with an introduction of search and browse log data and an overview of various log mining applications. Then, we focus on four popular areas of log mining applications, namely query understanding, document understanding, query-document matching, and user understanding. For each area, we review the major tasks, analyze the challenges, and exemplify several representative solutions. Finally, we discuss several new directions in search/browse log mining.
       The tutorial slides are available at the authors' homepages after the tutorial is presented.
    Keywords: log mining, search and advertisement applications, search and browse logs
    Applications of open search tools BIBAKFull-Text 1353-1354
      Rosie Jones; Ted Drake
    It costs about 300M dollars to just build a search system that scales to the web. Web services which open up web search results to the public allow academics, developers, and entrepreneurs to achieve instant web search parity for free and enable them to focus on building their additional secret sauce on top to create even grander relevant services. For example, a social network could leverage open search and their data about users to personalize web search. Additionally, one of the best ways to gather data about web search behavior is to build your own search system. Proto-type IR and web search systems based on open search can be used to gather user interaction data and test the applicability of research ideas. Open Source tools like Lucene and Nutch and open search services like from major search engines can greatly help developers implement these types of systems quickly, allowing for fast production and evaluation. We will give detailed overviews of the popular open search tools, showcasing examples of search and IR algorithms and systems implemented using them, as well as discussing how evaluation can be performed.
    Keywords: open search tools, open source search
    Introduction to social recommendation BIBAKFull-Text 1355-1356
      Irwin King; Michael R. Lyu; Hao Ma
    As the exponential growth of information generated on the World Wide Web, Social Recommendation has emerged as one of the hot research topics recently. Social Recommendation forms a specific type of information filtering technique that attempts to suggest information (blogs, news, music, travel plans, web pages, images, tags, etc.) that are likely to interest the users. Social Recommendation involves the investigation of collective intelligence by using computational techniques such as machine learning, data mining, natural language processing, etc. on social behavior data collected from blogs, wikis, recommender systems, question & answer communities, query logs, tags, etc. from areas such as social networks, social search, social media, social bookmarks, social news, social knowledge sharing, and social games. In this tutorial, we will introduce Social Recommendation and elaborate on how the various characteristics and aspects are involved in the social platforms for collective intelligence. Moreover, we will discuss the challenging issues involved in Social Recommendation in the context of theory and models of social networks, methods to improve recommender systems using social contextual information, ways to deal with partial and incomplete information in the social context, scalability and algorithmic issues with social computational techniques.
    Keywords: collaborative filtering, collective intelligence, recommender systems, social network model and theories, social recommendation
    Linking content in unstructured sources BIBAKFull-Text 1357-1358
      Marie-Francine Moens
    This tutorial focuses on the task of automated information linking in text and multimedia sources. In any task where information is fused from different sources, this linking is a necessary step. To solve the problem we borrow methods from computational linguistics, computer vision and data mining. Although the main focus is on finding equivalence relations in the sources, the tutorial opens views on the recognition of other relation types.
    Keywords: content alignment and linking
    RESTful web services: principles, patterns, emerging technologies BIBAKFull-Text 1359-1360
      Cesare Pautasso; Erik Wilde
    Recent technology trends in Web services indicate that a solution eliminating the perceived complexity of the WS-* standard technology stack may be in sight: advocates of Representational State Transfer (REST) have come to believe that their ideas explaining why the World Wide Web works are just as applicable to solve enterprise application integration problems and to radically simplify the plumbing required to implement a Service-Oriented Architecture (SOA). In this tutorial we give an introduction to the REST architectural style as the foundation for RESTful Web services. The tutorial starts from the basic design principles of REST and how they are applied to service oriented computing. Service-orientation concentrates on identifying self-contained units of functionality, which should then be exposed as easily reusable and repurposable services. This tutorial focuses not on the identification of those units, but on how to design the services representing them. We explain how decisions on the SOA level already shape the architectural style that will be used for the eventual IT architecture, and how the SOA process itself has to be controlled to yield services which can then be implemented RESTfully. We do not claim that REST is the only architectural style that can be used for SOA design, but we do argue that it does have distinct advantages for loosely coupled services and massive scale, and that any SOA approach already has to be specifically RESTful on the business level to yield meaningful input for IT architecture design.
    Keywords: REST, SOA, patterns, web services

    Developers' track

    Implementing the media fragments URI specification BIBAKFull-Text 1361-1364
      Davy Van Deursen; Raphaël Troncy; Erik Mannens; Silvia Pfeiffer; Yves Lafon; Rik Van de Walle
    In this paper, we describe two examples of implementations of the Media Fragments URI specification which is currently being developed by the W3C Media Fragments Working Group. The group's mission is to create standard addressing schemes for media fragments on the Web using Uniform Resource Identifiers (URIs). We describe two scenarios to illustrate the implementations. More specifically, we show how User Agents (UA) will either be able to resolve media fragment URIs without help from the server, or will make use of a media fragments-aware server. Finally, we present some ongoing discussions and issues regarding the implementation of the Media Fragments specification.
    Keywords: media fragments, video accessibility, video url
    Exposing audio data to the web: an API and prototype BIBAKFull-Text 1365-1368
      David Humphrey; Corban Brook; Alistair MacDonald
    The HTML5 specification introduces the audio and video media elements, and with them the opportunity to change the way media is integrated on the web. The current HTML5 media API provides ways to play and get limited information about audio and video, but no way to programatically access or create such media. In this paper we present an enhanced API for these media elements, as well as details about a Mozilla Firefox implementation created by the authors, which allows web developers to read and write raw audio data.
    Keywords: audio, fft, firefox, html5
    Enabling WebGL BIBAKFull-Text 1369-1370
      Catherine Leung; Andor Salga
    WebGL leverages the power of OpenGL to present accelerated 3D graphics on a webpage. The ability to put hardware-accelerated 3D content in the browser will provide a means for the creation of new web based applications that were previously the exclusive domain of the desktop environment. It will also allow the inclusion of features that standalone 3D applications do not have. While WebGL succeeds in bringing the power and low-level API of OpenGL to the browser, it also expects a lot of web developers, who are used to the DOM and JavaScript libraries like jQuery. This paper will look at how mid level APIs can help web developers create unique 3D content that is more than just duplicates of a standalone desktop application on a web page. We will present one such web application named Motionview, built with C3DL, that provides a new means for artist and motion capture studios to communicate with each other. We will also highlight some upcoming project ideas that make use of 3D browser technology in a way that would not have been possible in a desktop environment.
    Keywords: 3d, browser, canvas, javascript, motion capture, processing.js, WebGL
    The spoken web: software development and programming through voice BIBAKFull-Text 1371-1374
      Arun Kumar; Sheetal K. Agarwal; Priyanka Manwani
    It has been a constant aim of computer scientists, programming language designers and practitioners to raise the level of programming abstractions and bring them as close to the user's natural context as possible. The efforts started right from our transition from machine code programming to assembly language programming, from there to high level procedural languages, followed by object oriented programming. Nowadays, service oriented software development and composition are the norm.
       There have also been notable efforts such as Alice system from CMU to simplify the programming experience through the use of 3D virtual worlds. The holy grail has been to enable non-technical users such as kids or non-technical people to be able to understand and pick up programming and software development easily. We present a novel approach to software development that lets people use their voice to program or create new software through composition. We demonstrate some basic programming tasks achieved by simply talking to a system over an ordinary phone. Such programs constructed by talking can be created in user's local language and do not require IT literacy or even literacy as a prerequisite.
       We believe this approach will have a deep impact on software development, especially development of web based software in the very near future.
    Keywords: programming by voice, spoken web, voicesites
    'Follow me': a web-based, location-sharing architecture for large, indoor environments BIBAKFull-Text 1375-1378
      Polychronis Ypodimatopoulos; Andrew Lippman
    We leverage the ubiquity of bluetooth-enabled devices and propose a decentralized, web-based architecture that allows users to share their location by following each other in the style of Twitter. We demonstrate a prototype that operates in a large building which generates a dataset of detected bluetooth devices at a rate of 30 new devices per day, including the respective location where they were last detected. Users then query the dataset using their unique bluetooth ID and share their current location with their followers by means of unique URIs that they control. Our separation between producers (the building) and consumers (the users) of bluetooth device location data allows us to create socially-aware applications that respect user's privacy while limiting the software necessary to run on mobile devices to just a web browser.
    Keywords: agents, bluetooth, location sharing
    IBM's jazz integration architecture: building a tools integration architecture and community inspired by the web BIBAKFull-Text 1379-1382
      Scott Rich
    In this paper, we describe our experience in the Jazz project, beginning from a "Classical" repository- and Java-centric design and evolving towards an architecture which borrows heavily from the architecture of the Web. Along the way, we formed an open community to collaborate on adapting this architecture to various tool domains. Finally, we discuss our experience delivering the first generation of tools built and integrated using these techniques.
    Keywords: representational state transfer, rest, software development
    TWC data-gov corpus: incrementally generating linked government data from data.gov BIBAKFull-Text 1383-1386
      Li Ding; Dominic DiFranzo; Alvaro Graves; James R. Michaelis; Xian Li; Deborah L. McGuinness; James A. Hendler
    The Open Government Directive is making US government data available via websites such as Data.gov for public access. In this paper, we present a Semantic Web based approach that incrementally generates Linked Government Data (LGD) for the US government. In focusing on the trade-off between high quality LGD generation (requiring non-trivial human expert input) and massive LGD generation (requiring low human processing cost), our work is highlighted by the following features: (i) supporting low-cost and extensible LGD publishing for massive government data; (ii) using Social Semantic Web (Web3.0) technologies to incrementally enhance published LGD via crowdsourcing, and (iii) facilitating mash-ups by declaratively reusing cross-dataset mappings which usually are hard-coded in applications.
    Keywords: crowdsourcing, linked data, linked government data, open government, social semantic web