HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 04-104-205-105-2060708091011-111-212-112-213-113-214-114-215-115-2

Proceedings of the 2011 International Conference on the World Wide Web

Fullname:Proceedings Companion of the 20th International Conference on World Wide Web
Editors:S. Sadagopan; Krithi Ramamritham; Arun Kumar; M. P. Ravindra; Elisa Bertino; Ravi Kumar
Location:Hyderabad, India
Dates:2011-Mar-28 to 2011-Apr-01
Volume:2
Publisher:ACM
Standard No:ISBN 1-4503-0637-3, 978-1-4503-0637-9; ACM DL: Table of Contents hcibib: WWW11-2
Papers:166
Pages:524
Links:Conference Home Page
  1. WWW 2011-03-28 Volume 2
    1. Poster session
    2. Demo session
    3. Tutorials
    4. Workshop summaries
    5. Panel session
    6. PhD symposium
    7. Emerging regions

WWW 2011-03-28 Volume 2

Poster session

Toward optimal vaccination strategies for probabilistic models BIBAFull-Text 1-2
  Zeinab Abbassi; Hoda Heidari
Epidemic outbreaks such as the recent H1N1 influenza show how susceptible large communities are toward the spread of such outbreaks. The occurrence of a widespread disease transmission raises the question of vaccination strategies that are appropriate and close to optimal. The seemingly different problem of viruses disseminating through email networks, shares a common structure with disease epidemics. While it is not possible to vaccinate every individual during a virus outbreak, due to economic and logistical constraints, fortunately, we can leverage the structure and properties of face-to-face social networks to identify individuals whose vaccination would result in a lower number of infected people.
   The models that have been studied so far [3, 4] assume that once an individual is infected all its adjacent individuals would be infected with probability 1. However, this assumption is not realistic. In reality, if an individual is infected by a virus, the neighboring individuals would get infected with some probability (depending on the type of the disease and the contact). This modification to the model makes the problem more challenging as the simple version is already NP-complete [3].
   Here we consider the following epidemiological model computationally: A number of individuals in the community get vaccinated which makes them immune to the disease. The disease then outbreaks and a number of nodes that are not vaccinated get infected at random. These nodes can transmit the infection to their friends with some probability. In this work we consider the optimization problem in which the number of nodes that get vaccinated is limited to k and our objective is to minimize the number of infected people overall. We design various algorithms that take into account the properties of social networks to select k nodes for vaccination in order to achieve the goal. We perform experiments on a real dataset of 34,546 vertices and 421,578 edges and assess their effectiveness and scalability.
Timestamp-based cache invalidation for search engines BIBAFull-Text 3-4
  Sadiye Alici; Ismail Sengor Altingovde; Rifat Ozcan; B. Barla Cambazoglu; Ozgür Ulusoy
We propose a new mechanism to predict stale queries in the result cache of a search engine. The novelty of our approach is in the use of timestamps in staleness predictions. We show that our approach incurs very little overhead on the system while its prediction accuracy is comparable to earlier works.
Towards automatic quality assurance in Wikipedia BIBAFull-Text 5-6
  Maik Anderka; Benno Stein; Nedim Lipka
Featured articles in Wikipedia stand for high information quality, and it has been found interesting to researchers to analyze whether and how they can be distinguished from "ordinary" articles. Here we point out that article discrimination falls far short of writer support or automatic quality assurance: Featured articles are not identified, but are made. Following this motto we compile a comprehensive list of information quality flaws in Wikipedia, model them according to the latest state of the art, and devise one-class classification technology for their identification.
Measuring the effectiveness of display advertising: a time series approach BIBAFull-Text 7-8
  Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Brad Null
We develop an approach for measuring the effectiveness of online display advertising at the campaign level. We present a Kalman filtering approach to deseasonalize and estimate the percentage changes of online sales on a daily basis. For this study, we analyze 3828 campaigns for 961 products on the Advertising.com network.
A middleware for securing mobile mashups BIBAFull-Text 9-10
  Florent Batard; Karima Boudaoud; Michel Riveill
Mashups on traditional desktop devices are a well-known source of security risks. In this paper, we examine how these risks translate to mobile mashups and identify new risks caused by mobile-specific characteristics such as access to device features or offline operation. We describe the design of SCCM, a platform independent approach to handle the various mobile mashup security risks in a consistent and systematic manner. Evaluating an SCCM implementation for Android, we find that SCCM successfully protects against common attacks such as inserting a malicious widget from the outside.
Language independent identification of parallel sentences using Wikipedia BIBAFull-Text 11-12
  Rohit G. Bharadwaj; Vasudeva Varma
This paper details a novel classification based approach to identify parallel sentences between two languages in a language independent way. We substitute the required language specific resources by the richly structured multilingual content, Wikipedia. Our approach is particularly useful to extract parallel sentences for under-resourced languages like most Indian and African languages, where resources are not readily available with necessary accuracies. We extract various statistics based on the cross lingual links present in Wikipedia and use them to generate feature vectors for each sentence pair. Binary classification of each pair of sentences into parallel or non-parallel has been done using these feature vectors. We achieved a precision up to 78% which is encouraging when compared to other state-of-art approaches. These results support our hypothesis of using Wikipedia to evaluate the parallel coefficient between sentences that can be used to build bilingual dictionaries.
From actors, politicians, to CEOs: domain adaptation of relational extractors using a latent relational mapping BIBAFull-Text 13-14
  Danushka Bollegala; Yutaka Matsuo; Mitsuru Ishizuka
We propose a method to adapt an existing relation extraction system to extract new relation types with minimum supervision. Our proposed method comprises two stages: learning a lower-dimensional projection between different relations, and learning a relational classifier for the target relation type with instance sampling. We evaluate the proposed method using a dataset that contains 2000 instances for 20 different relation types. Our experimental results show that the proposed method achieves a statistically significant macro-average F-score of 62.77. Moreover, the proposed method outperforms numerous baselines and a previously proposed weakly-supervised relation extraction method.
Recommendations for the long tail by term-query graph BIBAFull-Text 15-16
  Francesco Bonchi; Raffaele Perego; Fabrizio Silvestri; Hossein Vahabi; Rossano Venturini
We define a new approach to the query recommendation problem. In particular, our main goal is to design a model enabling the generation of query suggestions also for rare and previously unseen queries. In other words we are targeting queries in the long tail. The model is based on a graph having two sets of nodes: Term nodes, and Query nodes. The graph induces a Markov chain on which a generic random walker starts from a subset of Term nodes, moves along Query nodes, and restarts (with a given probability) only from the same initial subset of Term nodes. Computing the stationary distribution of such a Markov chain is equivalent to extracting the so-called Center-piece Subgraph from the graph associated with the Markov chain itself. Given a query, we extract its terms and we set the restart subset to this term set. Therefore, we do not require a query to have been previously observed for the recommending model to be able to generate suggestions.
Efficient diversification of search results using query logs BIBAFull-Text 17-18
  Gabriele Capannini; Franco Maria Nardini; Raffaele Perego; Fabrizio Silvestri
We study the problem of diversifying search results by exploiting the knowledge mined from query logs. Our proposal exploits the presence of different "specializations" of queries in query logs to detect the submission of ambiguous/faceted queries, and manage them by diversifying the search results returned in order to cover the different possible interpretations of the query. We present an original formulation of the results diversification problem in terms of an objective function to be maximized that admits the finding of an optimal solution in linear time.
EntityTagger: automatically tagging entities with descriptive phrases BIBAFull-Text 19-20
  Kaushik Chakrabarti; Surajit Chaudhuri; Tao Cheng; Dong Xin
We consider the problem of entity tagging: given one or more named entities from a specific domain, the goal is to automatically associate descriptive phrases, referred to as etags (entity tags), to each entity. Consider a product catalog containing product names and possibly short descriptions. For a product in the catalog, say Ricoh G600 Digital Camera, we want to associate etags such as "water resistant", "rugged" and "outdoor" to it, even though its name or description does not mention those phrases. Entity tagging can enable more effective search over entities. We propose to leverage signals in web documents to perform such tagging. We develop techniques to perform such tagging in a domain independent manner while ensuring high precision and high recall.
Web-scale entity-relation search architecture BIBAFull-Text 21-22
  Soumen Chakrabarti; Devshree Sane; Ganesh Ramakrishnan
Enabling entity search and ranking at Web-scale is fraught with many challenges: annotating the corpus with entities and types, query language design, index design, query processing logic, and answer consolidation. We describe a Web-scale entity search engine we are building to handle over a billion Web pages, over 200,000 types, over 1,500,000 entities, and hundreds of entity annotations per page. We describe the design of compressed, token span oriented indices for entity and type annotations. Our prototype demonstrates the practicality of Web-scale entity-relation search.
Survivability-oriented self-tuning of web systems BIBAFull-Text 23-24
  Bihuan Chen; Xin Peng; Yijun Yu; Wenyun Zhao
Running in a highly uncertain and changing environment, Web systems cannot always provide full set of services with optimal quality, especially when the workload is high or failures in subsystems occur frequently. It is thus desirable to continuously maintain a high satisfaction level of the system value proposition, hereafter survivability assurance, while relaxing/sacrificing certain quality/functional requirements that are not crucial to the survival of the Web systems. In this paper, we propose a requirements-driven self-tuning method for survivability assurance of Web systems. Using a value-based feedback controller plus a requirements-oriented reasoner, our method makes both quality and functional requirements tradeoffs decisions at runtime.
Learning facial attributes by crowdsourcing in social media BIBAFull-Text 25-26
  Yan-Ying Chen; Winston H. Hsu; Hong-Yuan Mark Liao
Facial attributes such as gender, race, age, hair style, etc., carry rich information for locating designated persons and profiling the communities from image/video collections (e.g., surveillance videos or photo albums). For plentiful facial attributes in photos and videos, collecting costly manual annotations for training detectors is time-consuming. We propose an automatic facial attribute detection method by exploiting the great amount of weakly labelled photos in social media. Our work can (1) automatically extract training images from the semantic-consistent user groups and (2) filter out noisy training photos by multiple mid-level features (by voting). Moreover, we introduce a method to harvest less-biased negative data for preventing uneven distribution of certain attributes. The experiments show that our approach can automatically acquire training photos for facial attributes and is on par with that by manual annotations.
Generating summaries for ontology search BIBAFull-Text 27-28
  Gong Cheng; Weiyi Ge; Yuzhong Qu
This poster proposes a novel approach for generating summaries for ontology search. Following previous work, we define ontology summarization as the problem of ranking and selecting RDF sentences, for which we examine three aspects. Firstly, to assess the salience of RDF sentences in an ontology, we devise a bipartite graph model for representing the ontology and analyze random walks on this graph. Secondly, to reflect how an ontology is matched with user needs expressed via keyword queries, we incorporate query relevance into the selection of RDF sentences. Finally, to improve the unity of a summary, we optimize its cohesion in terms of the connections between constituent RDF sentences. We have implemented an online prototype system called Falcons Ontology Search.
Enhancing web search with entity intent BIBAFull-Text 29-30
  Na Dai; Xiaoguang Qi; Brian D. Davison
Web entities, such as documents and hyperlinks, are created for different purposes, or intents. Existing intent-based retrieval methods largely focus on information seekers' intent expressed by queries, ignoring the other side of the problem: web content creators' intent. We argue that understanding why the content was created is also important. In this work, we propose to classify such intents into two broad categories: "navigational" and "informational". Then we incorporate such intents into traditional retrieval models, and show their effect on ranking performance.
Query completion without query logs for song search BIBAFull-Text 31-32
  Nitin Dua; Kanika Gupta; Monojit Choudhury; Kalika Bali
We describe a new method for query completion for Bollywood song search without using query logs. Since song titles in non-English languages (Hindi in our case) are mostly present as Roman transliterations of the native script, both the queries and documents have a large number of valid variations. We address this problem by using a Roman to Hindi transliteration engine coupled with appropriate ranking and implementation strategies. Out of 100 test cases, our system could generate the correct suggestion for 91 queries.
OntoWiki mobile: knowledge management in your pocket BIBAFull-Text 33-34
  Timofey Ermilov; Norman Heino; Sören Auer
As comparatively powerful mobile computing devices become more common, mobile web applications have started gaining in popularity. Such mobile web applications as Google Mail or Calendar are already in everyday use of millions of people. Some first examples of these applications use Semantic Web technologies and information in the form of RDF (e.g. TripIt). An important feature of these applications is their ability to provide offline functionality with local updates for later synchronization with a web server. The key problem to this is the reconciliation, i.e. the problem of potentially conflicting updates from disconnected clients. In this paper we present an approach for a mobile semantic collaboration platform based on the OntoWiki framework [1]. It allows users to collect instance data and refine the structure knowledge bases on the go. A crucial part of OntoWiki Mobile is the advanced conflict resolution for RDF stores. The approach is based on the EvoPat [2] method for data evolution and ontology refactoring.
HyLiEn: a hybrid approach to general list extraction on the web BIBAFull-Text 35-36
  Fabio Fumarola; Tim Weninger; Rick Barber; Donato Malerba; Jiawei Han
We consider the problem of automatically extracting general lists from the web. Existing approaches are mostly dependent upon either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists, and the structural representation of items contained in them. We show that our method significantly outperforms existing methods.
WonderWhat: real-time event determination from photos BIBAFull-Text 37-38
  Mingyan Gao; Xian-Sheng Hua; Ramesh Jain
How often did you feel disappointed in a foreign country, when you had been craving for participating in authentic native events but miserably ended up with being lost in the crowd, due to ignorance of the local culture? Have you ever imagined that with merely a simple click, a tool can identify the events that are right in front of you? As a step in this direction, in this paper, we propose a system that provides users with information of the public events that they are attending by analyzing in real time their photos taken at the event, leveraging both spatio-temporal context and photo content. To fulfill the task, we designed the system to collect event information, maintain dedicated event database, build photo content model for event types, and rank the final results. Extensive experiments were conducted to prove the effectiveness of each component.
Identifying overlapping communities in folksonomies or tripartite hypergraphs BIBAFull-Text 39-40
  Saptarshi Ghosh; Pushkar Kane; Niloy Ganguly
Online folksonomies are modeled as tripartite hypergraphs, and detecting communities from such networks is a challenging and well-studied problem. However, almost every existing algorithm known to us for community detection in hypergraphs assign unique communities to nodes, whereas in reality, nodes in folksonomies belong to multiple overlapping communities e.g. users have multiple topical interests, and the same resource is often tagged with semantically different tags. In this paper, we propose an algorithm to detect overlapping communities in folksonomies by customizing a recently proposed edge-clustering algorithm (that is originally for traditional graphs) for use on hypergraphs.
Spammers' networks within online social networks: a case-study on Twitter BIBAFull-Text 41-42
  Saptarshi Ghosh; Gautam Korlam; Niloy Ganguly
We analyze the strategies employed by contemporary spammers in Online Social Networks (OSNs) by identifying a set of spam-accounts in Twitter and monitoring their link-creation strategies. Our analysis reveals that spammers adopt intelligent 'collaborative' strategies of link-formation to avoid detection and to increase the reach of their generated spam, such as forming 'spam-farms' and creating large number of links with targeted legitimate users. The observations are verified through the analysis of a giant 'spam-farm' embedded within the Twitter OSN.
Networked hierarchies for web directories BIBAFull-Text 43-44
  Nazli Goharian; Saket S. R. Mengle
The hierarchical nature of existing Web directories, ontologies, and folksonomies, are known to provide meaningful information that guide users and applications. We hypothesize that such hierarchical structures provide richer information if they are further enriched by incorporating additional links besides parents, and siblings, namely, between non-sibling nodes. We call such structure a networked hierarchy. Our empirical results indicate that such a networked hierarchy introduces interesting links between nodes (non-sibling) that otherwise in a hierarchical structure are not evident.
A study on the impact of product images on user clicks for online shopping BIBAFull-Text 45-46
  Anjan Goswami; Naren Chittar; Chung H. Sung
In this paper we study the importance of image based features on the click-through rate (CTR) in the context of a large scale product search engine. Typically product search engines use text based features in their ranking function. We present a novel idea of using image based features, common in the photography literature, in addition to text based features. We used a stochastic gradient boosting based regression model to learn relationships between features and CTR. Our results indicate statistically significant correlations between the image features and CTR. We also see improvements to NDCG and mean standard regression.
CELF++: optimizing the greedy algorithm for influence maximization in social networks BIBAFull-Text 47-48
  Amit Goyal; Wei Lu; Laks V. S. Lakshmanan
Kempe et al. [4] (KKT) showed the problem of influence maximization is NP-hard and a simple greedy algorithm guarantees the best possible approximation factor in PTIME. However, it has two major sources of inefficiency. First, finding the expected spread of a node set is #P-hard. Second, the basic greedy algorithm is quadratic in the number of nodes. The first source is tackled by estimating the spread using Monte Carlo simulation or by using heuristics [4, 6, 2, 5, 1, 3]. Leskovec et al. proposed the CELF algorithm for tackling the second. In this work, we propose CELF++ and empirically show that it is 35-55% faster than CELF.
Rolling boles, optimal XML structure integrity for updating operations BIBAFull-Text 49-50
  Sebastian Graf; Sebastian Kay Belle; Marcel Waldvogel
While multiple techniques exist to utilize the tree structure of the Extensible Markup Language (XML) regarding integrity checks, they all rely on adaptations of the Merkle Tree: All children are acting as one slice regarding the check-sum of one node with the help of an one-way hash concatenation. This results in postorder traversals regarding the (re-)computation of the integrity structure within modification operations. With our approach we perform nearly in-time updates of the entire integrity structure. We therefore equipped an XHash-based approach with an incremental hash function. This replaces postorder traversals by adapting only the incremental modifications to the check-sums of a node and its ancestors. With experimental results we prove that our approach only generates a constant overhead depending on the depth of the tree while native DOMHash implementations produce an overhead based on the depth and the number of all nodes in the tree. Consequently, our approach called Rolling Boles generates sustainable impact since it facilitates instant integrity updates in constant time.
SmartInt: using mined attribute dependencies to integrate fragmented web databases BIBAFull-Text 51-52
  Ravi Gummadi; Anupam Khulbe; Aravind Kalavagattu; Sanil Salvi; Subbarao Kambhampati
Many web databases can be seen as providing partial and overlapping information about entities in the world. To answer queries effectively, we need to integrate the information about the individual entities that are fragmented over multiple sources. At first blush this is just the inverse of traditional database normalization problem -- rather than go from a universal relation to normalized tables, we want to reconstruct the universal relation given the tables (sources). The standard way of reconstructing the entities will involve joining the tables. Unfortunately, because of the autonomous and decentralized way in which the sources are populated, they often do not have Primary Key -- Foreign Key relations. While tables do share attributes, direct joins over these shared attributes can result in reconstruction of many spurious entities thus seriously compromising precision. We present a unified approach that supports intelligent retrieval over fragmented web databases by mining and using inter-table dependencies. Experiments with the prototype implementation, SmartInt, show that its retrieval strikes a good balance between precision and recall.
Trust analysis with clustering BIBAFull-Text 53-54
  Manish Gupta; Yizhou Sun; Jiawei Han
Web provides rich information about a variety of objects. Trustability is a major concern on the web. Truth establishment is an important task so as to provide the right information to the user from the most trustworthy source. Trustworthiness of information provider and the confidence of the facts it provides are inter-dependent on each other and hence can be expressed iteratively in terms of each other. However, a single information provider may not be the most trustworthy for all kinds of information. Every information provider has its own area of competence where it can perform better than others. We derive a model that can evaluate trustability on objects and information providers based on clusters (groups). We propose a method which groups the set of objects for which similar set of providers provide "good" facts, and provides better accuracy in addition to high quality object clusters.
Automatic sanitization of social network data to prevent inference attacks BIBAFull-Text 55-56
  Raymond D. Heatherly; Murat Kantarcioglu
As the privacy concerns related to information release in social networks become a more mainstream concern, data owners will need to utilize a variety of privacy-preserving methods of examining this data. Here, we propose a method of data generalization that applies to social networks and present some initial findings for the utility/privacy tradeoff required for its use.
Predicting popular messages in Twitter BIBAFull-Text 57-58
  Liangjie Hong; Ovidiu Dan; Brian D. Davison
Social network services have become a viable source of information for users. In Twitter, information deemed important by the community propagates through retweets. Studying the characteristics of such popular messages is important for a number of tasks, such as breaking news detection, personalized message recommendation, viral marketing and others. This paper investigates the problem of predicting the popularity of messages as measured by the number of future retweets and sheds some light on what kinds of factors influence information propagation in Twitter. We formulate the task into a classification problem and study two of its variants by investigating a wide spectrum of features based on the content of the messages, temporal information, metadata of messages and users, as well as structural properties of the users' social graph on a large scale dataset. We show that our method can successfully predict messages which will attract thousands of retweets with good performance.
Automatically generating labels based on unified click model BIBAFull-Text 59-60
  Guichun Hua; Min Zhang; Yiqun Liu; Shaoping Ma; Liyun Ru
Ground truth labels are one of the most important parts in many test collections for information retrieval. Each label, depicting the relevance between a query-document pair, is usually judged by a human, and this process is time-consuming and labor-intensive. Automatically Generating labels from click-through data has attracted increasing attention. In this paper, we propose a Unified Click Model to predict the multi-level labels, which aims at comprehensively considering the advantages of the Position Models and Cascade Models. Experiments show that the proposed click model outperforms the existing click models in predicting the multi-level labels, and could replace the labels judged by humans for test collections.
Allocating inverted index into flash memory for search engines BIBFull-Text 61-62
  Bojun Huang; Zenglin Xia
Domain-independent entity extraction from web search query logs BIBAFull-Text 63-64
  Alpa Jain; Marco Pennacchiotti
Query logs of a Web search engine have been increasingly used as a vital source for data mining. This paper presents a study on large-scale domain-independent entity extraction from search query logs. We present a completely unsupervised method to extract entities by applying pattern-based heuristics and statistical measures. We compare against existing techniques that use Web documents as well as search logs, and show that we improve over the state of the art. We also provide an in-depth qualitative analysis outlining differences and commonalities between these methods.
Ranking in context-aware recommender systems BIBAFull-Text 65-66
  Minsuk Kahng; Sangkeun Lee; Sang-goo Lee
As context is acknowledged as an important factor that can affect users' preferences, many researchers have worked on improving the quality of recommender systems by utilizing users' context. However, incorporating context into recommender systems is not a simple task in that context can influence users' item preferences in various ways depending on the application. In this paper, we propose a novel method for context-aware recommendation, which incorporates several features into the ranking model. By decomposing a query, we propose several types of ranking features that reflect various contextual effects. In addition, we present a retrieval model for using these features, and adopt a learning to rank framework for combining proposed features. We evaluate our approach on two real-world datasets, and the experimental results show that our approach outperforms several baseline methods.
Ranking related entities for web search queries BIBAFull-Text 67-68
  Changsung Kang; Srinivas Vadrevu; Ruiqiang Zhang; Roelof van Zwol; Lluis Garcia Pueyo; Nicolas Torzec; Jianzhang He; Yi Chang
Entity ranking is a recent paradigm that refers to retrieving and ranking related objects and entities from different structured sources in various scenarios. Entities typically have associated categories and relationships with other entities. In this work, we present an extensive analysis of Web-scale entity ranking, based on machine learned ranking models using an ensemble of pairwise preference models. Our proposed system for entity ranking uses structured knowledge bases, entity relationship graphs and user data to derive useful features to facilitate semantic search with entities directly within the learning to rank framework. The experimental results are validated on a large-scale graph containing millions of entities and hundreds of millions of entity relationships. We show that our proposed ranking solution clearly improves a simple user behavior based ranking model.
GeoVisualRank: a ranking method of geotagged images considering visual similarity and geo-location proximity BIBFull-Text 69-70
  Hidetoshi Kawakubo; Keiji Yanai
Anytime algorithm for QoS web service composition BIBAFull-Text 71-72
  Hyunyoung Kil; Wonhong Nam
The QoS-aware web service composition (WSC) problem aims at the automatic construction of a composite web service with the optimal accumulated QoS value. It is, however, intractable to solve the QoS-aware WSC problem for large scale instances, since the problem corresponds to a global optimization problem. In this paper, we propose a novel anytime algorithm for the QoS-aware WSC problem to identify composite web services with high quality much earlier than an optimal algorithm and the beam stack search [3].
Smart news feeds for social networks using scalable joint latent factor models BIBAFull-Text 73-74
  Himabindu Lakkaraju; Angshu Rai; Srujana Merugu
Social networks such as Facebook and Twitter offer a huge opportunity to tap the collective wisdom (both published and yet to be published) of all the participating users in order to address the information needs of individual users in a highly contextualized fashion using rich user-specific information. Realizing this opportunity, however, requires addressing two key limitations of current social networks: (a) difficulty in discovering relevant content beyond the immediate neighborhood, (b) lack of support for information filtering based on semantics, content source and linkage.
   We propose a scalable framework for constructing smart news feeds based on predicting user-post relevance using multiple signals such as text content and attributes of users and posts, and various user-user, post-post and user-post relations (e.g. friend, comment, author relations). Our solution comprises of two steps where the first step ensures scalability by selecting a small set of user-post dyads with potentially interesting interactions using inverted feature indexes. The second step models the interactions associated with the selected dyads via a joint latent factor model, which assumes that the user/post content and relationships can be effectively captured by a common latent representation of the users and posts. Experiments on a Facebook dataset using the proposed model lead to improved precision/recall on relevant posts indicating potential for constructing superior quality news feeds.
Finding influential mediators in social networks BIBAFull-Text 75-76
  Cheng-Te Li; Shou-De Lin; Man-Kwan Shan
Given a social network, who are the key players controlling the bottlenecks of influence propagation if some persons would like to activate specific individuals? In this paper, we tackle the problem of selecting a set of k mediator nodes as the influential gateways whose existence determines the activation probabilities of targeted nodes from some given seed nodes. We formally define the k-Mediators problem. To have an effective and efficient solution, we propose a three-step greedy method by considering the probabilistic influence and the structural connectivity on the pathways from sources to targets. To the best of our knowledge, this is the first work to consider the k-Mediators problem in networks. Experiments on the DBLP co-authorship graph show the effectiveness and efficiency of the proposed method.
Hypergraph-based inductive learning for generating implicit key phrases BIBAFull-Text 77-78
  Decong Li; Sujian Li
This paper presents a novel approach to generate implicit key phrases which are ignored in previous researches. Recent researches prefer to extract key phrases with semi-supervised transductive learning methods, which avoid the problem of training data. In this paper, based on a transductive learning method, we formulate the phrases in the document as a hypergraph and expand the hypergraph to include implicit phrases, which are ranked by an inductive learning approach. The highest ranked phrases are seen as implicit key phrases, and experimental results demonstrate the satisfactory performance of this approach.
Open and decentralized access across location-based services BIBAFull-Text 79-80
  Yiming Liu; Rui Yang; Erik Wilde
Users now interact with multiple Location-Based Services (LBS) through a myriad set of location-aware devices and interfaces. However, current LBS tend to be centralized silos with ad-hoc APIs, which limits potential for information sharing and reuse. Further, LBS subscriptions and user experiences are not easily portable across devices. We propose a general architecture for providing open and decentralized access to LBS, based on Tiled Feeds -- a RESTful protocol for access and interactions with LBS using feeds, and Feed Subscription Management (FSM) -- a generalized feed-based service management protocol. We describe two client designs, and demonstrate how they enable standardized access to LBS services, promote information sharing and mashup creation, and offer service management across various types of location-enabled devices.
Personalized search on Flickr based on searcher's preference prediction BIBAFull-Text 81-82
  Dongyuan Lu; Qiudan Li
In this paper, we propose a personalized search model to assist users in obtaining interested photos on Flickr, which exploits the favorite marks of the searcher's friends to predict the searcher's preference on the returned photos. The proposed model utilizes a co-clustering method to extract latent interest dimensions from users' implicit interests, and employs a discriminative learning method to predict searcher's preference on the returned photos. Preliminary experiments demonstrate the improvement of the proposed model compared to existing one-fit-all methods and a user-based collaborative filtering method.
A classification based framework for concept summarization BIBAFull-Text 83-84
  Dhruv Kumar Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan
In this paper we propose a novel classification based framework for finding a small number of images summarizing a concept. Our method exploits metadata information available with the images to get the category information using Latent Dirichlet Allocation. We modify the import vector machine formulation based on kernel logistic regression to solve the underlying classification problem. We show that the import vectors provide a good summary satisfying important properties such as coverage, diversity and balance. Furthermore, the framework allows users to specify desired distributions over category, time etc, that a summary should satisfy. Experimental results show that the proposed method performs better than state-of-the-art summarization methods in terms of satisfying important visual and semantic properties.
A feature-pair-based associative classification approach to look-alike modeling for conversion-oriented user-targeting in tail campaigns BIBAFull-Text 85-86
  Ashish Mangalampalli; Adwait Ratnaparkhi; Andrew O. Hatch; Abraham Bagherjeiran; Rajesh Parekh; Vikram Pudi
Online advertising offers significantly finer granularity, which has been leveraged in state-of-the-art targeting methods, like Behavioral Targeting (BT). Such methods have been further complemented by recent work in Look-alike Modeling (LAM) which helps in creating models which are customized according to each advertiser's requirements and each campaign's characteristics, and which show ads to users who are most likely to convert on them, not just click them. In Look-alike Modeling given data about converters and nonconverters, obtained from advertisers, we would like to train models automatically for each ad campaign. Such custom models would help target more users who are similar to the set of converters the advertiser provides. The advertisers get more freedom to define their preferred sets of users which should be used as a basis to build custom targeting models.
   In behavioral data, the number of conversions (positive class) per campaign is very small (conversions per impression for the advertisers in our data set are much less than 10-4), giving rise to a highly skewed training dataset, which has most records pertaining to the negative class. Campaigns with very few conversions are called as tail campaigns, and those with many conversions are called head campaigns. Creation of Look-alike Models for tail campaigns is very challenging and tricky using popular classifiers like Linear SVM and GBDT, because of the very few number of positive class examples such campaigns contain. In this paper, we present an Associative Classification (AC) approach to LAM for tail campaigns. Pairs of features are used to derive rules to build a Rule-based Associative Classifier, with the rules being sorted by frequency-weighted log-likelihood ratio (F-LLR). The top k rules, sorted by F-LLR, are then applied to any test record to score it. Individual features can also form rules by themselves, though the number of such rules in the top k rules and the whole rule-set is very small. Our algorithm is based on Hadoop, and is thus very efficient in terms of speed.
Casting a web of trust over Wikipedia: an interaction-based approach BIBAFull-Text 87-88
  Silviu Maniu; Talel Abdessalem; Bogdan Cautis
We report in this short paper results on inferring a signed network (a "web of trust") from user interactions. On the Wikipedia network of contributors, from a collection of articles in the politics domain and their revision history, we investigate mechanisms by which relationships between contributors -- in the form of signed directed links -- can be inferred from their interactions. Our preliminary study provides valuable insight into principles underlying a signed network of Wikipedia contributors that is captured by social interaction. We look into whether this network (called hereafter WikiSigned) represents indeed a plausible configuration of link signs. We assess connections to social theories such as structural balance and status, which have already been considered in online communities. We also evaluate on this network the accuracy of a learned predictor for edge signs. Equipped with learning techniques that have been applied before on three explicit signed networks, we obtain good accuracy over the WikiSigned edges. Moreover, by cross training-testing we obtain strong evidence that our network does reveal an implicit signed configuration and that it has similar characteristics to the explicit ones, even though it is inferred from interactions. We also report on an application of the resulting signed network that impacts Wikipedia readers, namely the classification of Wikipedia articles by importance and quality.
Mobile topigraphy: large-scale tag cloud visualization for mobiles BIBAFull-Text 89-90
  Tatsushi Matsubayashi; Katsuhiko Ishiguro
We introduce a new mobile topigraphy system that uses the contour map metaphor to display large-scale tag clouds. We introduce the technical issues for topigraphy, and recent requirements for and developments in mobile interfaces. We also present some applications for our mobile topigraphy system and describe the assessment on two initial applications.
Unsupervised query segmentation using only query logs BIBAFull-Text 91-92
  Nikita Mishra; Rishiraj Saha Roy; Niloy Ganguly; Srivatsan Laxman; Monojit Choudhury
We introduce an unsupervised query segmentation scheme that uses query logs as the only resource and can effectively capture the structural units in queries. We believe that Web search queries have a unique syntactic structure which is distinct from that of English or a bag-of-words model. The segments discovered by our scheme help understand this underlying grammatical structure. We apply a statistical model based on Hoeffding's Inequality to mine significant word n-grams from queries and subsequently use them for segmenting the queries. Evaluation against manually segmented queries shows that this technique can detect rare units that are missed by our Pointwise Mutual Information (PMI) baseline.
Detecting group review spam BIBAFull-Text 93-94
  Arjun Mukherjee; Bing Liu; Junhui Wang; Natalie Glance; Nitin Jindal
It is well-known that many online reviews are not written by genuine users of products, but by spammers who write fake reviews to promote or demote some target products. Although some existing works have been done to detect fake reviews and individual spammers, to our knowledge, no work has been done on detecting spammer groups. This paper focuses on this task and proposes an effective technique to detect such groups.
A self organizing document map algorithm for large scale hyperlinked data inspired by neuronal migration BIBAFull-Text 95-96
  Kotaro Nakayama; Yutaka Matsuo
Web document clustering is one of the research topics that is being pursued continuously due to the large variety of applications. Since Web documents usually have variety and diversity in terms of domains, content and quality, one of the technical difficulties is to find a reasonable number and size of clusters. In this research, we pay attention to SOMs (Self Organizing Maps) because of their capability of visualized clustering that helps users to investigate characteristics of data in detail. The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records. However, it is effective only when the vectors are small and dense. Although several research efforts on making the SOM scalable have been conducted, technical issues on scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a recent discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. In this paper, we describe the algorithm and implementation, and show the practicality of the algorithm by applying MIGSOM to a huge scale real data set: Wikipedia's hyperlink data.
Collaborative classification over P2P networks BIBAFull-Text 97-98
  Odysseas Papapetrou; Wolf Siberski; Stefan Siersdorfer
We propose a novel collaborative approach for distributed document classification, combining the knowledge of multiple users for improved organization of data such as individual document repositories or emails. The approach builds on top of a P2P network and outperforms the state of the art approaches in collaborative classification.
Generalized fact-finding BIBAFull-Text 99-100
  Jeff Pasternack; Dan Roth
Once information retrieval has located a document, and information extraction has provided its contents, how do we know whether we should actually believe it? Fact-finders are a state-of-the-art class of algorithms that operate in a manner analogous to Kleinberg's Hubs and Authorities, iteratively computing the trustworthiness of an information source as a function of the believability of the claims it makes, and the believability of a claim as a function of the trustworthiness of those sources asserting it. However, as fact-finders consider only "who claims what", they ignore a great deal of relevant background and contextual information. We present a framework for "lifting" (generalizing) the fact-finding process, allowing us to elegantly incorporate knowledge such as the confidence of the information extractor and the attributes of the information sources. Experiments demonstrate that leveraging this information significantly improves performance over existing, "unlifted" fact-finding algorithms.
Investigating topic models for social media user recommendation BIBAFull-Text 101-102
  Marco Pennacchiotti; Siva Gurumurthy
This paper presents a user recommendation system that recommends to a user new friends having similar interests. We automatically discover users' interests using Latent Dirichlet Allocation (LDA), a linguistic topic model that represents users as mixtures of topics. Our system is able to recommend friends for 4 million users with high recall, outperforming existing strategies based on graph analysis.
On using the real-time web for news recommendation & discovery BIBAFull-Text 103-104
  Owen Phelan; Kevin McCarthy; Mike Bennett; Barry Smyth
In this work we propose that the high volumes of data on real-time networks like Twitter can be harnessed as a useful source of recommendation knowledge. We describe Buzzer, a news recommendation system that is capable of adapting to the conversations that are taking place on Twitter. Buzzer uses a content-based approach to ranking RSS news stories by mining trending terms from both the public Twitter timeline and from the timeline of tweets generated by a user's own social graph (friends and followers). We also describe the result of a live-user trial which demonstrates how these ranking strategies can add value to conventional RSS ranking techniques, which are largely recency-based.
Extracting events and event descriptions from Twitter BIBAFull-Text 105-106
  Ana-Maria Popescu; Marco Pennacchiotti; Deepa Paranjpe
This paper describes methods for automatically detecting events involving known entities from Twitter and understanding both the events as well as the audience reaction to them. We show that NLP techniques can be used to extract events, their main actors and the audience reactions with encouraging results.
Understanding the functions of business accounts on Twitter BIBAFull-Text 107-108
  Ana-Maria Popescu; Alpa Jain
This paper performs an initial exploration of business Twitter accounts in order to start understanding how businesses interact with their users and vice versa. We provide an analysis of business tweet types and topics and show that specific business tweet classes such as deals and events can be reliably identified for customer use.
A framework for evaluating network measures for functional importance BIBAFull-Text 109-110
  Tieyun Qian; Qing Li; Jaideep Srivastava
Many metrics such as degree, closeness, and PageRank have been introduced to determine the relative importance of a node within a network. The desired function of a network, however, is domain-specific. For example, the robustness can be crucial for a communication network, while efficiency is more preferred for fast spreading of advertisements in viral marketing. The information provided by some widely used measures are often conflicting under such varying demands. In this paper, we present a novel framework for evaluating network metrics regarding typical functional requirements. We also propose an analysis of five well established measures to compare their performance of ranking nodes on functional importance in a real-life network.
Comparative study of clustering techniques for short text documents BIBAFull-Text 111-112
  Aniket Rangrej; Sayali Kulkarni; Ashish V. Tendulkar
We compare various document clustering techniques including K-means, SVD-based method and a graph-based approach and their performance on short text data collected from Twitter. We define a measure for evaluating the cluster error with these techniques. Observations show that graph-based approach using affinity propagation performs best in clustering short text data with minimal cluster error.
Influence and passivity in social media BIBAFull-Text 113-114
  Daniel M. Romero; Wojciech Galuba; Sitaram Asur; Bernardo A. Huberman
The ever-increasing amount of information flowing through Social Media forces the members of these networks to compete for attention and influence by relying on other people to spread their message. A large study of information propagation within Twitter reveals that the majority of users act as passive information consumers and do not forward the content to the network. Therefore, in order for individuals to become influential they must not only obtain attention and thus be popular, but also overcome user passivity. We propose an algorithm that determines the influence and passivity of users based on their information forwarding activity. An evaluation performed with a 2.5 million user dataset shows that our influence measure is a good predictor of URL clicks, outperforming several other measures that do not explicitly take user passivity into account. We demonstrate that high popularity does not necessarily imply high influence and vice-versa.
Web information extraction using Markov logic networks BIBAFull-Text 115-116
  Sandeepkumar Satpal; Sahely Bhadra; Sundararajan Sellamanickam; Rajeev Rastogi; Prithviraj Sen
In this paper, we consider the problem of extracting structured data from web pages taking into account both the content of individual attributes as well as the structure of pages and sites. We use Markov Logic Networks (MLNs) to capture both content and structural features in a single unified framework, and this enables us to perform more accurate inference. We show that inference in our information extraction scenario reduces to solving an instance of the maximum weight subgraph problem. We develop specialized procedures for solving the maximum subgraph variants that are far more efficient than previously proposed inference methods for MLNs that solve variants of MAX-SAT. Experiments with real-life datasets demonstrate the effectiveness of our approach.
Towards identifying arguments in Wikipedia pages BIBAFull-Text 117-118
  Hoda Sepehri Rad; Denilson Barbosa
Wikipedia is one of the most widely used repositories of human knowledge today, contributed mostly by a few hundred thousand regular editors. In this open environment, inevitably, differences of opinion arise among editors of the same article. Especially for polemical topics such as religion and politics, difference of opinions among editors may lead to intense "edit wars" in which editors compete to have their opinions and points of view accepted. While such disputes can compromise the reliability of the article (or at least portions of it), they are recorded in the edit history of the articles. We posit that exposing such disputes to the reader, and pointing to the portions of the text where they manifest most prominently can be beneficial in helping concerned readers in understanding such topics. In this paper, we discuss our initial efforts towards the problem of automatic evaluation of extracting controversial points in Wikipedia pages.
How to choose combinations in a join of search results BIBAFull-Text 119-120
  Mirit Shalem; Yaron Kanza
We present novel measures for estimating the effectiveness of duplication-removal operations over a join of ranked lists. We introduce a duplication-removal approach, namely optimality rank, that outperforms existing approaches, according to the new measures.
REACTOR: a framework for semantic relation extraction and tagging over enterprise data BIBAFull-Text 121-122
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang; Conglei Yao
Relation extraction from Web data has attracted a lot of attention in recent years. However, little work has been done when it comes to relation extraction from enterprise data regardless of the urgent needs to such work in real applications (e.g., E-discovery). In this paper, we propose a novel unsupervised hybrid framework, called REACTOR (abbreviated for a fRamework for sEmantic relAtion extraCtion and Tagging Over enteRprise data). We evaluate REACTOR over a real-world enterprise data set and empirical results show the effectiveness of REACTOR.
Harnessing the wisdom of crowds: video event detection based on synchronous comments BIBAFull-Text 123-124
  Xingtian Shi; Zhenglu Yang; Masashi Toyoda; Masaru Kitsuregawa
With the recent explosive growth of the number of videos on the Web, it becomes more important to facilitate users' demand for locating their preferred event clips in the lengthy and voluminous programs. Although there has been a great deal of study on generic event detection in recent years, the performance of existing approaches is still far from satisfactory. In this paper, we propose an integrated framework for general event detection. The key idea is that we utilize the synchronous comments to segment the video into clips with semantic text analysis, while taking into account the relationship between the users who write the comments. By borrowing the power of "the wisdom of crowds", we experimentally demonstrate that our approach can effectively detect video events.
ReadAlong: reading articles and comments together BIBAFull-Text 125-126
  Dyut Kumar Sil; Srinivasan H. Sengamedu; Chiranjib Bhattacharyya
We propose a new paradigm for displaying comments: showing comments alongside parts of the article they correspond to. We evaluate the effectiveness of various approaches for this task and show that a combination of bag of words and topic models performs the best.
Effective summarization of large collections of personal photos BIBFull-Text 127-128
  Pinaki Sinha; Sharad Mehrotra; Ramesh Jain
Learning to tokenize web domains BIBAFull-Text 129-130
  Sriram Srinivasan; Sourangshu Bhattachaya
Domain Match is an Internet monetization product offered by web companies like Yahoo! The product offers display of ads and search results, when a user requests a webpage from a domain which is non-existent or does not have any content. This product earns significant amount of advertising revenue for major internet companies like Yahoo! Hence it is an important product receiving millions of queries per day. Domain Match (DM) works by tokenizing the input domains and sub-folders into keywords and then displaying ads and search results queried on the keywords. In this poster, we describe a machine learning based solution, which automatically learns to tokenize new domains, given a training dataset containing a set of domains and their tokenizations. We use positional frequency and parts of speech as features for scoring tokens. Tokens are scored combined using various scoring models. We compare two ways of training the models: a simple gain function based training and a large margin training. Experimental results are encouraging.
Coverage patterns for efficient banner advertisement placement BIBAFull-Text 131-132
  Bhargav Sripada; Krishna Reddy Polepalli; Uday Kiran Rage
In an online banner advertising scenario, an advertiser expects that the banner advertisement should be displayed to certain percentage of web site visitors. In this context, to generate more revenue for a given web site, the publisher has to meet the demands of several advertisers by providing appropriate sets of web pages. To help the publishers and advertisers, in this paper, we propose a model of coverage patterns and a methodology to extract potential coverage patterns by analyzing click stream data. Given web pages of a site, a coverage pattern is a set of web pages visited by a certain percentage of visitors. The proposed approach has the potential to enable the publisher in meeting the demands of several advertisers. The efficiency and advantages of the proposed approach is shown by conducting experiments on real world data sets.
Using complex network features for fast clustering in the web BIBAFull-Text 133-134
  Jintao Tang; Ting Wang; Ji Wang; Qin Lu; Wenjie Li
Applying graph clustering algorithms to real-world networks must overcome two main challenges: the lack of prior knowledge and scalability. This paper proposes a novel method based on the topological features of complex networks to optimize clustering algorithms for real-world networks. More specifically, the features are used for parameter estimation and performance optimization. The proposed method is evaluated on real-world networks extracted from the web. Experimental results show improvement both in Adjusted Rand index values and in runtime efficiency.
Identifying primary content from web pages and its application to web search ranking BIBAFull-Text 135-136
  Srinivas Vadrevu; Emre Velipasaoglu
Web pages are usually highly structured documents. In some documents, content with different functionality is laid out in blocks, some merely supporting the main discourse. In other documents, there may be several blocks of unrelated main content. Indexing a web page as if it were a linear document can cause problems because of the diverse nature of its content. If the retrieval function treats all blocks of the web page equally without attention to structure, it may lead to irrelevant query matches. In this paper, we describe how content quality of different blocks of a web page can be utilized to improve a retrieval function. Our method is based on segmenting a web page into semantically coherent blocks and learning a predictor of segment content quality. We also describe how to use segment content quality estimates as weights in the BM25F formulation. Experimental results show our method improves relevance of retrieved results by as much as 4.5% compared to BM25F that treats the body of a web page as a single section, and by a larger margin of over 9% for difficult queries.
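As a rough illustration of using segment content quality estimates as weights in a BM25F-style function, here is a minimal sketch assuming per-segment quality scores are given. The segment-quality predictor and the paper's exact formulation are not reproduced; this only shows the shape of a quality-weighted term frequency feeding a BM25 saturation.

```python
import math

def bm25f_like(query_terms, segments, df, n_docs, avg_len, k1=1.2, b=0.75):
    """segments: list of {'quality': float, 'tf': {term: count}, 'length': int};
       df: document frequency per term; n_docs: collection size."""
    doc_len = sum(seg['length'] for seg in segments)
    score = 0.0
    for term in query_terms:
        # Pool term frequency across segments, weighted by predicted quality.
        wtf = sum(seg['quality'] * seg['tf'].get(term, 0) for seg in segments)
        norm = wtf / (1.0 + b * (doc_len / avg_len - 1.0))  # length normalization
        idf = math.log(1 + (n_docs - df.get(term, 0) + 0.5)
                           / (df.get(term, 0) + 0.5))
        score += idf * norm / (k1 + norm)  # BM25-style saturation
    return score
```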
A non-syntactic approach for text sentiment classification with stopwords BIBAFull-Text 137-138
  Suresh Venkatasubramanian; Ashok Veilumuthu; Avanthi Krishnamurthy; Veni Madhavan C. E; Kaushik Nath; Sunil Arvindam
The present approach uses stopwords and the gaps that occur between successive stopwords -- formed by content words -- as features for sentiment classification.
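A rough sketch of the feature idea, under the assumption that a "gap" is simply the length of the content-word run between successive stopwords (the stopword list and feature encoding are illustrative, not the paper's):

```python
STOPWORDS = {'the', 'a', 'is', 'of', 'and', 'to', 'it', 'this', 'not'}

def stopword_gap_features(text):
    """Emit the stopword sequence and the content-word run lengths between them."""
    features, gap = [], 0
    for word in text.lower().split():
        if word in STOPWORDS:
            features.append(('gap', gap))   # length of content-word run
            features.append(('stop', word))
            gap = 0
        else:
            gap += 1
    features.append(('gap', gap))
    return features

print(stopword_gap_features('this movie is not worth the ticket price'))
# [('gap', 0), ('stop', 'this'), ('gap', 1), ('stop', 'is'),
#  ('gap', 0), ('stop', 'not'), ('gap', 1), ('stop', 'the'), ('gap', 2)]
```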
Evaluation of valuable user generated content on social news web sites BIBAFull-Text 139-140
  Yana Volkovich; Andreas Kaltenbrunner
Social news websites have gained significant popularity in the last few years. Participants of such websites can not only share news links but also annotate, evaluate, and comment on them. To quantify the interestingness and attractiveness of user-generated content with respect to the original link source, we introduce the User Generated Content add-on (UGC+) index. Based on the definition of UGC+, we also propose a concept for comparing groups of links filtered by different properties, e.g., authorship or topic categories. We apply the proposed measure to the Spanish Digg clone Menéame.
Is pay-per-click efficient?: an empirical analysis of click values BIBAFull-Text 141-142
  Dakan Wang; Gang Wang; Pinyan Lu; Yajun Wang; Zheng Chen; Botao Hu
Current sponsored search auctions adopt per-click bidding, which implicitly assumes that an advertiser treats all clicks as equally valuable. This is not always true in real-world situations: clicks that lead to conversions are clearly more valuable than fraudulent clicks. In this work, we use post-ad-click behavior to measure a click's value and empirically show that, for an advertiser, the values of different clicks vary widely. Thus, for many clicks, the advertiser's single bid does not reflect his true valuation. This indicates that the sponsored search system under the PPC mechanism is not efficient, i.e., it does not always give a slot to the advertiser who needs it most.
Scalable spatio-temporal knowledge harvesting BIBAFull-Text 143-144
  Yafang Wang; Bin Yang; Spyros Zoupanos; Marc Spaniol; Gerhard Weikum
Knowledge harvesting enables the automated construction of large knowledge bases. In this work, we make a first attempt to harvest spatio-temporal knowledge from news archives to construct trajectories of individual entities for spatio-temporal entity tracking. Our approach consists of an entity extraction and disambiguation module and a fact generation module, which together produce pertinent trajectory records from textual sources. An evaluation on 20 years of New York Times articles shows that our methods are effective and scalable.
Growing parallel paths for entity-page discovery BIBAFull-Text 145-146
  Tim Weninger; Fabio Fumarola; Cindy Xide Lin; Rick Barber; Jiawei Han; Donato Malerba
In this paper, we use the structural and relational information on the Web to find entity-pages. Specifically, given a Web site and an entity-page (e.g., department and faculty member homepage) we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department). To do this, we propose a web structure mining method which grows parallel paths through the web graph and DOM trees. We show that by utilizing these parallel paths we can efficiently discover all entity-pages of the same type. Finally, we demonstrate the accuracy of our method with a case study on various domains.
Finding our way on the web: exploring the role of waypoints in search interaction BIBAFull-Text 147-148
  Ryen W. White; Adish Singla
Information needs are rarely satisfied directly on search engine result pages. Searchers usually need to click through to search results (landing pages) and follow search trails beyond those pages to fulfill information needs. We use the term waypoints to describe pages visited by searchers between the trail origin (the landing page) and the trail destination. The role that waypoints play in search interaction is poorly understood yet can be vital in determining search success. In this poster we analyze log data to determine the arrangement and function of waypoints, and study how these are affected by variations in information goals. Our findings have implications for understanding search behavior and for the design of interactive search support based on waypoints.
An adaptive ontology-based approach to identify correlation between publications BIBAFull-Text 149-150
  Huayu Wu; Hideaki Takeda; Masahiro Hamasaki; Tok Wang Ling; Liang Xu
In this paper, we propose an adaptive ontology-based approach for related-paper identification, to meet most researchers' practical needs. By searching the ontology, we can return a diverse set of papers that are explicitly and implicitly related to an input paper. Moreover, our approach does not rely on a known ontology. Instead, we build and update an ontology for a collection in any domain of interest. Being independent of known ontologies, our approach adapts readily to different domains.
Mining collective local knowledge from Google MyMaps BIBAFull-Text 151-152
  Shaomei Wu; Shenwei Liu; Dan Cosley; Michael Macy
The emerging popularity of location-aware devices and location-based services has generated a growing archive of digital traces of people's activities and opinions in physical space. In this study, we leverage geo-referenced user-generated content from Google MyMaps to discover collective local knowledge and understand the differing perceptions of urban space. Working with the large collection of publicly available, annotation-rich MyMaps data, we propose a highly parallelizable approach in order to merge identical places, discover landmarks, and recommend places. Additionally, we conduct interviews with New York City residents/visitors to validate the quantitative findings.
A kernel approach to addressing term mismatch BIBAFull-Text 153-154
  Jun Xu; Wei Wu; Hang Li; Gu Xu
This paper addresses the problem of term mismatch in web search using 'blending'. In blending, the input query as well as queries similar to it are used to retrieve documents, and the resulting document rankings with respect to the different queries are combined to generate a new ranking list. We propose a principled approach to blending, using a kernel method and click-through data. Our approach consists of three elements: a way of calculating query similarity using click-through data; a mixture model that combines rankings using relevance, query similarity, and document similarity scores; and an algorithm for learning the weights of the blending model based on the kernel method. Large-scale experiments on web search and enterprise search data sets show that our approach can effectively address the term mismatch problem and significantly outperform the baseline methods of query expansion and heuristic blending.
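A minimal sketch of the blending step itself: documents retrieved for the input query and for similar queries are merged, with each query's contribution weighted by its similarity to the input. The paper learns these weights with a kernel method over click-through data; here they are simply given as inputs, so this is an assumed simplification rather than the authors' model.

```python
def blend(rankings, query_sim):
    """rankings: {query: {doc: relevance_score}};
       query_sim: {query: similarity to the input query}."""
    blended = {}
    for q, docs in rankings.items():
        w = query_sim.get(q, 0.0)
        for doc, score in docs.items():
            blended[doc] = blended.get(doc, 0.0) + w * score
    return sorted(blended, key=blended.get, reverse=True)

rankings = {'ny times': {'nytimes.com': 2.0, 'nypost.com': 0.5},
            'new york times': {'nytimes.com': 2.5, 'wikipedia.org': 1.0}}
print(blend(rankings, {'ny times': 1.0, 'new york times': 0.8}))
# ['nytimes.com', 'wikipedia.org', 'nypost.com']
```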
A probabilistic model for opinionated blog feed retrieval BIBAFull-Text 155-156
  Xueke Xu; Tao Meng; Xueqi Cheng; Yue Liu
In this poster, we study the problem of Opinionated Blog Feed Retrieval, which can be considered a particular type of the faceted blog distillation task introduced at TREC 2009. The task is to find blogs that not only have a principal and recurring interest in a given topic but also show a clear inclination towards expressing opinions on it. We propose a novel probabilistic model for this task which combines its two factors, topical relevance and opinionatedness, in a unified probabilistic framework. Experiments conducted in the context of the TREC 2009 & 2010 Blog Tracks show the effectiveness of the proposed model.
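The abstract does not spell out how the two factors are combined; as a loose illustration only (an assumed factored form, not the paper's model), a feed could be scored by geometrically interpolating the two probabilities:

```python
def opinionated_score(p_topic, p_opinion, alpha=0.5):
    """Geometric interpolation of topical relevance and opinionatedness."""
    return (p_topic ** alpha) * (p_opinion ** (1 - alpha))

feeds = {'blogA': (0.9, 0.2), 'blogB': (0.7, 0.8), 'blogC': (0.3, 0.9)}
ranked = sorted(feeds, key=lambda f: opinionated_score(*feeds[f]),
                reverse=True)
print(ranked)  # ['blogB', 'blogC', 'blogA'] -- relevance alone would rank blogA first
```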
A finegrained digestion of news webpages through Event Snippet Extraction BIBAFull-Text 157-158
  Rui Yan; Liang Kong; Yu Li; Yan Zhang; Xiaoming Li
We describe a framework to digest news webpages at a finer granularity: extracting event snippets from their contexts. "Events" are atomic text snippets, and a news article is made up of more than one event snippet. Event Snippet Extraction (ESE) aims to mine these snippets out. The problem is important because its solutions can be applied to many information mining and retrieval tasks. The challenge is to exploit rich features to detect snippet boundaries, including various semantic, syntactic, and visual features. We run experiments to demonstrate the effectiveness of our approach.
Caching intermediate result of SPARQL queries BIBAFull-Text 159-160
  Mengdong Yang; Gang Wu
The complexity and growing scale of RDF data have made the data management back end the performance bottleneck of Semantic Web applications. Caching is one way to address this problem; however, little existing research focuses on caching in RDF data processing. We present an adaptive caching scheme that caches the intermediate results of basic graph pattern SPARQL queries. Benchmark results illustrate the effectiveness of our caching scheme.
Autopedia: automatic domain-independent Wikipedia article generation BIBAFull-Text 161-162
  Conglei Yao; Xu Jia; Sicong Shou; Shicong Feng; Feng Zhou; Hongyan Liu
This paper proposes a general framework, named Autopedia, to generate high-quality Wikipedia articles for given concepts in any domain, by automatically selecting the best wikipedia template, consisting of the sub-topics used to organize the article for the input concept. Experimental results on 4,526 concepts validate the effectiveness of Autopedia; the template selection approach that takes into account both template quality and the semantic relatedness between the input concept and its sibling concepts performs best.
Location relevance classification for travelogue digests BIBAFull-Text 163-164
  Mao Ye; Rong Xiao; Wang-Chien Lee; Xing Xie
In this paper, we aim to develop a travelogue service that discovers and conveys travelogue digests, in the form of theme locations and geographical scope, to readers. In this service, theme locations in a travelogue are the core information to discover. Due to the inherent ambiguity of location relevance, we explore textual (e.g., surrounding words) and geographical (e.g., geographical relationships among locations) features of locations to perform location relevance classification for theme location discovery. Finally, we conduct comprehensive experiments on collected travelogues to evaluate the performance of our location relevance classification technique and demonstrate the effectiveness of the travelogue service.
Mobile search pattern evolution: the trend and the impact of voice queries BIBAFull-Text 165-166
  Jeonghe Yi; Farzin Maghoul
In this paper we study the characteristics of search queries submitted from mobile devices via Yahoo! Search for Mobile during a two-month period in early 2010, and compare the results with a similar study conducted in late 2007. The major findings are that 1) mobile search queries have become much more diverse, and 2) user interests and information needs have changed substantially, at least in some search topics, including adult and local-intent queries. In addition, we investigate the impact of the voice query interface offered by Yahoo!'s mobile search service, examining how unstructured spoken queries differ from conventional search queries.
Exploiting session-like behaviors in tag prediction BIBAFull-Text 167-168
  Dawei Yin; Liangjie Hong; Brian D. Davison
In social bookmarking systems, existing methods in tag prediction have shown that the performance of prediction can be significantly improved by modeling users' preferences. However, these preferences are usually treated as constant over time, neglecting the temporal factor within users' behaviors. In this paper, we study the problem of session-like behavior in social tagging systems and demonstrate that the predictive performance can be improved by considering sessions. Experiments, conducted on three public datasets, show that our session-based method can outperform baselines and two state-of-the-art algorithms significantly.
On computing text-based similarity in scientific literature BIBAFull-Text 169-170
  Seok-Ho Yoon; Sang-Wook Kim; Ji-Soo Kim; Won-Seok Hwang
This paper addresses the computation of similarity among papers using text-based measures. First, we analyze the accuracy of similarities computed using different parts of a paper, and then propose a Keyword-Extension method, which is very useful when text information is incomplete.
Hierarchical organization of unstructured consumer reviews BIBAFull-Text 171-172
  Jianxing Yu; Zheng-Jun Zha; Meng Wang; Tat-Seng Chua
In this paper, we propose to organize the aspects of a specific product into a hierarchy by simultaneously taking advantage of domain structure knowledge and consumer reviews. Based on the derived hierarchy, we generate a hierarchical organization of the consumer reviews around the various aspects of the product, and aggregate consumer opinions on those aspects. With such a hierarchical organization, people can easily grasp an overview of consumer reviews and opinions on various aspects, and can seek reviews and opinions on any specific aspect by navigating through the hierarchy. We conduct an evaluation on two product review data sets: Liu et al.'s data set containing 314 reviews for five products [2], and our review corpus collected from forum Web sites, containing 60,786 reviews for five popular products. The experimental results demonstrate the effectiveness of our approach.
The freshman handbook: a hint for the server placement of social networks BIBAFull-Text 173-174
  Ying Zhang; Mallik Tatipamula
There has been a recent unprecedented increase in the use of Online Social Networks (OSNs) to expand our social lives, exchange information, and share common interests. Many popular OSNs today, such as Facebook, Twitter, and Buzz, attract hundreds of millions of users who share tremendous amounts of data. Given the huge business opportunities OSNs may bring, more and more new social applications have emerged on the Internet. For these newcomers to the social network business, one of the first key decisions is where to deploy computational resources to best accommodate future client requests. In this work, we aim to provide useful suggestions to new social network providers (the "freshmen") on intelligent server placement, by exploring publicly available information from existing social network communities. We first propose three scalable server placement strategies for OSNs. Our solution can scalably select server locations among all possible locations while reducing the cost of inter-user data sharing.
Leveraging auxiliary text terms for automatic image annotation BIBAFull-Text 175-176
  Ning Zhou; Yi Shen; Jinye Peng; Xiaoyi Feng; Jianping Fan
This paper proposes a novel algorithm to annotate web images by automatically aligning the images with their most relevant auxiliary text terms. First, the DOM-based web page segmentation is performed to extract images and their most relevant auxiliary text blocks. Second, automatic image clustering is used to partition the web images into a set of groups according to their visual similarity contexts, which significantly reduces the uncertainty on the relatedness between the images and their auxiliary terms. The semantics of the visually-similar images in the same cluster are then described by the same ranked list of terms which frequently co-occur in their text blocks. Finally, a relevance re-ranking process is performed over a term correlation network to further refine the ranked term list. Our experiments on a large-scale database of web pages have provided very positive results.

Demo session

OntoTrix: a hybrid visualization for populated ontologies BIBAFull-Text 177-180
  Benjamin Bach; Emmanuel Pietriga; Ilaria Liccardi; Gennady Legostaev
Most Semantic Web data visualization tools structure the representation according to the concept definitions and interrelations that constitute the ontology's vocabulary. Instances are often treated as somewhat peripheral information, when considered at all. These instances, which populate ontologies, represent an essential part of any knowledge base, and are often orders of magnitude more numerous than the concept definitions that give them machine-processable meaning. We present a visualization technique designed to enable users to visualize large instance sets and the relations that connect them. This hybrid visualization uses both node-link and adjacency matrix representations of graphs to visualize different parts of the data depending on their semantic and local structural properties, exploiting ontological knowledge to drive the graph layout. The representation is embedded in an environment that features advanced interaction techniques for easy navigation, including support for smooth continuous zooming and coordinated views.
Factal: integrating deep web based on trust and relevance BIBAFull-Text 181-184
  Raju Balakrishnan; Subbarao Kambhampati
We demonstrate "Factal" -- a system for integrating deep web sources. Factal is based on the recently introduced source selection method "SourceRank"; which is a measure of trust and relevance based on the agreement between the sources. SourceRank selects popular and trustworthy sources from autonomous and open collections like the deep web. This trust and popularity awareness distinguishes Factal from the existing systems like Google Product Search. Factal selects and searches active online databases on multiple domains. The demonstration scenarios include improved trustworthiness, relevance of results, and comparison shopping. We believe that by incorporating effective source selection based on the SourceRank, Factal demonstrates a significant step towards a deep-web-scale integration system.
Automatically building probabilistic databases from the web BIBAFull-Text 185-188
  Lorenzo Blanco; Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti
A significant number of web sites publish structured data about recognizable concepts (such as stock quotes, movies, restaurants, etc.), offering a great opportunity to create applications that rely on a huge amount of data taken from the Web. We present an automatic, domain-independent system that performs all the steps required to benefit from these data: it discovers data-intensive web sites containing information about an entity of interest, extracts and integrates the published data, and finally performs a probabilistic analysis to characterize the impreciseness of the data and the accuracy of the sources. The results of this processing can be used to populate a probabilistic database.
Exploratory search in multi-domain information spaces with liquid query BIBAFull-Text 189-192
  Alessandro Bozzon; Marco Brambilla; Stefano Ceri; Piero Fraternali; Salvatore Vadacca
Search Computing (SeCo) aims at building search applications that bridge the gap between general-purpose and vertical search engines. SeCo queries extract ranked information about several interconnected domains, such as "hotels", "restaurants" or "concerts", by interacting with Web data sources which are wrapped as search services; an example of query is: "Find a good Jazz concert close to the user's current location, together with close-by good restaurants and hotels". The SeCo system supports the deployment of search applications, by providing a generic software architecture and the tools for service and query registration, for query formulation and execution, and for result browsing. In this demo paper, we focus on the Liquid Query (LQ) interface which supports the iteration over query formulation, result visualization and query refinement, with commands for perusing the result set, changing the visualization of data based on their type (e.g., geographical or temporal) and interacting with the remote search services. It also supports an exploratory search approach, where the user starts by accessing one data source (e.g., an event listing for finding interesting concerts), then is assisted in progressively joining other correlated sources in an interactive exploration of the search space. The exploration paths can be chosen on the fly and the navigation history can be browsed back and forth for cross-checking the retrieved options.
LivePulse: tapping social media for sentiments in real-time BIBAFull-Text 193-196
  Malu Castellanos; Riddhiman Ghosh; Yue Lu; Lei Zhang; Perla Ruiz; Mohamed Dekhil; Umeshwar Dayal; Meichun Hsu
The rise of Twitter, blogs, review sites and social sites has motivated people to express their opinions publicly and more frequently than ever before. This has fueled the emerging field known as sentiment analysis, whose goal is to translate the vagaries of human emotion into hard data. LivePulse is a tool that taps into the growing business interest in what is being said online, with the particular characteristic of doing so in real time. LivePulse integrates novel algorithms for sentiment analysis and a configurable dashboard with different kinds of dynamic charts that change as new data is ingested. It also provides support to drill down and visually explore the sentiment scores to understand how they were computed and what emotions are expressed about a given aspect or topic. Our tool has been researched and prototyped at HP Labs in close interaction with internal and external customers, whose valuable feedback has been crucial for improving the tool. This paper presents an overview of LivePulse's architecture and functionality, and illustrates how it would be demoed.
Filtering microblogging messages for social tv BIBAFull-Text 197-200
  Ovidiu Dan; Junlan Feng; Brian Davison
Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review. Manufacturers of set-top boxes and televisions have recently started to integrate access to social networks into their products. Some of these systems allow users to read microblogging messages related to the TV program they are currently watching. However, such systems suffer from low precision and recall when they simply use the title of the show as keywords to retrieve messages, without additional filtering.
   We propose a bootstrapping approach to collecting microblogging messages related to a given TV program. We start with a small set of annotated data, in which, for a given show and a candidate message, we annotate the pair to be relevant or irrelevant. From this annotated data set, we train an initial classifier. The features are designed to capture the association between the TV program and the message. Using our initial classifier and a large dataset of unlabeled messages we derive broader features for a second classifier to further improve precision.
T-RecS: team recommendation system through expertise and cohesiveness BIBAFull-Text 201-204
  Anwitaman Datta; Jackson Tan Teck Yong; Anthony Ventresque
Searching for people by exploring social network structure is an interesting problem which has recently gathered a lot of attention. Expert recommendation is an important but extensively researched problem; in contrast, the generalized problem of team recommendation has received little study. The purpose of this demo is to show a multidisciplinary team search and recommendation prototype. While the current demo uses a specific (NTU academic) data-set, the framework is generic and can be extended to other domains, subject to the availability of suitable information.
Accelerating instant question search with database techniques BIBAFull-Text 205-208
  Takeharu Eda; Toshio Uchiyama; Katsuji Bessho; Norifumi Katafuchi; Alice Chen; Ryoji Kataoka
Distributed question answering services, like Yahoo! Answers and Aardvark, are known to be useful to end users and have opened up numerous research topics across many fields. In this paper, we propose a user-support tool for composing questions in such services. Our system incrementally recommends similar questions while users type their question as a sentence, giving them the opportunity to discover similar questions that have already been solved. The question database is semantically analyzed and searched in the semantic space, with the performance of similarity searches boosted by database techniques such as server/client caching and LSH (Locality Sensitive Hashing). The more text the user enters, the more similar the recommendations become to the ultimately desired question. This unconscious editing-as-a-sequence-of-searches approach helps users form their question incrementally through interactive supplementary information. Not only askers and repliers but also service providers benefit: the service's knowledge base is refined autonomously, because novice users are kept from repeating questions that have already been solved.
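As a rough illustration of the LSH ingredient, here is a minimal random-hyperplane LSH index in Python; the vector representation of questions, the caching layers, and all parameters are assumptions, not details from the demo.

```python
import random

# Questions are mapped into buckets by random-hyperplane signatures, so
# candidates for a partially typed question come from one bucket instead
# of a scan over the whole database.

DIM, BITS = 64, 8
random.seed(0)
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(vec):
    """BITS-bit signature: the sign of the projection on each hyperplane."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) > 0)
                 for plane in PLANES)

index = {}   # signature -> list of question ids

def add_question(qid, vec):
    index.setdefault(signature(vec), []).append(qid)

def candidates(vec):
    """Similar-question candidates sharing the partial query's bucket."""
    return index.get(signature(vec), [])
```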
CoSi: context-sensitive keyword query interpretation on RDF databases BIBAFull-Text 209-212
  Haizhou Fu; Sidan Gao; Kemafor Anyanwu
The demo will present CoSi, a system that enables context-sensitive interpretation of keyword queries on RDF databases. The techniques for representing, managing and exploiting query history are central to achieving this objective. The demonstration will show the effectiveness of our approach for capturing a user's querying context from their query history. Further, it will show how context is utilized to influence the interpretation of a new query. The demonstration is based on DBPedia, the RDF representation of Wikipedia.
Blognoon: exploring a topic in the blogosphere BIBAFull-Text 213-216
  Maria Grineva; Maxim Grinev; Dmitry Lizorkin; Alexander Boldakov; Denis Turdakov; Andrey Sysoev; Alexander Kiyko
We demonstrate Blognoon, a semantic blog search engine focused on topic exploration and navigation. Blognoon provides concept search instead of traditional keyword search, and improves ranking by identifying the main topics of posts. It enhances navigation over the Blogosphere with faceted interfaces and recommendations.
Visual query system for analyzing social semantic web BIBAFull-Text 217-220
  Jinghua Groppe; Sven Groppe; Andreas Schleifer
The social web is becoming increasingly popular and important because it creates collective intelligence, which can produce more value than the sum of individuals. The social web uses the Semantic Web technology RDF to describe social data in a machine-readable way. RDF query languages certainly play an important role in social data analysis for extracting collective intelligence. However, constructing such queries is not trivial, since social data is often quite large and assembled from many different sources. To address these challenges, we developed a Visual Query System (VQS) to help analysts of social and other semantic data formulate such queries easily and exactly. In this VQS, we provide a condensed data view, a browser-like query creation system for absolute beginners, and a Visual Query Language (VQL) for beginners and experienced users. Using the browser-like query creation or the VQL, analysts can construct queries with little or no knowledge of syntax; using the condensed view, they can easily determine which queries should be used. Furthermore, our system supports precise suggestions for extending and refining existing queries.
Embedding MindMap as a service for user-driven composition of web applications BIBAFull-Text 221-224
  Adnene Guabtni; Stuart Clarke; Boualem Benatallah
The World Wide Web is evolving into a very large distributed platform allowing ubiquitous access to a wide range of Web applications with minimal delay and no installation required. Such Web applications range from simple tasks, such as filling a form, to more complex tasks including collaborative work, project management, and, more generally, creating, consulting, annotating, and sharing Web content. However, users lack a simple yet powerful mechanism to compose Web applications, similar to what desktop environments have provided for decades with the file explorer paradigm and the desktop metaphor. Attempts have been made to adapt the desktop metaphor to the Web environment, giving birth to Webtops (Web desktops). These essentially embed a desktop environment in a Web browser and provide access to various Web applications within the same user interface. However, such attempts did not take into consideration the radical differences between Web and desktop environments and applications. In this work, we introduce a new approach to Web application composition based on the mindmap metaphor. It allows browsing artifacts (Web resources) and enables user-driven composition of their associated Web applications. Essentially, a mindmap is a graph of widgets representing artifacts created or used by Web applications, which lets users list and launch all Web applications associated with each artifact. A tool has been developed to experiment with the new metaphor and is provided as a service to be embedded in Web applications via a Web browser plug-in. We demonstrate three case studies involving the DBLP Web site, Wikipedia, and Google Picasa.
Helix: online enterprise data analytics BIBAFull-Text 225-228
  Oktie Hassanzadeh; Songyun Duan; Achille Fokoue; Anastasios Kementsietsidis; Kavitha Srinivas; Michael J. Ward
The size, heterogeneity, and dynamicity of data within an enterprise make indexing, integration, and analysis of the data increasingly difficult tasks. On the other hand, there has been a massive increase in the amount of high-quality open data available on the Web that could provide invaluable insights to data analysts and business intelligence specialists within the enterprise. The goal of the Helix project is to provide users within the enterprise with a platform that allows them to perform online analysis of almost any type and amount of internal data using the power of external knowledge bases available on the Web. Such a platform requires a novel, data-format-agnostic indexing mechanism and lightweight data linking techniques that can link semantically related records across internal and external data sources with varied characteristics. We present the initial architecture of our system and discuss several research challenges involved in building it.
YAGO2: exploring and querying world knowledge in time, space, context, and many languages BIBAFull-Text 229-232
  Johannes Hoffart; Fabian M. Suchanek; Klaus Berberich; Edwin Lewis-Kelham; Gerard de Melo; Gerhard Weikum
We present YAGO2, an extension of the YAGO knowledge base with focus on temporal and spatial knowledge. It is automatically built from Wikipedia, GeoNames, and WordNet, and contains nearly 10 million entities and events, as well as 80 million facts representing general world knowledge. An enhanced data representation introduces time and location as first-class citizens. The wealth of spatio-temporal information in YAGO can be explored either graphically or through a special time- and space-aware query language.
A demo search engine for products BIBAFull-Text 233-236
  Beibei Li; Anindya Ghose; Panagiotis G. Ipeirotis
Most product search engines today build on models of relevance devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different from the process of locating relevant documents or objects. We propose a theoretical model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest consumer surplus after the purchase. We instantiate our research by building a demo search engine for hotels that takes into account heterogeneous consumer preferences and accounts for varying hotel prices. Moreover, we achieve this without explicitly asking for the preferences or purchasing histories of individual consumers, relying instead on aggregate demand data. This new ranking system is able to recommend products with the "best value for money" in a privacy-preserving manner. The demo is accessible at http://nyuhotels.appspot.com/
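The ranking step itself reduces to sorting by surplus, i.e., estimated willingness to pay minus price. A minimal sketch follows; estimating utilities from aggregate demand data (the hard part of the paper) is assumed done, and the numbers are hypothetical.

```python
def rank_by_surplus(products):
    """products: list of (name, estimated_utility_in_dollars, price_in_dollars)."""
    return sorted(products, key=lambda p: p[1] - p[2], reverse=True)

hotels = [('Hotel A', 250.0, 220.0),   # surplus 30
          ('Hotel B', 180.0, 120.0),   # surplus 60
          ('Hotel C', 300.0, 290.0)]   # surplus 10
print([name for name, _, _ in rank_by_surplus(hotels)])
# ['Hotel B', 'Hotel A', 'Hotel C'] -- the cheapest-utility gap wins,
# not the highest price or the highest raw utility
```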
DIDO: a disease-determinants ontology from web sources BIBAFull-Text 237-240
  Victoria Nebot Romero; Jae-Hong Eom; Min Ye; Gerhard Weikum
This paper introduces DIDO, a system providing convenient access to knowledge about factors involved in human diseases, automatically extracted from textual Web sources. The knowledge base is bootstrapped by integrating entities from hand-crafted sources like MeSH and OMIM. As these are short on relationships between different types of biomedical entities, DIDO employs flexible and robust pattern learning and constraint-based reasoning methods to automatically extract new relational facts from textual sources. These facts can then be iteratively added to the knowledge base. The result is a semantic graph of typed entities and relations between diseases, their symptoms, and their factors, with emphasis on environmental factors but covering also molecular determinants. We demonstrate the value of DIDO for knowledge discovery about causal factors and properties of complex diseases, including factor-disease chains.
A tool for fast indexing and querying of graphs BIBAFull-Text 241-244
  Dipali Pal; Praveen R. Rao
We present a tool called GiS for indexing and querying a large database of labeled, undirected graphs. Such graphs can model chemical compounds, represent contact maps constructed from 3D structure of proteins, and so forth. GiS supports exact subgraph matching and approximate graph matching queries. It adopts a suite of new techniques and algorithms for (a) fast construction of disk-based indexes with small index sizes, and (b) efficient query processing with high precision of matching. During the demo, the user can index real graph datasets using a recommendation facility in GiS, pose exact subgraph matching and approximate graph matching queries, and view matching graphs using the Jmol browser.
A user-tunable approach to marketplace search BIBAFull-Text 245-248
  Nish Parikh; Neel Sundaresan
The notion of relevance is key to the performance of search engines as they interpret user queries and respond with matching results. Online search engines have used features beyond pure IR features to return relevant matching documents. However, over-emphasis on relevance can lead to redundancy in search results. In document search, diversity is simply the variety of documents that span the result set. In an online marketplace, the diversity of the result set is represented by items for sale by different sellers, at different prices, with different sales options. For such a marketplace, to minimize query abandonment and the risk of dissatisfaction for the average user, several factors like diversity, trust, and value need to be taken into account. Previous work in this field [4] has shown an impossibility result: no single function can optimize all these factors. Since these factors, and the measures associated with them, can be subjective, we take the approach of giving control back to the user.
   In this paper we describe an interface which enables users to have more control over the optimization function used to present the results. We demonstrate this for search on eBay -- one of the largest online marketplaces with a vibrant user community and dynamic inventory. We use an algorithm based on bounded greedy selection [5] to construct the result set based on parameters specified by the user.
Truthy: mapping the spread of astroturf in microblog streams BIBAFull-Text 249-252
  Jacob Ratkiewicz; Michael Conover; Mark Meiss; Bruno Gonçalves; Snehal Patil; Alessandro Flammini; Filippo Menczer
Online social media are complementing and in some cases replacing person-to-person social interaction and redefining the diffusion of information. In particular, microblogs have become crucial grounds on which public relations, marketing, and political battles are fought. We demonstrate a web service that tracks political memes in Twitter and helps detect astroturfing, smear campaigns, and other misinformation in the context of U.S. political elections. We also present some cases of abusive behaviors uncovered by our service. Our web service is based on an extensible framework that will enable the real-time analysis of meme diffusion in social media by mining, visualizing, mapping, classifying, and modeling massive streams of public microblogging events.
VoiSTV: voice-enabled social TV BIBAFull-Text 253-256
  Bernard Renger; Junlan Feng; Ovidiu Dan; Harry Chang; Luciano Barbosa
Until recently, the TV viewing experience has not been a very social activity compared to activities on the World Wide Web. In this work, we present a Voice-enabled Social TV system (VoiSTV) which allows users to interact with, follow, and monitor online social media messages related to a TV show while watching it. Users can create, send, and reply to messages using spoken language. VoiSTV also provides metadata about TV shows such as trends, hot topics, and popularity, as well as the aggregated sentiment of show-related messages, all of which are valuable for TV program search and recommendation.
Adapting a map query interface for a gesturing touch screen interface BIBAFull-Text 257-260
  Hanan Samet; Benjamin E. Teitler; Marco D. Adelfio; Michael D. Lieberman
NewsStand is an example application of a general framework that we are developing to enable searching for information using a map query interface, where the information results from monitoring the output of over 8,000 RSS news sources and is available for retrieval within minutes of publication. The user interface of NewsStand was recently adapted so that NewsStand can execute on mobile and tablet devices with a gesturing touch screen interface such as the iPhone, iPod Touch, and iPad. This adaptation led to the discovery of some shortcomings of current mapping APIs, as well as to some interesting new widgets. These issues are discussed, and the realization can be seen via a demo at http://newsstand.umiacs.umd.edu on any of the above Apple devices, as well as on other devices that support gestures, such as an Android phone.
OXPath: little language, little memory, great value BIBAFull-Text 261-264
  Andrew Jon Sellers; Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart
Data about everything is readily available on the web -- but often only accessible through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.
CONQUER: a system for efficient context-aware query suggestions BIBAFull-Text 265-268
  Christian Sengstock; Michael Gertz
Many of today's search engines provide autocompletion while the user is typing a query string. This type of dynamic query suggestion can help users formulate queries that better represent their search intent during Web search interactions. In this paper, we demonstrate our query suggestion system CONQUER, which efficiently suggests queries for a given partial query and a number of available query context observations. Context-awareness allows suggestions to be tailored to a given context, e.g., the user's location or the time of day. CONQUER uses a suggestion model based on the combined probabilities of sequential query patterns and context observations. The weight of a context in a query suggestion can be adjusted online, for example based on learned user behavior or user profiles. We demonstrate the functionality of CONQUER on 6 million queries from an AOL query log, using the time of day and the country domain of the URLs clicked in the search results as context observations.
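A minimal sketch of such context-weighted suggestion scoring, assuming a linear interpolation between the sequential-pattern probability and the context probability; the interpolation form, the lambda parameter, and all probability tables are assumptions for illustration, not CONQUER's actual model.

```python
def suggest(patterns, context_probs, lam=0.5):
    """patterns: {completion: P(completion | typed prefix)};
       context_probs: {completion: P(completion | context observation)};
       lam: online-adjustable weight of the context."""
    scored = {c: (1 - lam) * p + lam * context_probs.get(c, 0.0)
              for c, p in patterns.items()}
    return sorted(scored, key=scored.get, reverse=True)

patterns = {'pizza delivery': 0.5, 'pizza recipe': 0.4}
evening_ctx = {'pizza delivery': 0.7, 'pizza recipe': 0.1}
print(suggest(patterns, evening_ctx, lam=0.7))
# ['pizza delivery', 'pizza recipe'] -- delivery wins in the evening context
```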
CATE: context-aware timeline for entity illustration BIBAFull-Text 269-272
  Tran Anh Tuan; Shady Elbassuoni; Nicoleta Preda; Gerhard Weikum
Wikipedia has become one of the most authoritative information sources on the Web. Each article in Wikipedia provides a portrait of a certain entity. However, such a portrait is far from complete. An informative portrait of an entity should also reveal the context the entity belongs to. For example, for a person, major historical, political and cultural events that coincide with her life are important and should be included in that person's portrait. Similarly, the person's interactions with other people are also important. All this information should be summarized and presented in an appealing and interactive visual interface that enables users to quickly scan the entity's portrait.
   We demonstrate CATE which is a system that utilizes Wikipedia to create a portrait of a given entity of interest. We provide a visualization tool that summarizes the important events related to the entity. The novelty of our approach lies in seeing the portrait of an entity in a broader context, synchronous with its time.
Einstein: physicist or vegetarian? summarizing semantic type graphs for knowledge discovery BIBAFull-Text 273-276
  Tomasz Tylenda; Mauro Sozio; Gerhard Weikum
The Web and, in particular, knowledge-sharing communities such as Wikipedia contain a huge amount of information encompassing disparate and diverse fields. Knowledge bases such as DBpedia or Yago represent the data in a concise and more structured way bearing the potential of bringing database tools to Web Search. The wealth of data, however, poses the challenge of how to retrieve important and valuable information, which is often intertwined with trivial and less important details. This calls for an efficient and automatic summarization method.
   In this demonstration proposal, we consider the novel problem of summarizing the information related to a given entity, like a person or an organization. To this end, we utilize the rich type graph that knowledge bases provide for each entity, and define the problem of selecting the best cost-restricted subset of types as summary with good coverage of salient properties.
   We propose a demonstration of our system which allows the user to specify the entity to summarize, an upper bound on the cost of the resulting summary, as well as to browse the knowledge base in a more simple and intuitive manner.

Tutorials

Social media analytics: tracking, modeling and predicting the flow of information through networks BIBAFull-Text 277-278
  Jure Leskovec
Online social media represent a fundamental shift in how information is produced, transferred, and consumed. User-generated content in the form of blog posts, comments, and tweets establishes a connection between the producers and the consumers of information. Tracking the pulse of social media outlets enables companies to gain feedback and insight into how to improve and market products better. For consumers, the abundance of information and opinions from diverse sources helps them tap into the wisdom of crowds, aiding them in making more informed decisions.
   The present tutorial investigates techniques for social media modeling, analytics, and optimization. First, we present methods for collecting large-scale social media data, and then discuss techniques for coping with and correcting for the effects arising from missing and incomplete data. We proceed by discussing methods for extracting and tracking information as it spreads among users. Then we examine methods for extracting the temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity of and attention given to a particular piece of content, and how to build predictive models of information diffusion and adoption. As information often spreads through implicit social and information networks, we present methods for inferring networks of influence and diffusion. Lastly, we discuss methods for tracking the flow of sentiment through networks and the emergence of polarization.
Distributed web retrieval BIBAFull-Text 279-280
  Ricardo Baeza-Yates
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is constantly evolving. The number of Web sites continues to grow rapidly (over 270 million at the beginning of 2011) and there are currently more than 20 billion indexed pages. On the other hand, Internet users number above one billion, and hundreds of millions of queries are issued each day. In the near future, centralized systems are likely to become less effective against such a data-query load, suggesting the need for fully distributed search engines. Such engines need to maintain high-quality answers, fast response times, high query throughput, high availability, and scalability, in spite of network latency and scattered data. In this tutorial we present the architecture of current search engines and explore the main challenges in designing all the processes of a distributed Web retrieval system: crawling, indexing, and query processing.
WWW 2011 invited tutorial overview: latent variable models on the internet BIBAFull-Text 281-282
  Amr Ahmed; Alexander Smola
Graphical models are an effective tool for analyzing structured and relational data. In particular, they allow us to arrive at insights that are implicit, i.e. latent in the data. Dealing with such data on the internet poses a range of challenges. Firstly, the sheer size renders many well-known inference algorithms infeasible. Secondly, the problems arising on the internet do not always fit well into the known categories for latent variable inference such as Latent Dirichlet Allocation or clustering.
   In this tutorial we address a number of aspects. Firstly, we present a variety of applications ranging from general-purpose document analysis, ideology detection, clustering of sequential data, and dynamic user profiling to recommender systems and data integration. Secondly, we give an overview of a number of popular models such as mixture models, topic models, nonparametric variants of temporal dependence, and an integrated analysis and clustering approach, all of which can be used to solve a range of data analysis problems at hand. Thirdly, we present a range of sampling-based algorithms for large-scale distributed inference using multicore systems and clusters of workstations.
Social recommender systems BIBAFull-Text 283-284
  Ido Guy; David Carmel
The goal of this tutorial is to expose participants to the current research on social recommender systems (i.e., recommender systems for the social web). Participants will become familiar with state-of-the-art recommendation methods, their classifications according to various criteria, common evaluation methodologies, and potential applications that can utilize social recommender systems. Additionally, open issues and challenges in the field will be discussed.
Ranking on large-scale graphs with rich metadata BIBAFull-Text 285-286
  Bin Gao; Taifeng Wang; Tie-Yan Liu
For many Web applications, one needs to deal with ranking problems on large-scale graphs with rich metadata, yet it is non-trivial to perform ranking on them efficiently and effectively. On one hand, we need to design scalable algorithms; on the other hand, we need to develop powerful computational infrastructure to support these algorithms. This tutorial aims to give a timely introduction to promising recent advances in both of these aspects and to provide the audience with a comprehensive view of the related literature.
Managing crowdsourced human computation: a tutorial BIBAFull-Text 287-288
  Panagiotis G. Ipeirotis; Praveen K. Paritosh
The tutorial covers an emerging topic of wide interest: crowdsourcing. Specifically, we cover areas of crowdsourcing related to managing structured and unstructured data in a web-related context. Many researchers and practitioners today see the great opportunity offered by easily available crowdsourcing platforms. However, most newcomers face the same questions: How can we manage the (noisy) crowds to generate high-quality output? How do we estimate the quality of the contributors? How can we best structure the tasks? How can we get results quickly while minimizing the necessary resources? How do we set up the incentives? How should such crowdsourcing markets be set up? The presented material covers topics from a variety of fields, including computer science, statistics, economics, and psychology, and includes real-life examples and case studies from years of experience running and managing crowdsourcing applications in business settings.
Citizen sensor data mining, social media analytics and development centric web applications BIBAFull-Text 289-290
  Meena Nagarajan; Amit Sheth; Selvam Velmurugan
With the rapid rise in the popularity of social media (500M+ Facebook users, 100M+ Twitter users), and near-ubiquitous mobile access (4+ billion actively used mobile phones), the sharing of observations and opinions has become commonplace (nearly 100M tweets a day, 1.8 trillion SMSs in the US last year). This has given us unprecedented access to the pulse of a populace and the ability to perform analytics on social data to support a variety of socially intelligent applications -- be it for targeted online content delivery, crisis management, organizing revolutions, or promoting social development in underdeveloped and developing countries.
   This tutorial will address challenges and techniques for building applications that support a broad variety of users and types of social media. This tutorial will focus on social intelligence applications for social development, and cover the following research efforts in sufficient depth: 1) understanding and analysis of informal text, esp. microblogs (e.g., issues of cultural entity extraction and role of semantic/background knowledge enhanced techniques), and 2) building social media analytics platforms. Technical insights will be coupled with identification of computational techniques and real-world examples.
Game theoretic models for social network analysis BIBAFull-Text 291-292
  Narahari Yadati; Ramasuri Narayanam
The existing methods and techniques for social network analysis are inadequate to capture both the behavior (such as rationality and intelligence) of individuals and the strategic interactions that occur among them. Game theory is a natural tool to overcome this inadequacy, since it provides rigorous mathematical models of strategic interaction among autonomous, intelligent, and rational agents. Motivated by this observation, this tutorial provides the conceptual underpinnings of the use of game-theoretic models in social network analysis. In the first part of the tutorial, we provide rigorous foundations of the relevant concepts in game theory and social network analysis. In the second part, we present a comprehensive study of four contemporary and pertinent problems in social networks: social network formation, determining influential individuals for viral marketing, query incentive networks, and community detection.
Speech and multimodal interaction in mobile search BIBAFull-Text 293-294
  Junlan Feng; Michael Johnston; Srinivas Bangalore
This tutorial highlights the characteristics of mobile search compared with its desktop counterpart, reviews state-of-the-art technologies for speech-based mobile search, and presents opportunities for exploiting multimodal interaction to optimize the efficiency of mobile search. It is suitable for students, researchers, and practitioners working in the areas of spoken language processing, multimodality, and search, with an emphasis on a synergistic integration of these technologies for applications on mobile devices. We provide a detailed bibliography and sufficient literature for anyone interested in jumpstarting work on this topic.
Scalable integration and processing of linked data BIBAFull-Text 295-296
  Andreas Harth; Aidan Hogan; Spyros Kotoulas; Jacopo Urbani
The goal of this tutorial is to introduce, motivate and detail techniques for integrating heterogeneous structured data from across the Web. Inspired by the growth in Linked Data publishing, our tutorial aims at educating Web researchers and practitioners about this new publishing paradigm. The tutorial will show how Linked Data enables uniform access, parsing and interpretation of data, and how this novel wealth of structured data can potentially be exploited for creating new applications or enhancing existing ones.
   As such, the tutorial will focus on Linked Data publishing and related Semantic Web technologies, introducing scalable techniques for crawling, indexing and automatically integrating structured heterogeneous Web data through reasoning.
Web-based open-domain information extraction BIBAFull-Text 297-298
  Marius Pasca
This tutorial provides an overview of extraction methods developed in the area of Web-based open-domain information extraction, whose purpose is the acquisition of open-domain classes, instances and relations from Web text. The extraction methods operate over unstructured or semi-structured text. They take advantage of weak supervision provided in the form of seed examples or small amounts of annotated data, or draw upon knowledge already encoded within resources created strictly by experts or collaboratively by users. The tutorial teaches the audience about existing resources that include instances and relations; details of methods for extracting such data from structured and semi-structured text available on the Web; and strengths and limitations of resources extracted from text as part of recent literature, with applications in knowledge discovery and information retrieval.
The web of things BIBAFull-Text 299-300
  Carolina Fortuna; Marko Grobelnik
The Web, like other successful man-made systems, is continuously evolving. With the miniaturization and increased performance of computing devices, which are now also embedded in common physical objects, it is natural that the Web has evolved to include these as well -- hence the Web of Things. This tutorial provides an overview of the vertical system structure by identifying the relevant components, illustrating their functionality, and showing existing tools and systems. The aim is to show how small devices can be connected to the Web at various levels of abstraction, transforming them into "first-class" Web residents.

Workshop summaries

Eighth workshop on information integration on the web (IIWeb 2011) BIBAFull-Text 301-302
  Ullas B. Nambiar; L. Venkata Subramaniam
The goal of the 8th Workshop on Information Integration on the Web (IIWeb) is to bring together academic researchers and industry practitioners in Information Integration, with a special focus on integrating cyber-physical systems for building a sustainable ecosystem for life on our planet. Towards this goal, the workshop program consists of an engaging set of talks and papers.
4th linked data on the web workshop (LDOW2011) BIBAFull-Text 303-304
  Christian Bizer; Tom Heath; Tim Berners-Lee; Michael Hausenblas
The Web has developed into a global information space consisting not just of linked documents, but also of Linked Data. In 2010, we have seen significant growth in the size of the Web of Data, as well as in the number of communities contributing to its creation. In addition to publishing and interlinking datasets, there is intensive work on developing Linked Data browsers, Linked Data crawlers, Web of Data search engines and other applications that consume Linked Data from the Web.
   The goal of the 4th Linked Data on the Web workshop (LDOW2011) is to provide a forum for exposing high quality research on Linked Data as well as to showcase innovative Linked Data applications. In addition, by bringing together researchers in this field, we expect the event to further shape the Linked Data research agenda.
USEWOD2011: 1st international workshop on usage analysis and the web of data BIBAFull-Text 305-306
  Bettina Berendt; Laura Hollink; Vera Hollink; Markus Luczak-Rösch; Knud Möller; David Vallet
The USEWOD2011 workshop investigates combinations of usage data with semantics and the web of data. The analysis of usage data may be enhanced using semantic information. Now that more and more explicit knowledge is represented on the Web, the question arises of how these semantics can be used to aid large-scale web usage analysis and mining.
   Conversely, usage data analysis can enhance semantic resources as well as Semantic Web applications. Traces of users can be used to evaluate, adapt or personalize Semantic Web applications. Also, new ways of accessing information enabled by the Web of Data imply the need to develop or adapt algorithms, methods, and techniques to analyze and interpret the usage of Web data instead of Web pages.
   The USEWOD2011 program includes a challenge to the workshop participants: three months before the workshop two datasets consisting of server log files of Linked Open Data sources were released. Participants are invited to come up with interesting analyses, applications, alignments, etc. for these datasets.
The 1st temporal web analytics workshop (TWAW) BIBAFull-Text 307-308
  Ricardo Baeza-Yates; Julien Masanès; Marc Spaniol
The objective of the 1st Temporal Web Analytics Workshop (TWAW) is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) where the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop's ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension in Web analysis. The maturity of the Web and the emergence of large-scale repositories of Web material make this very timely, and a growing set of research efforts and services with this focus in common (e.g., Recorded Future and Truthy, both launched in recent months) are emerging. We believe a dedicated workshop will help foster a rich and cross-domain approach to this new research challenge, with a strong focus on the temporal dimension.
First international workshop on social media engagement (SoME 2011) BIBAFull-Text 309-310
  Alejandro Jaimes; Mounia Lalmas; Yana Volkovich
The goal of this workshop is to encourage discussion and sharing of ideas and research results on social media engagement. We aim to promote interdisciplinary research and exchange of ideas in this area, not only between industry and academia, but also between different fields (e.g., computer science, mathematics, physics, psychology, sociology, cultural anthropology, etc.). In particular, we would like to discuss approaches to address some of the serious research challenges we face in devising engagement metrics, in developing methodologies, and in understanding how different technical approaches can be used to enhance our understanding of user behavior in social media.
Second international workshop on RESTful design (WS-REST 2011) BIBAFull-Text 311-312
  Cesare Pautasso; Erik Wilde; Rosa Alarcon
Over the past few years, the debate between the two major architectural styles for designing and implementing Web services, the RPC-oriented approach and the resource-oriented approach, has been held mainly outside of traditional research communities. Mailing lists, forums and developer communities have seen long and fascinating debates around the assumptions, strengths, and weaknesses of these two approaches. The Second International Workshop on RESTful Design (WS-REST 2011) has the goal of getting more researchers involved in the debate by providing a forum where discussions around the resource-oriented style of Web services design take place. Representational State Transfer (REST) is an architectural style and as such can be applied in different ways, can be extended by additional constraints, or can be specialized with more specific interaction patterns. WS-REST is the premier forum for discussing research ideas, novel applications and results centered around REST at the World Wide Web conference, which provides a great setting to host this second edition of the workshop dedicated to research on the architectural style underlying the Web.
Joint WICOW/AIRWeb workshop on web quality (WebQuality 2011) BIBAFull-Text 313-314
  Carlos Castillo; Zoltan Gyongyi; Adam Jatowt; Katsumi Tanaka
In this paper we overview the Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 2011) that was held in conjunction with the 20th International World Wide Web Conference in Hyderabad, India.
SemSearch'11: the 4th semantic search workshop BIBAFull-Text 315-316
  Thanh Tran; Peter Mika; Haofen Wang; Marko Grobelnik
The use of semantics and semantic technologies for search and retrieval has attracted interest from both academia and industry in recent years. What is now commonly known as Semantic Search is in fact a broad field encompassing ideas and concepts from different areas, including Information Retrieval, the Semantic Web, and databases. This is the fourth edition of the Semantic Search workshop, which aims to bring together researchers and practitioners from various communities to provide a forum for dissemination, discussion, and the exchange and transfer of knowledge related to the use of semantics for search and retrieval. This year's workshop will continue to push and promote efforts towards an evaluation benchmark for Semantic Search systems.
Second international workshop on web science and information exchange in the medical web (MedEx 2011) BIBAFull-Text 317-318
  Kerstin Denecke; Peter Dolog
The amount of social media data dealing with medical and health issues has increased significantly in the last couple of years. Medical social media data now provides a new source of information for information-gathering contexts. Facts, experiences, opinions and information on behavior can be found in Medicine or Health 2.0 and could support a broad range of applications. This workshop is devoted to technologies for dealing with social and multimedia data for medical information gathering and exchange. This data and the processes of information gathering pose many challenges, given the increasing content on the Web and the trade-off of filtering noise at the cost of losing potentially relevant information.
DiversiWeb 2011: first international workshop on knowledge diversity on the web BIBAFull-Text 319-320
  Elena Simperl; Devika P. Madalli; Denny Vrandecic; Enrique Alfonseca
The workshop provides an interdisciplinary forum for researchers and practitioners to present and discuss their ideas related to the challenges posed by diversity on the Web. We address a wide array of interdisciplinary questions, which need to be tackled in order to preserve the fragile balance between a world that is continually converging and growing together, the rich diversity of the global society, and the dangers of fragmentation and splintering.
Workshop on online reputation: context, privacy, and reputation management BIBAFull-Text 321-322
  Judd Antin; Elizabeth F. Churchill; Bee-Chung Chen
In this workshop we bring together researchers and practitioners from diverse disciplines to discuss the future of online reputation systems. Our goal is to combine social and technical perspectives to address three challenges: (1) the social challenges around reputation, privacy, and online identity, (2) the technical challenges around designing adaptable reputation systems which cater to users' privacy concerns, and (3) the user experience challenges around transparency and the design of reputation management tools.
PlayIT 2011: first international workshop on games for knowledge acquisition BIBAFull-Text 323-324
  Katharina Siorpaes; Elena Simperl; Arpita Ghosh; Michael Fink
Many problems in knowledge acquisition, such as image labeling, still rely on extensive human input and intervention. In order to attract people to invest the necessary time in such tasks, rewarding incentives and motivation mechanisms have been employed. While recruiting "human cycles" for such tasks is difficult, online games manage to attract plenty of attention because they provide inherent incentives such as fun and competition. This workshop focuses on games that embed various knowledge acquisition tasks into the context of online games, with the end goal of attracting sufficient manual labor.

Panel session

The computer is the new sewing machine: benefits and perils of crowdsourcing BIBAFull-Text 325-326
  Praveen Paritosh; Panos Ipeirotis; Matt Cooper; Siddharth Suri
There is increased participation by the developing world in the global manufacturing marketplace: the sewing machine in Bangladesh can be a means to support an entire family. Crowdsourcing for cognitive tasks consists of asking humans questions that are otherwise impossible for algorithms to answer, e.g., is this image pornographic, are these two addresses the same, what is the translation of this text into French? In the last five years, there has been exponential growth in the size of the global cognitive marketplace: Amazon.com's Mechanical Turk has an estimated 500,000 active workers in over 100 countries, and there are dozens of other companies in this space. This turns the computer into a modern-day sewing machine, where cognitive work of various levels of difficulty will pay anywhere from 5 to 50 dollars a day. Unlike outsourcing, which usually requires a college education, competence at these tasks might require a month or even less of training. At its best, this could be a powerful bootstrap for a billion people. At its worst, this can lead to unprecedented exploitation. In this panel, we discuss the technical, social and economic questions and implications that a global cognitive marketplace raises.
Social media: source of information or bunch of noise BIBAFull-Text 327-328
  Amr El Abbadi; Lars Backstrom; Soumen Chakrabarti; Alejandro Jaimes; Jure Leskovec; Andrew Tomkins
Social media has witnessed explosive growth in the past few years. Wikipedia has over 3.5 million pages with descriptions of entities, Flickr members have uploaded over 5 billion photos, 35 hours of video are uploaded to YouTube each minute, and Twitter users generate 65 million tweets a day. While some forms of social media like Wikipedia clearly have valuable information embedded in them, the jury is still out on other forms like tweets, comments, and social network (e.g., Facebook) updates. Some of the key questions that the panel will debate include: Is there useful information in social media like tweets? How to extract structured records from unstructured user-generated content like reviews? How to sift through the vast amounts of social media and filter out the spam/offensive content? How to rank social media like blogs and comments based on relevance or importance?
   How can social media be leveraged to achieve tasks like entity disambiguation, question answering, improved search, etc.? What are the novel Web applications where social media can be leveraged?
Connecting the next billion web users BIBAFull-Text 329-330
  Rajeev Rastogi; Ed Cutrell; Manish Gupta; Ashok Jhunjhunwala; Ramkumar Narayan; Rajeev Sangal
With 2 billion users, the World Wide Web has indeed come a long way. However, of the 4.8 billion people living in Asia and Africa, only 1 in 5 has access to the Web. In India, for instance, the 100 million Web users constitute less than 10% of the total population of 1.2 billion. It is thus widely expected that the next billion users will come from emerging markets like Brazil, China, India, Indonesia and Russia. Emerging markets have a number of unique characteristics: large, dense populations with low incomes; a lack of infrastructure in terms of broadband, electricity, etc.; poor PC penetration due to limited affordability; high illiteracy rates; a plethora of local languages and dialects; a general paucity of local content, especially in local languages; and explosive growth in the number of mobile phones. The panel will debate the various technical challenges in overcoming the digital divide, and potential approaches to bring the Web to the underserved populations of the developing world.

PhD symposium

Addressing the RDFa publishing bottleneck BIBAFull-Text 331-336
  Xi Bai
In the more dynamic environments emerging from ad hoc and peer-to-peer networks, our research has explored the extent to which Web-based knowledge sharing and community formation require automation to understand human-readable content in a more distributed manner. RDFa is a syntactic format which can alleviate this issue by allowing machine-readable data to be easily integrated into XHTML Web pages. Although there is a growing number of tools and techniques for generating and distilling RDFa, comparatively little work has been carried out on publishing existing RDF data sets as an XHTML+RDFa serialization. This paper proposes a generic approach to integrating RDF data into Web pages using the concept of automatically discovered "topic nodes". RDFa² is a proof-of-concept implementation of this approach and provides an on-line service assisting users in generating and personalizing pages with RDFa. We provide experimental results that support the viability of our approach to generating Web documents, such as FOAF-based online profiles as well as RDF vocabularies, with little user intervention.
Towards liquid service oriented architectures BIBAFull-Text 337-342
  Daniele Bonetta; Cesare Pautasso
The advent of Cloud computing platforms and the growing pervasiveness of multicore processor architectures have revealed the inadequacy of traditional programming models based on sequential computation, opening up many challenges for research on parallel programming models for building distributed, service-oriented systems. More specifically, the dynamic nature of Cloud computing and its virtualized infrastructure pose new challenges in terms of application design, deployment and dynamic reconfiguration. An application developed to be delivered as a service in the Cloud has to deal with poorly understood issues such as elasticity, infinite scalability and portability across heterogeneous virtualized environments. In this position paper we define the problem of providing a novel parallel programming model for building application services that can be transparently deployed on multicore and cloud execution environments. To this end, we introduce and motivate a research plan for the definition of a novel programming framework for Web service-based applications. Our vision, called "Liquid Architecture", is based on a programming model inspired by core ideas of the REST architectural style, coupled with a self-configuring runtime that allows transparent deployment of Web services on a broad range of heterogeneous platforms, from multicores to clouds.
Analysis and tracking of emotions in English and Bengali texts: a computational approach BIBAFull-Text 343-348
  Dipankar Das
The present discussion highlights the aspects of an ongoing doctoral thesis grounded in the analysis and tracking of emotions in English and Bengali texts. The development of lexical resources and corpora addresses the preliminary needs. The research aims to identify evaluative emotional expressions at the word, phrase, sentence, and document level granularities, along with their associated holders and topics. Tracking of emotions based on topic or event was carried out by employing sense-based affect scoring techniques. Labeled emotion corpora are being prepared from unlabeled examples to cope with the scarcity of emotional resources, especially for a resource-constrained language like Bengali. The different unsupervised, supervised and semi-supervised strategies adopted for each strand of the research produce satisfactory outcomes.
Computational advertising: leveraging user interaction & contextual factors for improved ad retrieval & ranking BIBAFull-Text 349-354
  Kushal S. Dave
Computational advertising, popularly known as online advertising or Web advertising, refers to finding the most relevant ads matching a particular context on the web. It is a scientific sub-discipline at the intersection of information retrieval, statistical modeling, machine learning, optimization, large scale search and text analysis. The core problem attacked in computational advertising (CA) is the matchmaking between ads and context. Based on the context, CA can be broadly compartmentalized into the following three areas: sponsored search, contextual advertising and social advertising. Sponsored search refers to the placement of ads on search results pages. Contextual advertising deals with matching advertisements to third-party web pages. We refer to the placement of ads on a social networking page, leveraging a user's social contacts, as social advertising.
   My research work aims at leveraging various user interactions, ad- and advertiser-related information, and contextual information for these three areas of advertising. The research focuses on identifying the various factors that contribute to retrieving and ranking the set of ads that best match the context. Specifically, information associated with the user, publisher and advertiser is leveraged for this purpose.
Standing on the shoulders of ants: stigmergy in the web BIBAFull-Text 355-360
  Aiden Charles Dipple
Stigmergy is a biological term used when discussing insect or swarm behavior, and describes a model of communication through the environment, separate from artefacts or agents. This phenomenon is demonstrated in the behavior of ants following pheromone trails in their food-gathering process, or similarly in termites and their mound-building process. What is interesting about this mechanism is that highly organized societies are achieved without an apparent management structure.
   Stigmergic behavior is implicit in the Web, where the volume of users provides self-organization and self-contextualization of content on sites that facilitate collaboration. However, the majority of content is generated by a minority of the Web participants. A significant contribution of this research would be to create a model of Web stigmergy, identifying virtual pheromones and their importance in the collaborative process.
   This paper explores how exploiting stigmergy has the potential to provide a valuable mechanism for identifying and analyzing online user behavior, recording actionable knowledge otherwise lost in existing web interaction dynamics. Ultimately this might assist in building better collaborative Web sites.
Ranked answer graph construction for keyword queries on RDF graphs without distance neighbourhood restriction BIBAFull-Text 361-366
  A Parthasarathy K.; Sreenivasa P. Kumar; Dominic Damien
RDF and RDFS have recently become very popular as frameworks for representing data and meta-data in the form of a domain description, respectively. RDF data can also be thought of as graph data. In this paper, we focus on keyword-based querying of RDF data. In the existing approaches for answering such keyword queries, keywords are mapped to nodes in the graph and their neighborhoods are explored to extract subgraph(s) of the data graph that contain information relevant to the query. In order to restrict the computational effort, a fixed distance bound is used to define the neighborhoods of nodes. In this paper we present an elegant algorithm for keyword query processing on RDF data that does not assume such a fixed bound. The approach adopts a pruned exploration mechanism where closely related nodes are identified and subgraphs are pruned and joined using suitable hook nodes. The system dynamically manages the distance depending on the closeness between the keywords. The working of the algorithm is illustrated using a fragment of AIFB institute data represented as an RDF graph.
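   The distance-adaptive exploration idea can be pictured with a small sketch (a simplified illustration under assumed data structures, not the paper's actual algorithm): breadth-first frontiers are expanded from each keyword's matching nodes level by level, stopping as soon as some node, a candidate hook, has been reached from every keyword, so the explored distance adapts to how close the keywords actually lie in the graph.

      from collections import deque

      def find_hook_nodes(graph, keyword_nodes):
          # graph: node -> list of neighbours; keyword_nodes: one set of
          # matching nodes per query keyword.
          visited = [dict.fromkeys(nodes, 0) for nodes in keyword_nodes]
          queues = [deque(nodes) for nodes in keyword_nodes]
          while any(queues):
              for i, q in enumerate(queues):
                  # Expand each keyword's frontier by one level per round,
                  # so exploration depth adapts instead of being fixed.
                  for _ in range(len(q)):
                      node = q.popleft()
                      for nbr in graph.get(node, []):
                          if nbr not in visited[i]:
                              visited[i][nbr] = visited[i][node] + 1
                              q.append(nbr)
              # Hook nodes: reached from every keyword; subgraphs join here.
              hooks = set.intersection(*(set(v) for v in visited))
              if hooks:
                  return hooks
          return set()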
A politeness recognition tool for Hindi: with special emphasis on online texts BIBAFull-Text 367-372
  Ritesh Kumar
This paper gives an overview of a politeness recognition tool (PoRT) for Hindi that is currently under development. It describes the kinds of problems that need to be tackled before developing the tool, the approach and methodology that will be adopted for the development and testing of the tool, the current progress, and the future plan for achieving this goal.
Measurement and analysis of cyberlocker services BIBAFull-Text 373-378
  Aniket Mahanti
Cyberlocker Services (CLS) such as RapidShare and Megaupload have recently become popular. The decline of Peer-to-Peer (P2P) file sharing has prompted users to turn to various replacement services, including CLS. We propose a comprehensive multi-level characterization of the CLS ecosystem. We answer three research questions: (a) what is a suitable measurement infrastructure for gathering CLS workloads; (b) what are the characteristics of the CLS ecosystem; and (c) what are the implications of CLS for Web 2.0 (and the Internet)? To the best of our knowledge, this work is the first to characterize the CLS ecosystem. The work will highlight the content, usage, performance, infrastructure, quality of service, and evolution characteristics of CLS.
Fuzzy associative rule-based approach for pattern mining and identification and pattern-based classification BIBAFull-Text 379-384
  Ashish Mangalampalli; Vikram Pudi
Associative Classification leverages Association Rule Mining (ARM) to train rule-based classifiers. The classifiers are built on high quality Association Rules mined from the given dataset. Associative Classifiers are very accurate because Association Rules encapsulate all the dominant and statistically significant relationships between items in the dataset. They are also very robust, as noise in the form of insignificant and low-frequency itemsets is eliminated during the mining and training stages. Moreover, the rules are easy to comprehend, thus making the classifier transparent.
   Conventional Associative Classification and Association Rule Mining (ARM) algorithms are inherently designed to work only with binary attributes, and expect any quantitative attributes to be converted to binary ones using ranges, like "Age = [25, 60]". In order to mitigate this constraint, Fuzzy logic is used to convert quantitative attributes to fuzzy binary attributes, like "Age = Middle-aged", so as to eliminate any loss of information arising due to sharp partitioning, especially at partition boundaries, and then generate Fuzzy Association Rules using an appropriate Fuzzy ARM algorithm. These Fuzzy Association Rules can then be used to train a Fuzzy Associative Classifier. In this paper, we also show how Fuzzy Associative Classifiers so built can be used in a wide variety of domains and datasets, like transactional datasets and image datasets.
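   The fuzzification step can be pictured with a small sketch: a crisp quantitative value is mapped to graded memberships in overlapping fuzzy sets, so a value near a partition boundary contributes to both adjacent sets instead of being sharply cut off. The triangular membership function and the partition points below are illustrative choices, not taken from the paper.

      def triangular(x, a, b, c):
          # Triangular membership function rising from a, peaking at b,
          # falling to zero at c.
          if x <= a or x >= c:
              return 0.0
          return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

      # Hypothetical fuzzy partitions for a quantitative "Age" attribute.
      AGE_SETS = {
          "Young":       (0, 20, 35),
          "Middle-aged": (25, 42, 60),
          "Old":         (50, 70, 100),
      }

      def fuzzify_age(age):
          # Replace the crisp value with graded memberships, so a value
          # near a partition boundary belongs partly to both fuzzy sets.
          return {"Age=" + name: triangular(age, *abc)
                  for name, abc in AGE_SETS.items()}

      print(fuzzify_age(26))  # non-zero in both Young and Middle-aged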
Performance enhancement of scheduling algorithms in clusters and grids using improved dynamic load balancing techniques BIBAFull-Text 385-390
  Hemant Kumar Mehta; Priyesh Kanungo; Manohar Chandwani
This paper describes research work done during PhD study. Cluster computing, grid computing and cloud computing are distributed computing environments (DCEs) widely accepted for the next generation of Web-based commercial and scientific applications. These applications work on globally distributed data of petabyte scale that can only be processed by aggregating the capabilities of globally distributed resources. Resource management and process scheduling in large scale distributed computing environments are challenging tasks. In this research work we have devised new scheduling algorithms and resource management strategies specially designed for cluster, grid, cloud and peer-to-peer computing. The research work finally presents distributed computing solutions for one scientific and one commercial application, viz. e-Learning and data mining.
Wikipedia vandalism detection BIBAFull-Text 391-396
  Santiago M. Mola-Velasco
Wikipedia is an online encyclopedia that anyone can access and edit. It has become one of the most important sources of knowledge online, and many third party projects rely on it for a wide range of purposes. The open model of Wikipedia allows pranksters, lobbyists and spammers to attack the integrity of the encyclopedia, and this endangers it as a public resource. This is known in the community as vandalism.
   A plethora of methods have been developed within the Wikipedia and the scientific community to tackle this problem. We have participated in this effort and developed one of the leading approaches. Our research aims to create a fully-working antivandalism system and get it working in the real world.
Dynamic learning-based mechanism design for dependent valued exchange economies BIBAFull-Text 397-402
  Swaprava Nath
Learning private information from multiple strategic agents poses a challenge in many Internet applications. Sponsored search auctions, crowdsourcing platforms such as Amazon's Mechanical Turk, and various online review forums are examples where we are interested in learning the true values of advertisers or the true opinions of reviewers. The common thread in these decision problems is that the optimal outcome depends on the private information of all the agents, while the outcome can be chosen only through reported information, which may be manipulated by the strategic agents. The other important trait of these applications is their dynamic nature. The advertisers in an online auction or the users of Mechanical Turk arrive and depart, and when present, interact with the system repeatedly, giving the opportunity to learn their types. Dynamic mechanisms, which learn from past interactions and make present decisions depending on the expected future evolution of the game, have been shown to improve performance over repeated versions of static mechanisms. In this paper, we survey past and current state-of-the-art dynamic mechanisms and analyze a new setting, known as an exchange economy, in which the agents consist of buyers and sellers and have value interdependency; such settings are relevant in applications illustrated through examples. We show that known results for dynamic mechanisms in independent-value settings cannot guarantee certain desirable properties in this new, significantly different setting. As future work, we propose to analyze similar settings with dynamic types and population.
Sentence-level contextual opinion retrieval BIBAFull-Text 403-408
  Sylvester Olubolu Orimaye
Existing opinion retrieval techniques do not provide context-dependent relevant results. Most of the approaches used by state-of-the-art techniques are based on the frequency of query terms, such that all documents containing query terms are retrieved, regardless of contextual relevance to the intent of the human seeking the opinion. However, in a particular opinionated document, words can occur in different contexts yet still meet the frequency threshold attached to a certain opinion, thus creating a bias in the overall opinion retrieved. In this paper we propose a sentence-level contextual model for opinion retrieval using grammatical tree derivations and an approval voting mechanism. Model evaluation comparing our contextual model with BM25 and a language model shows that the model can be effective for contextual opinion retrieval tasks such as faceted opinion retrieval.
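   A minimal sketch of the approval-voting idea, with toy stand-ins for the opinion and context tests (the paper's actual tests use grammatical tree derivations, not the bag-of-words checks below): each sentence casts one vote only if it is both opinionated and contextually on-topic, so frequent but off-context query-term matches contribute nothing to the document score.

      OPINION_WORDS = {"great", "terrible", "love", "hate", "disappointing"}

      def is_opinionated(sentence):
          # Toy lexicon test standing in for a real subjectivity classifier.
          return any(w in OPINION_WORDS for w in sentence.lower().split())

      def matches_context(sentence, query_terms):
          # Toy bag-of-words test standing in for grammatical derivations.
          tokens = set(sentence.lower().split())
          return all(t in tokens for t in query_terms)

      def approval_vote_score(doc_sentences, query_terms):
          # One approval vote per sentence that is BOTH opinionated and
          # on-context; off-context term matches score nothing.
          return sum(is_opinionated(s) and matches_context(s, query_terms)
                     for s in doc_sentences)

      doc = ["the battery life is great", "battery prices rose last year"]
      print(approval_vote_score(doc, ["battery"]))  # -> 1, not 2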
The OXPath to success in the deep web BIBAFull-Text 409-414
  Andrew Jon Sellers
The World Wide Web provides access to a wealth of data. Collecting and maintaining such large amounts of data necessitates automated processing for extraction, since appropriate automation can perform extraction tasks that would otherwise be infeasible. Modern web interfaces, however, are generally designed primarily for human users, delivering sophisticated interactions through the use of client-side scripting and asynchronous server communication. To address this, we introduce OXPath, a careful extension of XPath that facilitates data extraction from the deep web. OXPath exploits XPath's familiarity and theoretical foundations. OXPath achieves favourable evaluation complexity and optimal page buffering, storing only a constant number of pages for non-recursive queries. Further, OXPath provides a lightweight interface, which is easy to use and embed. This paper outlines the motivation, theoretical framework, current implementation, and preliminary results obtained so far. We conclude with proposed future work on OXPath, including an investigation of how to deploy OXPath efficiently in a highly elastic computing framework (cloud).
Cooperative anti-spam system based on multilayer agents BIBAFull-Text 415-420
  Wenxuan Shi; Maoqiang Xie; Yalou Huang
Spam is unsolicited bulk email which is extremely annoying to recipients and ISPs. However, most traditional spam filtering methods neglect the bulk character of spam. This paper proposes a model of a cooperative anti-spam system based on multilayer agents. We compared our model to the state of the art and found that it achieved better performance and robustness on several known corpora.
Summarization of archived and shared personal photo collections BIBAFull-Text 421-426
  Pinaki Sinha
The volume of personal photos hosted on photo archives and social sharing platforms has been increasing exponentially. It is difficult to get an overview of a large collection of personal photos without browsing through the entire database manually. In this research, we propose a framework to generate representative subset summaries of photo collections hosted on web archives or social networks. We define salient properties of an effective photo summary and model summarization as an optimization of these properties given the size constraints. We also introduce metrics for evaluating photo summaries based on their information content and their ability to satisfy users' information needs. Our experiments show that our summarization framework performs better than baseline algorithms.
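   The optimization view can be pictured with a small greedy sketch: photos are added one at a time by marginal gain of an objective trading off per-photo quality against coverage of the collection, until the size constraint is met. The objective, the quality and coverage stand-ins, and the data layout below are illustrative assumptions, not the paper's model.

      def greedy_summary(photos, k, quality, coverage):
          # Greedily add the photo with the largest marginal gain of an
          # objective balancing photo quality and collection coverage.
          def objective(subset):
              return sum(quality(p) for p in subset) + coverage(subset)
          summary, candidates = [], list(photos)
          while len(summary) < k and candidates:
              best = max(candidates, key=lambda p: objective(summary + [p]))
              summary.append(best)
              candidates.remove(best)
          return summary

      # Toy usage: photos as (id, tag set, quality score) triples.
      photos = [("p1", {"beach"}, 0.9), ("p2", {"beach"}, 0.8),
                ("p3", {"food"}, 0.5)]
      quality = lambda p: p[2]
      coverage = lambda s: len(set().union(*(p[1] for p in s)))
      print(greedy_summary(photos, 2, quality, coverage))  # picks p1, p3

   Note how the coverage term makes the second pick favour the lower-quality but more diverse photo, which is the behaviour a representative summary wants.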
Application of semantic web technologies for multimedia interpretation BIBAFull-Text 427-432
  Ruben Verborgh; Rik Van de Walle
Despite numerous outstanding results, highly complex and specialized multimedia algorithms have not been able to fulfill the promise of fully automated multimedia interpretation. An essential problem is that they are insufficiently aware of the context they operate in. Algorithms that do take some form of context into consideration often function in a domain-specific environment. The generic framework proposed in this paper stimulates algorithm collaboration on an interpretation task by continuously actualizing the context of the multimedia item under interpretation. Semantic Web knowledge, combined with reasoning methods, forms the cornerstone of the integration of these various interacting agents. We believe that this framework will enable an advanced interpretation of multimedia data that goes beyond the capabilities of individual algorithms. A basic platform implementation already indicates the potential of the concept, clearing the path for even more complex interpretation scenarios.

Emerging regions

The Lwazi community communication service: design and piloting of a voice-based information service BIBAFull-Text 433-442
  Aditi Sharma Grover; Etienne Barnard
We present the design, development and pilot process of the Lwazi Community Communication Service (LCCS), a multilingual automated telephone-based information service. The service acts as a communication and dissemination tool that enables managers at local community centres to broadcast information (e.g. health, employment, social grants) to community workers and the communities they serve. The LCCS allows the recipients to obtain up-to-date, relevant information in a timely and efficient manner, overcoming the obstacles of transportation, time and costs incurred in trying to physically obtain information from the community centres. We discuss our experiences and fieldwork in piloting the LCCS at six locations nationally in the eleven official South African languages. We analyze the usage pattern from the pilot call logs and thereafter discuss the implications of these findings for future projects that design similar automated services for serving rural communities in developing world regions.
Analyzing and accelerating web access in a school in peri-urban India BIBAFull-Text 443-452
  Jay Chen; David Hutchful; William Thies; Lakshminarayanan Subramanian
While computers and Internet access have growing penetration amongst schools in the developing world, intermittent connectivity and limited bandwidth often prevent them from being fully utilized by students and teachers. In this paper, we make two contributions to help address this problem. First, we characterize six weeks of HTTP traffic from a primary school outside of Bangalore, India, illuminating opportunities and constraints for improving performance in such settings. Second, we deploy an aggressive caching and prefetching engine and show that it accelerates a user's overall browsing experience (apart from video content) by 2.8x. Our accelerator leverages innovative techniques that have been proposed, but not evaluated in detail, including the effectiveness of serving stale pages, cached page highlighting, and client-side prefetching. Unlike proxy-based techniques, our system is bundled as an open-source Firefox plugin and runs directly on client machines. This allows easy installation and configuration by end users, which is especially important in developing regions where a lack of permissions or technical expertise often prevents modification of internal network settings.
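   The serving-stale-pages technique the paper evaluates can be pictured with a small cache sketch: a cached copy is returned immediately even past its freshness lifetime, and the URL is queued for background revalidation, since on an intermittent link a stale page usually beats a slow or failed fetch. The class and its interface below are illustrative assumptions, not the plugin's actual code.

      import time

      class StaleServingCache:
          # Serve cached copies immediately, even past their TTL, and
          # queue stale URLs for revalidation when connectivity allows.
          def __init__(self, fetch, ttl=3600):
              self.fetch, self.ttl = fetch, ttl
              self.store, self.refresh_queue = {}, []

          def get(self, url):
              entry = self.store.get(url)
              if entry is not None:
                  body, fetched_at = entry
                  if time.time() - fetched_at > self.ttl:
                      self.refresh_queue.append(url)  # refresh off the critical path
                  return body                         # stale beats slow on a poor link
              body = self.fetch(url)                  # cold miss: must hit the network
              self.store[url] = (body, time.time())
              return body

          def refresh_pending(self):
              # Called opportunistically, e.g. when the link is idle.
              while self.refresh_queue:
                  url = self.refresh_queue.pop()
                  self.store[url] = (self.fetch(url), time.time())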
Design and implementation of contextual information portals BIBAFull-Text 453-462
  Jay Chen; Russell Power; Lakshminarayanan Subramanian; Jonathan Ledlie
This paper presents a system for enabling offline web use to satisfy the information needs of disconnected communities. We describe the design, implementation, evaluation, and pilot deployment of an automated mechanism to construct Contextual Information Portals (CIPs). CIPs are large searchable information repositories of web pages tailored to the information needs of a target population. We combine an efficient classifier with a focused crawler to gather the web pages for the portal for any given topic. Given a set of topics of interest, our system constructs a CIP containing the most relevant pages from the web across these topics. Using several secondary school course syllabi, we demonstrate the effectiveness of our system for constructing CIPs for use as an education resource. We evaluate our system across several metrics: classification accuracy, crawl scalability, crawl accuracy and harvest rate. We describe the utility and usability of our system based on a preliminary deployment study at an after-school program in India, and also outline our ongoing larger-scale pilot deployment at five schools in Kenya.
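   The classifier-plus-focused-crawler combination can be pictured with a small best-first sketch: fetched pages are scored by a topic classifier, off-topic pages are pruned along with their outlinks, and links from on-topic pages are queued by their parent's score. The callables (fetch, extract_links, relevance) and the threshold are illustrative assumptions, not the system's actual components.

      import heapq

      def focused_crawl(seeds, fetch, extract_links, relevance,
                        budget=1000, threshold=0.5):
          # Best-first focused crawl: a topic classifier scores each
          # fetched page; off-topic pages are pruned with their outlinks.
          frontier = [(-1.0, url) for url in seeds]  # max-heap via negation
          heapq.heapify(frontier)
          seen, portal = set(seeds), []
          while frontier and len(portal) < budget:
              _, url = heapq.heappop(frontier)
              page = fetch(url)
              p = relevance(page)                    # classifier's topic probability
              if p < threshold:
                  continue                           # prune this branch of the crawl
              portal.append((url, p))
              for link in extract_links(page):
                  if link not in seen:
                      seen.add(link)
                      heapq.heappush(frontier, (-p, link))  # inherit parent's promise
          return portal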
Location specific summarization of climatic and agricultural trends BIBAFull-Text 463-472
  Sunandan Chakraborty; Lakshminarayanan Subramanian
Climate change can directly impact agriculture. Failures in different aspects of agriculture, due to climate change and other influencing factors, are extremely rampant in several agrarian economies, and most go unnoticed. In this paper, we describe the design of a system that mines disparate information sources on the Web to automatically summarize important climatic and agricultural trends for any specific location and construct a location-specific climatic and agricultural information portal. We have evaluated the system across 605 different districts in India. The results reveal a pan-India picture of different problem-affected areas. The key findings from this work include that around 64.58% of the districts of India suffer from soil-related issues and 76.02% have water-related problems. We have also manually validated the authenticity of our information sources and validated our summarized results for specific locations against findings in reputed journals and authoritative sources.
Low-infrastructure methods to improve internet access for mobile users in emerging regions BIBAFull-Text 473-482
  Sibren Isaacman; Margaret Martonosi
As information technology supports more aspects of modern life, digital access has become an important tool for developing regions to lift themselves from poverty. Though broadband internet connectivity will not be universally available in the short term, widely-employed mobile devices coupled with novel delay-tolerant networking do allow limited forms of connectivity. This paper explores the design space for internet access systems operating with constrained connectivity. Our starting point is C-LINK, a collaborative caching system that enhances the performance of interactive web access over DTN and cellular connectivity. We discuss our experiences and results from deploying C-LINK in Nicaragua, before moving on to a broader design study of other issues that further influence operation. We consider the impact of (i) web content collaboratively cached across all user nodes, and (ii) hybrid transport layers exploiting the best attributes of limited cellular and DTN-style connectivity. We also explore the behavior of future systems under a range of usage and mobility scenarios. Even under adverse conditions, our techniques can improve average service latency for page requests by a factor of 2x. Our results point to the considerable power of leveraging user mobility and collaboration in providing very-low-infrastructure internet access to developing regions.
Identifying enrichment candidates in textbooks BIBAFull-Text 483-492
  Rakesh Agrawal; Sreenivas Gollapudi; Anitha Kannan; Krishnaram Kenthapadi
Many textbooks written in emerging countries lack clear and adequate coverage of important concepts. We propose a technological solution for algorithmically identifying those sections of a book that are not well written and could benefit from better exposition. We provide a decision model based on the syntactic complexity of the writing and the dispersion of key concepts. The model parameters are learned using a tuning set which is algorithmically generated using a versioned authoritative web resource as a proxy. We evaluate the proposed methodology over a corpus of Indian textbooks; the evaluation demonstrates its effectiveness in identifying enrichment candidates.
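   A minimal sketch of what such a decision model might look like, with crude stand-ins for both features (the paper's actual features, parameters, and learning procedure are not reproduced here): syntactic complexity approximated by average sentence length, and concept dispersion by how thinly the section's key concepts are mentioned.

      def enrichment_score(section, w_complexity=0.5, w_dispersion=0.5):
          # Long, convoluted sentences plus key concepts mentioned only
          # thinly suggest a section that needs better exposition.
          sentences = [s for s in section["text"].split(".") if s.strip()]
          avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
          complexity = avg_len / 40.0                    # crude proxy, capped below
          concepts = section["key_concepts"]
          mentions = sum(section["text"].count(c) for c in concepts)
          dispersion = len(concepts) / max(mentions, 1)  # many concepts, few mentions
          return (w_complexity * min(complexity, 1.0)
                  + w_dispersion * min(dispersion, 1.0))
          # Sections scoring above a learned cutoff become candidates.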
Traffic characterization and internet usage in rural Africa BIBAFull-Text 493-502
  David L. Johnson; Veljko Pejovic; Elizabeth M. Belding; Gertjan van Stam
While Internet connectivity has reached a significant part of the world's population, those living in rural areas of the developing world are still largely disconnected. Recent efforts have provided Internet connectivity to a growing number of remote locations, yet Internet traffic demands cause many of these networks to fail to deliver basic quality of service needed for simple applications. For an in-depth investigation of the problem, we gather and analyze network traces from a rural wireless network in Macha, Zambia. We supplement our analysis with on-site interviews from Macha, Zambia and Dwesa, South Africa, another rural community that hosts a local wireless network. The results reveal that Internet traffic in rural Africa differs significantly from the developed world. We observe dominance of web-based traffic, as opposed to peer-to-peer traffic common in urban areas. Application-wise, online social networks are the most popular, while the majority of bandwidth is consumed by large operating system updates. Our analysis also uncovers numerous network anomalies, such as significant malware traffic. Finally, we find a strong feedback loop between network performance and user behavior. Based on our findings, we conclude with a discussion of new directions in network design that take into account both technical and social factors.
Two-stream indexing for spoken web search BIBAFull-Text 503-512
  Jitendra Ajmera; Anupam Joshi; Sougata Mukherjea; Nitendra Rajput; Shrey Sahay; Mayank Shrivastava; Kundan Srivastava
This paper presents two-stream processing of audio to index audio content for Spoken Web search. The first stream indexes the meta-data associated with a particular audio document. The meta-data is usually very sparse, but accurate; this therefore results in a high-precision, low-recall index. The second stream uses a novel language-independent speech recognition technique to generate text to be indexed. Owing to the multiple languages and the noise in user-generated content on the Spoken Web, the speech recognition accuracy of such systems is not high, so they result in a low-precision, high-recall index. The paper attempts to use these two complementary streams to generate a combined index that increases the precision-recall performance of audio content search.
   The problem of audio content search is motivated by the real-world implications of the Web in developing regions, where, due to literacy and affordability issues, people use the Spoken Web, which consists of interconnected VoiceSites whose content is in audio. The experiments are based on more than 20,000 audio documents spanning seven live VoiceSites and four different languages. The results show significant improvement over a meta-data-only or speech-recognition-only system, thus justifying the two-stream processing approach. Audio content search is a growing problem area, and this paper aims to be a first step toward solving it at large scale, across languages, in a Web context.
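   The two-stream combination can be pictured with a small sketch: per-document scores from the sparse-but-accurate meta-data index and the noisy-but-broad speech-recognition index are merged by a weighted sum, then re-ranked. The index interface and the weights below are illustrative assumptions, not the paper's system or tuned values.

      def two_stream_search(query, meta_index, asr_index,
                            w_meta=0.7, w_asr=0.3):
          # Merge a high-precision meta-data index with a high-recall
          # ASR index via a weighted linear score combination.
          scores = {}
          for doc, s in meta_index.search(query):   # sparse but trustworthy
              scores[doc] = scores.get(doc, 0.0) + w_meta * s
          for doc, s in asr_index.search(query):    # noisy but broad coverage
              scores[doc] = scores.get(doc, 0.0) + w_asr * s
          return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

   Documents found by both streams accumulate both contributions, so agreement between the streams naturally pushes a result toward the top of the ranking.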
Assistive technology for vision-impairments: an agenda for the ICTD community BIBAFull-Text 513-522
  Joyojeet Pal; Manas Pradhan; Mihir Shah; Rakesh Babu
In recent years, ICTD (Information Communications Technology and Development) has grown in significance as an area of engineering research focused on low-cost appropriate technologies for the needs of a developing world largely underserved by the dominant modes of technology design. Assistive Technologies (AT) used by people with disabilities facilitate greater equity in the social and economic public sphere. However, by and large such technologies are designed in the industrialized world, for people living in those countries. This is especially true in the case of AT for people with vision impairments -- market-prevalent technologies are both very expensive and built to support the languages and infrastructure typical of the industrialized world. While the community of researchers in the Web Accessibility space has made significant strides, the operational constraints of networks in the developing world, as well as the challenge of supporting new languages and contexts, raise a new set of problems for technologists in this space. We discuss the state of various technologies in the context of the developing world and propose directions for scientific and community-contributed efforts to increase the relevance of and access to AT and accessibility in the developing world.