HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 01020304-104-205-105-2060708091011-111-212-112-213-113-214-114-215-1

Proceedings of the 2009 International Conference on the World Wide Web

Fullname:Proceedings of the 18th International Conference on World Wide Web
Editors:Juan Quemada; Gonzalo León; Yoelle Maarek; Wolfgang Nejdl
Location:Madrid, Spain
Dates:2009-Apr-20 to 2009-Apr-24
Standard No:ISBN 1-60558-487-8, 978-1-60558-487-4; ACM DL: Table of Contents hcibib: WWW09
Links:Conference Home Page
  1. Data mining/session: click models
  2. Data mining/session: graph algorithms
  3. Data mining/session: text mining
  4. Data mining/session: statistical methods
  5. Data mining/session: opinions
  6. Data mining/session: web mining
  7. Data mining/session: learning
  8. Internet monetization/session: sponsored search
  9. Internet monetization/session: web monetization
  10. Performance, scalability and availability/session: performance
  11. Rich media/session: media applications
  12. Rich media/session: tagging and clustering
  13. Search/session: search UI
  14. Search/session: query processing
  15. Search/session: caching and indices
  16. Search/session: query categorization
  17. Search/session: ads and query expansion
  18. Security and privacy/session: web privacy
  19. Security and privacy/session: web security
  20. Semantic/data web/session: semantic data management
  21. Semantic/data web/session: linked data
  22. Semantic/data web/session: mining for semantics
  23. Social networks and web 2.0/session: recommender systems
  24. Social networks and web 2.0/session: diffusion and search in social networks
  25. Social networks and web 2.0/session: interactions in social communities
  26. Social networks and web 2.0/session: photos and web 2.0
  27. User interfaces and mobile web/session: mobile web
  28. User interfaces and mobile web/session: user interfaces
  29. Web engineering/session: end user web engineering
  30. Web engineering/session: service oriented development
  31. Web engineering/session: web architecture aspect
  32. Web engineering/session: client side web engineering
  33. XML and web data/session: XML extraction and crawling
  34. XML and web data/session: XML querying
  35. WWW in ibero-america
  36. Posters Wednesday, April 22, 2009
  37. Posters Thursday, April 23, 2009
  38. Posters Friday, April 24, 2009

Data mining/session: click models

A dynamic bayesian network click model for web search ranking BIBAKFull-Text 1-10
  Olivier Chapelle; Ya Zhang
As with any application of machine learning, web search ranking requires labeled data. The labels usually come in the form of relevance assessments made by editors. Click logs can also provide an important source of implicit feedback and can be used as a cheap proxy for editorial labels. The main difficulty however comes from the so called position bias -- urls appearing in lower positions are less likely to be clicked even if they are relevant. In this paper, we propose a Dynamic Bayesian Network which aims at providing us with unbiased estimation of the relevance from the click logs. Experiments show that the proposed click model outperforms other existing click models in predicting both click-through rate and relevance.
Keywords: click modeling, click-through rate, dynamic bayesian network, ranking, web search
Click chain model in web search BIBAKFull-Text 11-20
  Fan Guo; Chao Liu; Anitha Kannan; Tom Minka; Michael Taylor; Yi-Min Wang; Christos Faloutsos
Given a terabyte click log, can we build an efficient and effective click model? It is commonly believed that web search click logs are a gold mine for search business, because they reflect users' preference over web documents presented by the search engine. Click models provide a principled approach to inferring user-perceived relevance of web documents, which can be leveraged in numerous applications in search businesses. Due to the huge volume of click data, scalability is a must.
   We present the click chain model (CCM), which is based on a solid, Bayesian framework. It is both scalable and incremental, perfectly meeting the computational challenges imposed by the voluminous click logs that constantly grow. We conduct an extensive experimental study on a data set containing 8.8 million query sessions obtained in July 2008 from a commercial search engine. CCM consistently outperforms two state-of-the-art competitors in a number of metrics, with over 9.7% better log-likelihood, over 6.2% better click perplexity and much more robust (up to 30%) prediction of the first and the last clicked position.
Keywords: bayesian models, click log analysis, web search
Spatio-temporal models for estimating click-through rate BIBAKFull-Text 21-30
  Deepak Agarwal; Bee-Chung Chen; Pradheep Elango
We propose novel spatio-temporal models to estimate click-through rates in the context of content recommendation. We track article CTR at a fixed location over time through a dynamic Gamma-Poisson model and combine information from correlated locations through dynamic linear regressions, significantly improving on per-location model. Our models adjust for user fatigue through an exponential tilt to the first-view CTR (probability of click on first article exposure) that is based only on user-specific repeat-exposure features. We illustrate our approach on data obtained from a module (Today Module) published regularly on Yahoo! Front Page and demonstrate significant improvement over commonly used baseline methods. Large scale simulation experiments to study the performance of our models under different scenarios provide encouraging results. Throughout, all modeling assumptions are validated via rigorous exploratory data analysis.
Keywords: content recommendation, ctr positional correlation

Data mining/session: graph algorithms

Fast dynamic reranking in large graphs BIBAKFull-Text 31-40
  Purnamrita Sarkar; Andrew W. Moore
In this paper we consider the problem of re-ranking search results by incorporating user feedback. We present a graph theoretic measure for discriminating irrelevant results from relevant results using a few labeled examples provided by the user. The key intuition is that nodes relatively closer (in graph topology) to the relevant nodes than the irrelevant nodes are more likely to be relevant. We present a simple sampling algorithm to evaluate this measure at specific nodes of interest, and an efficient branch and bound algorithm to compute the top k nodes from the entire graph under this measure. On quantifiable prediction tasks the introduced measure outperforms other diffusion-based proximity measures which take only the positive relevance feedback into account. On the Entity-Relation graph built from the authors and papers of the entire DBLP citation corpus (1.4 million nodes and 2.2 million edges) our branch and bound algorithm takes about 1.5 seconds to retrieve the top 10 nodes w.r.t. this measure with 10 labeled nodes.
Keywords: harmonic function, random walk, search, semi-supervised learning
Estimating the impressionrank of web pages BIBAKFull-Text 41-50
  Ziv Bar-Yossef; Maxim Gurevich
The ImpressionRank of a web page (or, more generally, of a web site) is the number of times users viewed the page while browsing search results. ImpressionRank captures the visibility of pages and sites in search engines and is thus an important measure, which is of interest to web site owners, competitors, market analysts, and end users.
   All previous approaches to estimating the ImpressionRank of a page rely on privileged access to private data sources, like the search engine's query log. In this paper we present the first external algorithm for estimating the ImpressionRank of a web page. This algorithm relies on access to three public data sources: the search engine, the query suggestion service of the search engine, and the web. In addition, the algorithm is local and uses modest resources. It can therefore be used by almost any party to estimate the ImpressionRank of any page on any search engine.
   En route to estimating the ImpressionRank of a page, our algorithm solves a novel variant of the keyword extraction problem: it finds the most popular search keywords that drive impressions of a page.
   Empirical analysis of the algorithm on the Google and Yahoo! search engines indicates that it is accurate and provides interesting insights about sites and search queries.
Keywords: auto-completions, data mining, estimation, impressionrank, popular keyword extraction, search engines, suggestions
Learning to recognize reliable users and content in social media with coupled mutual reinforcement BIBAKFull-Text 51-60
  Jiang Bian; Yandong Liu; Ding Zhou; Eugene Agichtein; Hongyuan Zha
Community Question Answering (CQA) has emerged as a popular forum for users to pose questions for other users to answer. Over the last few years, CQA portals such as Naver and Yahoo! Answers have exploded in popularity, and now provide a viable alternative to general purpose Web search. At the same time, the answers to past questions submitted in CQA sites comprise a valuable knowledge repository which could be a gold mine for information retrieval and automatic question answering. Unfortunately, the quality of the submitted questions and answers varies widely -- increasingly so that a large fraction of the content is not usable for answering queries. Previous approaches for retrieving relevant and high quality content have been proposed, but they require large amounts of manually labeled data -- which limits the applicability of the supervised approaches to new sites and domains. In this paper we address this problem by developing a semi-supervised coupled mutual reinforcement framework for simultaneously calculating content quality and user reputation, that requires relatively few labeled examples to initialize the training process. Results of a large scale evaluation demonstrate that our methods are more effective than previous approaches for finding high-quality answers, questions, and users. More importantly, our quality estimation significantly improves the accuracy of search over CQA archives over the state-of-the-art methods.
Keywords: authority and expertise in online communities, community question answering

Data mining/session: text mining

Detecting the origin of text segments efficiently BIBAKFull-Text 61-70
  Ossama Abdel Hamid; Behshad Behzadi; Stefan Christoph; Monika Henzinger
In the origin detection problem an algorithm is given a set S of documents, ordered by creation time, and a query document D. It needs to output for every consecutive sequence of k alphanumeric terms in D the earliest document in $S$ in which the sequence appeared (if such a document exists). Algorithms for the origin detection problem can, for example, be used to detect the "origin" of text segments in D and thus to detect novel content in D. They can also find the document from which the author of D has copied the most (or show that D is mostly original.) We concentrate on solutions that use only a fixed amount of memory. We propose novel algorithms for this problem and evaluate them together with a large number of previously published algorithms. Our results show that (1) detecting the origin of text segments efficiently can be done with very high accuracy even when the space used is less than 1% of the size of the documents in $S$, (2) the precision degrades smoothly with the amount of available space, (3) various estimation techniques can be used to increase the performance of the algorithms.
Keywords: document overlap, shingling
Enhancing diversity, coverage and balance for summarization through structure learning BIBAKFull-Text 71-80
  Liangda Li; Ke Zhou; Gui-Rong Xue; Hongyuan Zha; Yong Yu
Document summarization plays an increasingly important role with the exponential growth of documents on the Web. Many supervised and unsupervised approaches have been proposed to generate summaries from documents. However, these approaches seldom simultaneously consider summary diversity, coverage, and balance issues which to a large extent determine the quality of summaries. In this paper, we consider extract-based summarization emphasizing the following three requirements: 1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary; 2) sufficient coverage, which focuses on avoiding the loss of the document's main information when generating the summary; and 3) balance, which demands that different aspects of the document need to have about the same relative importance in the summary. We formulate the extract-based summarization problem as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the above three requirements. The mapping is learned by incorporating several constraints in a structure learning framework, and we explore the graph structure of the output variables and employ structural SVM for solving the resulted optimization problem. Experiments on the DUC2001 data sets demonstrate significant performance improvements in terms of F1 and ROUGE metrics.
Keywords: balance, coverage, diversity, structural svm, summarization
Efficient overlap and content reuse detection in blogs and online news articles BIBAKFull-Text 81-90
  Jong Wook Kim; K. Selçuk Candan; Junichi Tatemura
The use of blogs to track and comment on real world (political, news, entertainment) events is growing. Similarly, as more individuals start relying on the Web as their primary information source and as more traditional media outlets try reaching consumers through alternative venues, the number of news sites on the Web is also continuously increasing. Content-reuse, whether in the form of extensive quotations or content borrowing across media outlets, is very common in blogs and news entries outlets tracking the same real-world event. Knowledge about which web entries re-use content from which others can be an effective asset when organizing these entries for presentation. On the other hand, this knowledge is not cheap to acquire: considering the size of the related space web entries, it is essential that the techniques developed for identifying re-use are fast and scalable. Furthermore, the dynamic nature of blog and news entries necessitates incremental processing for reuse detection. In this paper, we develop a novel qSign algorithm that efficiently and effectively analyze the blogosphere for quotation and reuse identification. Experiment results show that with qSign processing time gains from 10X to 100X are possible while maintaining reuse detection rates of up to 90%. Furthermore, processing time gains can be pushed multiple orders of magnitude (from 100X to 1000X) for 70% recall.
Keywords: reuse detection, weblogs

Data mining/session: statistical methods

Latent space domain transfer between high dimensional overlapping distributions BIBAKFull-Text 91-100
  Sihong Xie; Wei Fan; Jing Peng; Olivier Verscheure; Jiangtao Ren
Transferring knowledge from one domain to another is challenging due to a number of reasons. Since both conditional and marginal distribution of the training data and test data are non-identical, model trained in one domain, when directly applied to a different domain, is usually low in accuracy. For many applications with large feature sets, such as text document, sequence data, medical data, image data of different resolutions, etc. two domains usually do not contain exactly the same features, thus introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, its marginal distributions are at most overlapping. In the same time, these problems are usually high dimensional, such as, several thousands of features. Thus, the combination of high dimensionality and missing values make the relationship in conditional probabilities between two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of two domains closer by "filling up" those missing values of disjoint features. Afterwards, it looks for those comparable sub-structures in the "latent-space" as mapped from the expanded feature vector, where both marginal and conditional distribution are similar. With these sub-structures in latent space, the proposed approach then find common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries", the mostly related labeled instances from out-domain are retrieved, and the classification is made by weighted voting using retrieved out-domain examples. We formally show that importing feature values across domains and latent semantic index can jointly make the distributions of two related domains easier to measure than in original feature space, the nearest neighbor method employed to retrieve related out domain examples is bounded in error when predicting in-domain examples. Software and datasets are available for download.
Keywords: high dimensional, latent, missing value, text mining, transfer learning
StatSnowball: a statistical approach to extracting entity relationships BIBAKFull-Text 101-110
  Jun Zhu; Zaiqing Nie; Xiaojiang Liu; Bo Zhang; Ji-Rong Wen
Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which limit the ability to generalize and thus generate a low recall. Furthermore, existing bootstrapping methods do not perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called Statistical Snowball (StatSnowball), which is a bootstrapping system and can perform both traditional relation extraction and Open IE.
   StatSnowball uses the discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimate sense. MLN is a general model, and can be configured to perform different levels of relation extraction. In StatSnwoball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theories and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve a significantly higher recall without sacrificing the high precision during iterations with a small number of seeds, and the joint inference of MLN can improve the performance. Finally, StatSnowball is efficient and we have developed a working entity relation search engine called Renlifang based on it.
Keywords: Markov logic networks, relationship extraction, statistical models
Matchbox: large scale online bayesian recommendations BIBAKFull-Text 111-120
  David H. Stern; Ralf Herbrich; Thore Graepel
We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional 'trait space' in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don't like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item's popularity, a user's taste or a user's personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.
Keywords: advertising, bayesian inference, collaborative filtering, machine learning, online services, recommender system

Data mining/session: opinions

Learning consensus opinion: mining data from a labeling game BIBAKFull-Text 121-130
  Paul N. Bennett; David Maxwell Chickering; Anton Mityagin
We consider the problem of identifying the consensus ranking for the results of a query, given preferences among those results from a set of individual users. Once consensus rankings are identified for a set of queries, these rankings can serve for both evaluation and training of retrieval and learning systems. We present a novel approach to collecting the individual user preferences over image-search results: we use a collaborative game in which players are rewarded for agreeing on which image result is best for a query. Our approach is distinct from other labeling games because we are able to elicit directly the preferences of interest with respect to image queries extracted from query logs. As a source of relevance judgments, this data provides a useful complement to click data. Furthermore, the data is free of positional biases and is collected by the game without the risk of frustrating users with non-relevant results; this risk is prevalent in standard mechanisms for debiasing clicks. We describe data collected over 34 days from a deployed version of this game that amounts to about 18 million expressed preferences between pairs. Finally, we present several approaches to modeling this data in order to extract the consensus rankings from the preferences and better sort the search results for targeted queries.
Keywords: learning preferences, preference judgments
Rated aspect summarization of short comments BIBAKFull-Text 131-140
  Yue Lu; ChengXiang Zhai; Neel Sundaresan
Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of information poses the need and challenge of automatic summarization. In many cases, each of the user-generated short comments comes with an overall rating. In this paper, we study the problem of generating a "rated aspect summary" of short comments, which is a decomposed view of the overall ratings for the major aspects so that a user could gain different perspectives towards the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods by using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how well human agree on such a summarization task. The proposed methods are quite general and can be used to generate rated aspect summary automatically given any collection of short comments each associated with an overall rating.
Keywords: rated aspect summarization, rating prediction, short comments
How opinions are received by online communities: a case study on amazon.com helpfulness votes BIBAKFull-Text 141-150
  Cristian Danescu-Niculescu-Mizil; Gueorgi Kossinets; Jon Kleinberg; Lillian Lee
There are many on-line settings in which users publicly express opinions. A number of these offer mechanisms for other users to evaluate these opinions; a canonical example is Amazon.com, where reviews come with annotations like "26 of 32 people found the following review helpful." Opinion evaluation appears in many off-line settings as well, including market research and political campaigns. Reasoning about the evaluation of an opinion is fundamentally different from reasoning about the opinion itself: rather than asking, "What did Y think of X?", we are asking, "What did Z think of Y's opinion of X?" Here we develop a framework for analyzing and modeling opinion evaluation, using a large-scale collection of Amazon book reviews as a dataset. We find that the perceived helpfulness of a review depends not just on its content but also but also in subtle ways on how the expressed evaluation relates to other evaluations of the same product. As part of our approach, we develop novel methods that take advantage of the phenomenon of review "plagiarism" to control for the effects of text in opinion evaluation, and we provide a simple and natural mathematical model consistent with our findings. Our analysis also allows us to distinguish among the predictions of competing theories from sociology and social psychology, and to discover unexpected differences in the collective opinion-evaluation behavior of user populations from different countries.
Keywords: online communities, opinion mining, plagiarism, review helpfulness, review utility, sentiment analysis, social influence

Data mining/session: web mining

Exploiting web search to generate synonyms for entities BIBAKFull-Text 151-160
  Surajit Chaudhuri; Venkatesh Ganti; Dong Xin
Tasks recognizing named entities such as products, people names, or locations from documents have recently received significant attention in the literature. Many solutions to these tasks assume the existence of reference entity tables. An important challenge that needs to be addressed in the entity extraction task is that of ascertaining whether or not a candidate string approximately matches with a named entity in a given reference table.
   Prior approaches have relied on string-based similarity which only compare a candidate string and an entity it matches with. In this paper, we exploit web search engines in order to define new similarity functions. We then develop efficient techniques to facilitate approximate matching in the context of our proposed similarity functions. In an extensive experimental evaluation, we demonstrate the accuracy and efficiency of our techniques.
Keywords: entity extraction, similarity measure, synonym generation, web search
Smart Miner: a new framework for mining large scale web usage data BIBAKFull-Text 161-170
  Murat Ali Bayir; Ismail Hakki Toroslu; Ahmet Cosar; Guven Fidan
In this paper, we propose a novel framework called Smart-Miner for web usage mining problem which uses link information for producing accurate user sessions and frequent navigation patterns. Unlike the simple session concepts in the time and navigation based approaches, where sessions are sequences of web pages requested from the server or viewed in the browser, Smart Miner sessions are set of paths traversed in the web graph that corresponds to users' navigations among web pages. We have modeled session construction as a new graph problem and utilized a new algorithm, Smart-SRA, to solve this problem efficiently. For the pattern discovery phase, we have developed an efficient version of the Apriori-All technique which uses the structure of web graph to increase the performance. From the experiments that we have performed on both real and simulated data, we have observed that Smart-Miner produces at least 30% more accurate web usage patterns than other approaches including previous session construction methods. We have also studied the effect of having the referrer information in the web server logs to show that different versions of Smart-SRA produce similar results. Our another contribution is that we have implemented distributed version of the Smart Miner framework by employing Map/Reduce Paradigm. We conclude that we can efficiently process terabytes of web server logs belonging to multiple web sites by our scalable framework.
Keywords: graph mining, map/reduce, parallel data mining, web usage mining, web user modeling
Releasing search queries and clicks privately BIBAKFull-Text 171-180
  Aleksandra Korolova; Krishnaram Kenthapadi; Nina Mishra; Alexandros Ntoulas
The question of how to publish an anonymized search log was brought to the forefront by a well-intentioned, but privacy-unaware AOL search log release. Since then a series of ad-hoc techniques have been proposed in the literature, though none are known to be provably private. In this paper, we take a major step towards a solution: we show how queries, clicks and their associated perturbed counts can be published in a manner that rigorously preserves privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On the opposite side of privacy is the question of whether the data we can safely publish is of any use. Our findings offer a glimmer of hope: we demonstrate that a non-negligible fraction of queries and clicks can indeed be safely published via a collection of experiments on a real search log. In addition, we select an application, keyword generation, and show that the keyword suggestions generated from the perturbed data resemble those generated from the original data.
Keywords: data release, differential privacy, query click graph, search logs

Data mining/session: learning

Incorporating site-level knowledge to extract structured data from web forums BIBAKFull-Text 181-190
  Jiang-Ming Yang; Rui Cai; Yida Wang; Jun Zhu; Lei Zhang; Wei-Ying Ma
Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible to extract structured data, such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show a very encouraging information extraction performance, and demonstrate the ability of the proposed approach on various forums. We also show that the performance is limited if only page-level knowledge is used, while when incorporating the site-level knowledge both precision and recall can be significantly improved.
Keywords: Markov logic networks (MLNS), information extraction, site-level knowledge, structured data, web forums
Towards context-aware search by learning a very large variable length hidden markov model from search logs BIBAKFull-Text 191-200
  Huanhuan Cao; Daxin Jiang; Jian Pei; Enhong Chen; Hang Li
Capturing the context of a user's query from the previous queries and clicks in the same session may help understand the user's information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users' search experience substantially. In this paper, we propose a general approach to context-aware search. To capture contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from search sessions extracted from log data. Although the mathematical model is intuitive, how to learn a large vlHMM with millions of states from hundreds of millions of search sessions poses a grand challenge. We develop a strategy for parameter initialization in vlHMM learning which can greatly reduce the number of parameters to be estimated in practice. We also devise a method for distributed vlHMM learning under the map-reduce model. We test our approach on a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search sessions, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document re-ranking, query suggestion, and URL recommendation. The experimental results show that our approach is both effective and efficient.
Keywords: context-aware search, variable length hidden Markov model
A class-feature-centroid classifier for text categorization BIBAKFull-Text 201-210
  Hu Guan; Jingyu Zhou; Minyi Guo
Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many different approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantages of short training time and testing time due to its computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to SVM, mainly because centroids found during construction are far from perfect locations.
   We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: inter-class term index and inner-class term index. CFC proposes a novel combination of these indices and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and 20-newsgroup email collection show that CFC consistently outperforms the state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. Particularly, CFC is more effective and robust than SVM when data is sparse.
Keywords: centroid, denormalized cosine measure, inner-class, inter-class, text classification
Large scale multi-label classification via metalabeler BIBAKFull-Text 211-220
  Lei Tang; Suju Rajan; Vijay K. Narayanan
The explosion of online content has made the management of such content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, tag recommendation, etc. often involve the construction of multi-label categorization systems on a large scale. Existing multi-label classification methods either do not scale or have unsatisfactory performance. In this work, we propose MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Extensive experiments conducted on benchmark data show that the MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply the MetaLabeler to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.
Keywords: hierarchical classification, large scale, meta model, metalabeler, multi-label classification, query categorization

Internet monetization/session: sponsored search

Hybrid keyword search auctions BIBAKFull-Text 221-230
  Ashish Goel; Kamesh Munagala
Search auctions have become a dominant source of revenue generation on the Internet. Such auctions have typically used per-click bidding and pricing. We propose the use of hybrid auctions where an advertiser can make a per-impression as well as a per-click bid, and the auctioneer then chooses one of the two as the pricing mechanism. We assume that the advertiser and the auctioneer both have separate beliefs (called priors) on the click-probability of an advertisement. We first prove that the hybrid auction is truthful, assuming that the advertisers are risk-neutral. We then show that this auction is superior to the existing per-click auction in multiple ways: We show that risk-seeking advertisers will choose only a per-impression bid whereas risk-averse advertisers will choose only a per-click bid, and argue that both kind of advertisers arise naturally. Hence, the ability to bid in a hybrid fashion is important to account for the risk characteristics of the advertisers. For obscure keywords, the auctioneer is unlikely to have a very sharp prior on the click-probabilities. In such situations, we show that having the extra information from the advertisers in the form of a per-impression bid can result in significantly higher revenue. An advertiser who believes that its click-probability is much higher than the auctioneer's estimate can use per-impression bids to correct the auctioneer's prior without incurring any extra cost. The hybrid auction can allow the advertiser and auctioneer to implement complex dynamic programming strategies to deal with the uncertainty in the click-probability using the same basic auction. The per-click and per-impression bidding schemes can only be used to implement two extreme cases of these strategies. As Internet commerce matures, we need more sophisticated pricing models to exploit all the information held by each of the participants. We believe that hybrid auctions could be an important step in this direction. The hybrid auction easily extends to multiple slots, and is also applicable to scenarios where the hybrid bidding is per-impression and per-action (i.e. CPM and CPA), or per-click and per-action (i.e. CPC and CPA).
Keywords: internet, keyword auctions, mechanism design
Bid optimization for broad match ad auctions BIBAKFull-Text 231-240
  Eyal Even Dar; Vahab S. Mirrokni; S. Muthukrishnan; Yishay Mansour; Uri Nadav
Ad auctions in sponsored search support "broad match" that allows an advertiser to target a large number of queries while bidding only on a limited number. While giving more expressiveness to advertisers, this feature makes it challenging to optimize bids to maximize their returns: choosing to bid on a query as a broad match because it provides high profit results in one bidding for related queries which may yield low or even negative profits.
   We abstract and study the complexity of the {\em bid optimization problem} which is to determine an advertiser's bids on a subset of keywords (possibly using broad match) so that her profit is maximized. In the query language model when the advertiser is allowed to bid on all queries as broad match, we present a linear programming (LP)-based polynomial-time algorithm that gets the optimal profit. In the model in which an advertiser can only bid on keywords, ie., a subset of keywords as an exact or broad match, we show that this problem is not approximable within any reasonable approximation factor unless P=NP. To deal with this hardness result, we present a constant-factor approximation when the optimal profit significantly exceeds the cost. This algorithm is based on rounding a natural LP formulation of the problem. Finally, we study a budgeted variant of the problem, and show that in the query language model, one can find two budget constrained ad campaigns in polynomial time that implement the optimal bidding strategy. Our results are the first to address bid optimization under the broad match feature which is common in ad auctions.
Keywords: ad auctions, bid optimization, optimal bidding, sponsored search
General auction mechanism for search advertising BIBAKFull-Text 241-250
  Gagan Aggarwal; S. Muthukrishnan; Dávid Pál; Martin Pál
In sponsored search, a number of advertising slots is available on a search results page, and have to be allocated among a set of advertisers competing to display an ad on the page. This gives rise to a bipartite matching market that is typically cleared by the way of an automated auction. Several auction mechanisms have been proposed, with variants of the Generalized Second Price (GSP) being widely used in practice. There is a rich body of work on bipartite matching markets that builds upon the stable marriage model of Gale and Shapley and the assignment model of Shapley and Shubik. This line of research offers deep insights into the structure of stable outcomes in such markets and their incentive properties. In this paper, we model advertising auctions in terms of an assignment model with linear utilities, extended with bidder and item specific maximum and minimum prices. Auction mechanisms like the commonly used GSP or the well-known Vickrey-Clarke-Groves (VCG) can be interpreted as simply computing a bidder-optimal stable matching in this model, for a suitably defined set of bidder preferences, but our model includes much richer bidders and preferences. We prove that in our model the existence of a stable matching is guaranteed, and under a non-degeneracy assumption a bidder-optimal stable matching exists as well. We give an algorithm to find such matching in polynomial time, and use it to design truthful mechanism that generalizes GSP, is truthful for profit-maximizing bidders, correctly implements features like bidder-specific minimum prices and position-specific bids, and works for rich mixtures of bidders and preferences. Our main technical contributions are the existence of bidder-optimal matchings and strategyproofness of the resulting mechanism, and are proved by induction on the progress of the matching algorithm.
Keywords: game theory, sponsored search auctions, stable matchings

Internet monetization/session: web monetization

Adaptive bidding for display advertising BIBAKFull-Text 251-260
  Arpita Ghosh; Benjamin I. P. Rubinstein; Sergei Vassilvitskii; Martin Zinkevich
Motivated by the emergence of auction-based marketplaces for display ads such as the Right Media Exchange, we study the design of a bidding agent that implements a display advertising campaign by bidding in such a marketplace. The bidding agent must acquire a given number of impressions with a given target spend, when the highest external bid in the marketplace is drawn from an unknown distribution P. The quantity and spend constraints arise from the fact that display ads are usually sold on a CPM basis. We consider both the full information setting, where the winning price in each auction is announced publicly, and the partially observable setting where only the winner obtains information about the distribution; these differ in the penalty incurred by the agent while attempting to learn the distribution. We provide algorithms for both settings, and prove performance guarantees using bounds on uniform closeness from statistics, and techniques from online learning. We experimentally evaluate these algorithms: both algorithms perform very well with respect to both target quantity and spend; further, our algorithm for the partially observable case performs nearly as well as that for the fully observable setting despite the higher penalty incurred during learning.
Keywords: adaptive bidding, concentration bounds, display advertising, guaranteed delivery, guess-then-double algorithms
How much can behavioral targeting help online advertising? BIBAKFull-Text 261-270
  Jun Yan; Ning Liu; Gang Wang; Wen Zhang; Yun Jiang; Zheng Chen
Behavioral Targeting (BT) is a technique used by online advertisers to increase the effectiveness of their campaigns, and is playing an increasingly important role in the online advertising market. However, it is underexplored in academia when looking at how much BT can truly help online advertising in commercial search engines. To answer this question, in this paper we provide an empirical study on the click-through log of advertisements collected from a commercial search engine. From the comprehensively experiment results on the sponsored search log of the commercial search engine over a period of seven days, we can draw three important conclusions: (1) Users who clicked the same ad will truly have similar behaviors on the Web; (2) Click-Through Rate (CTR) of an ad can be averagely improved as high as 670% by properly segmenting users for behavioral targeted advertising in a sponsored search; (3) Using the short term user behaviors to represent users is more effective than using the long term user behaviors for BT. The statistical t-test verifies that all conclusions drawn in the paper are statistically significant. To the best of our knowledge, this work is the first empirical study for BT on the click-through log of real world ads.
Keywords: behavioral targeting (bt), click-through rate (ctr)., online advertising, user segmentation
Web service derivatives BIBAKFull-Text 271-280
  Thomas Meinl; Benjamin Blau
Web service development and usage has shifted from simple information processing services to high-value business services that are crucial to productivity and success. In order to deal with an increasing risk of unavailability or failure of mission-critical Web services we argue the need for advanced reservation of services in the form of derivatives.
   The contribution of this paper is twofold: First we provide an abstract model of a market design that enables the trade of derivatives for mission-critical Web services. Our model satisfies requirements that result from service characteristics such as intangibility and the impossibility to inventor services in order to meet fluctuating demand. It comprehends principles from models of incomplete markets such as the absence of a tradeable underlying and consistent arbitrage-free derivative pricing.
   Furthermore we provide an architecture for a Web service market that implements our model and describes the strategy space and interaction of market participants in the trading process of service derivatives. We compare the underlying pricing processes to existing derivative models in energy exchanges, discuss eventual shortcomings, and apply Wavelets to analyze actual data and extract long- and short-term trends.
Keywords: derivatives, incomplete markets, services mashups, wavelets, web services

Performance, scalability and availability/session: performance

Efficient application placement in a dynamic hosting platform BIBAKFull-Text 281-290
  Zakaria Al-Qudah; Hussein A. Alzoubi; Mark Allman; Michael Rabinovich; Vincenzo Liberatore
Web hosting providers are increasingly looking into dynamic hosting to reduce costs and improve the performance of their platforms. Instead of provisioning fixed resources to each customer, dynamic hosting maintains a variable number of application instances to satisfy current demand. While existing research in this area has mostly focused on the algorithms that decide on the number and location of application instances, we address the problem of efficient enactment of these decisions once they are made. We propose a new approach to application placement and experimentally show that it dramatically reduces the cost of application placement, which in turn improves the end-to-end agility of the hosting platform in reacting to demand changes.
Keywords: application servers, dynamic placement, startup performance, web hosting
Network-aware forward caching BIBAKFull-Text 291-300
  Jeffrey Erman; Alexandre Gerber; Mohammad T. Hajiaghayi; Dan Pei; Oliver Spatscheck
This paper proposes and evaluates a Network Aware Forward Caching approach for determining the optimal deployment strategy of forward caches to a network. A key advantage of this approach is that we can reduce the network costs associated with forward caching to maximize the benefit obtained from their deployment. We show in our simulation that a 37% increase to net benefits could be achieved over the standard method of full cache deployment to cache all POPs traffic. In addition, we show that this maximal point occurs when only 68% of the total traffic is cached.
   Another contribution of this paper is the analysis we use to motivate and evaluate this problem. We characterize the Internet traffic of 100K subscribers of a US residential broadband provider. We use both layer 4 and layer 7 analysis to investigate the traffic volumes of the flows as well as study the general characteristics of the applications used. We show that HTTP is a dominant protocol and account for 68% of the total downstream traffic and that 34% of that traffic is multimedia. In addition, we show that multimedia content using HTTP exhibits a 83% annualized growth rate and other HTTP traffic has a 53% growth rate versus the 26% over all annual growth rate of broadband traffic. This shows that HTTP traffic will become ever more dominant and increase the potential caching opportunities. Furthermore, we characterize the core backbone traffic of this broadband provider to measure the distance travelled by content and traffic. We find that CDN traffic is much more efficient than P2P content and that there is large skew in the Air Miles between POP in a typical network. Our findings show that there are many opportunities in broadband provider networks to optimize how traffic is delivered and cached.
Keywords: web caching
Anycast-aware transport for content delivery networks BIBAKFull-Text 301-310
  Zakaria Al-Qudah; Seungjoon Lee; Michael Rabinovich; Oliver Spatscheck; Jacobus Van der Merwe
Anycast-based content delivery networks (CDNs) have many properties that make them ideal for the large scale distribution of content on the Internet. However, because routing changes can result in a change of the endpoint that terminates the TCP session, TCP session disruption remains a concern for anycast CDNs, especially for large file downloads. In this paper we demonstrate that this problem does not require any complex solutions. In particular, we present the design of a simple, yet efficient, mechanism to handle session disruptions due to endpoint changes. With our mechanism, a client can continue the download of the content from the point at which it was before the endpoint change. Furthermore, CDN servers purge the TCP connection state quickly to handle frequent switching with low system overhead.
   We demonstrate experimentally the effectiveness of our proposed mechanism and show that more complex mechanisms are not required. Specifically, we find that our mechanism maintains high download throughput even with a reasonably high rate of endpoint switching, which is attractive for load balancing scenarios. Moreover, our results show that edge servers can purge TCP connection state after a single timeout-triggered retransmission without any tangible impact on ongoing connections. Besides improving server performance, this behavior improves the resiliency of the CDN to certain denial of service attacks.
Keywords: anycast, connection disruption, content delivery networks

Rich media/session: media applications

Less talk, more rock: automated organization of community-contributed collections of concert videos BIBAKFull-Text 311-320
  Lyndon Kennedy; Mor Naaman
We describe a system for synchronization and organization of user-contributed content from live music events. We start with a set of short video clips taken at a single event by multiple contributors, who were using a varied set of capture devices. Using audio fingerprints, we synchronize these clips such that overlapping clips can be displayed simultaneously. Furthermore, we use the timing and link structure generated by the synchronization algorithm to improve the findability and representation of the event content, including identifying key moments of interest and descriptive text for important captured segments of the show. We also identify the preferred audio track when multiple clips overlap. We thus create a much improved representation of the event that builds on the automatic content match. Our work demonstrates important principles in the use of content analysis techniques for social media content on the Web, and applies those principles in the domain of live music capture.
Keywords: audio fingerprinting, social media, synchronization, video
A generalised cross-modal clustering method applied to multimedia news semantic indexing and retrieval BIBAKFull-Text 321-330
  Alberto Messina; Maurizio Montagnuolo
Current Web technology has enabled the distribution of informative content through dynamic media platforms.
   In addition, the availability of the same content in the form of digital multimedia data has dramatically increased.
   Content-based, cross-media retrieval applications are needed to efficiently access desired information from this variety of data sources. This paper presents a novel approach for cross-media information aggregation, and describes a prototype system implementing this approach. The prototype adopts online newspaper articles and TV newscasts as information sources, to deliver a service made up of items including both contributions. Extensive experiments prove the effectiveness of the proposed approach in a real-world business context.
Keywords: cross-modal clustering, multimedia mashup, news retrieval
What makes conversations interesting?: themes, participants and consequences of conversations in online social media BIBAKFull-Text 331-340
  Munmun De Choudhury; Hari Sundaram; Ajita John; Dorée Duncan Seligmann
Rich media social networks promote not only creation and consumption of media, but also communication about the posted media item. What causes a conversation to be interesting, that prompts a user to participate in the discussion on a posted video? We conjecture that people participate in conversations when they find the conversation theme interesting, see comments by people whom they are familiar with, or observe an engaging dialogue between two or more people (absorbing back and forth exchange of comments). Importantly, a conversation that is interesting must be consequential -- i.e. it must impact the social network itself.
   Our framework has three parts: characterizing themes, characterizing participants for determining interestingness and measures of consequences of a conversation deemed to be interesting. First, we detect conversational themes using a mixture model approach. Second, we determine interestingness of participants and interestingness of conversations based on a random walk model. Third, we measure the consequence of a conversation by measuring how interestingness affects the following three variables -- participation in related themes, participant cohesiveness and theme diffusion. We have conducted extensive experiments using dataset from the popular video sharing site, YouTube. Our results show that our method of interestingness maximizes the mutual information, and is significantly better (twice as large) than three other baseline methods (number of comments, number of new participants and PageRank based assessment).
Keywords: conversations, interestingness, social media, themes, youtube

Rich media/session: tagging and clustering

Visual diversification of image search results BIBAKFull-Text 341-350
  Reinier H. van Leuken; Lluis Garcia; Ximena Olivares; Roelof van Zwol
Due to the reliance on the textual information associated with an image, image search engines on the Web lack the discriminative power to deliver visually diverse search results. The textual descriptions are key to retrieve relevant results for a given user query, but at the same time provide little information about the rich image content.
   In this paper we investigate three methods for visual diversification of image search results. The methods deploy lightweight clustering techniques in combination with a dynamic weighting function of the visual features, to best capture the discriminative aspects of the resulting set of images that is retrieved. A representative image is selected from each cluster, which together form a diverse result set.
   Based on a performance evaluation we find that the outcome of the methods closely resembles human perception of diversity, which was established in an extensive clustering experiment carried out by human assessors.
Keywords: flickr, image clustering, visual diversity
Tag ranking BIBAKFull-Text 351-360
  Dong Liu; Xian-Sheng Hua; Linjun Yang; Meng Wang; Hong-Jiang Zhang
Social media sharing web sites like Flickr allow users to annotate images with free tags, which significantly facilitate Web image search and organization. However, the tags associated with an image generally are in a random order without any importance or relevance information, which limits the effectiveness of these tags in search and other applications. In this paper, we propose a tag ranking scheme, aiming to automatically rank the tags associated with a given image according to their relevance to the image content. We first estimate initial relevance scores for the tags based on probability density estimation, and then perform a random walk over a tag similarity graph to refine the relevance scores. Experimental results on a 50, 000 Flickr photo collection
   show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking into three applications: (1) tag-based image search, (2) tag recommendation, and (3) group recommendation, which demonstrates that the proposed tag ranking approach really boosts the performances of social-tagging related applications.
Keywords: flickr, random walk, recommendation, search, tag ranking
Learning to tag BIBAKFull-Text 361-370
  Lei Wu; Linjun Yang; Nenghai Yu; Xian-Sheng Hua
Social tagging provides valuable and crucial information for large-scale web image retrieval. It is ontology-free and easy to obtain; however, irrelevant tags frequently appear, and users typically will not tag all semantic objects in the image, which is also called semantic loss. To avoid noises and compensate for the semantic loss, tag recommendation is proposed in literature. However, current recommendation simply ranks the related tags based on the single modality of tag co-occurrence on the whole dataset, which ignores other modalities, such as visual correlation. This paper proposes a multi-modality recommendation based on both tag and visual correlation, and formulates the tag recommendation as a learning problem. Each modality is used to generate a ranking feature, and Rankboost algorithm is applied to learn an optimal combination of these ranking features from different modalities. Experiments on Flickr data demonstrate the effectiveness of this learning-based multi-modality recommendation strategy.
Keywords: learning to tag, multi-modality rankboost, social tagging, tag recommendation

Search/session: search UI

Efficient interactive fuzzy keyword search BIBAKFull-Text 371-380
  Shengyue Ji; Guoliang Li; Chen Li; Jianhua Feng
Traditional information systems return answers after a user submits a complete query. Users often feel "left in the dark" when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called "interactive, fuzzy search," in which the system searches the underlying data "on the fly" as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords approximately. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them has been deployed to support interactive search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency.
Keywords: autocomplete, fuzzy search, interactive search
An axiomatic approach for result diversification BIBAKFull-Text 381-390
  Sreenivas Gollapudi; Aneesh Sharma
Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability ranking principle-based approach of presenting the most relevant results on top can be sub-optimal, and hence the search engine would like to trade-off relevance for diversity in the results.
   In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database.
Keywords: approximation algorithms, axiomatic framework, diversification, facility dispersion, search engine, wikipedia
Quicklink selection for navigational query results BIBAKFull-Text 391-400
  Deepayan Chakrabarti; Ravi Kumar; Kunal Punera
Quicklinks for a website are navigational shortcuts displayed below the website homepage on a search results page, and that let the users directly jump to selected points inside the website. Since the real-estate on a search results page is constrained and valuable, picking the best set of quicklinks to maximize the benefits for a majority of the users becomes an important problem for search engines. Using user browsing trails obtained from browser toolbars, and a simple probabilistic model, we formulate the quicklink selection problem as a combinatorial optimization problem. We first demonstrate the hardness of the objective, and then propose an algorithm that is provably within a factor of 1-1/e of the optimal. We also propose a different algorithm that works on trees and that can find the optimal solution; unlike the previous algorithm, this algorithm can incorporate natural constraints on the set of chosen quicklinks. The efficacy of our methods is demonstrated via empirical results on both a manually labeled set of websites and a set for which quicklink click-through rates for several webpages were obtained from a real-world search engine.
Keywords: navigational queries, quicklinks, toolbar data, trails

Search/session: query processing

Inverted index compression and query processing with optimized document ordering BIBAKFull-Text 401-410
  Hao Yan; Shuai Ding; Torsten Suel
Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performances. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
Keywords: IR query processing, document ordering, index compression, inverted index, search engines
RuralCafe: web search in the rural developing world BIBAKFull-Text 411-420
  Jay Chen; Lakshminarayanan Subramanian; Jinyang Li
The majority of people in rural developing regions do not have access to the World Wide Web. Traditional network connectivity technologies have proven to be prohibitively expensive in these areas. The emergence of new long-range wireless technologies provide hope for connecting these rural regions to the Internet. However, the network connectivity provided by these new solutions are by nature intermittent due to high network usage rates, frequent power-cuts and the use of delay tolerant links. Typical applications, especially interactive applications like web search, do not tolerate intermittent connectivity. In this paper, we present the design and implementation of RuralCafe, a system intended to support efficient web search over intermittent networks. RuralCafe enables users to perform web search asynchronously and find what they are looking for in one round of intermittency as opposed to multiple rounds of search/downloads. RuralCafe does this by providing an expanded search query interface which allows a user to specify additional query terms to maximize the utility of the results returned by a search query. Given knowledge of the limited available network resources, RuralCafe performs optimizations to prefetch pages to best satisfy a search query based on a user's search preferences. In addition, RuralCafe does not require modifications to the web browser, and can provide single round search results tailored to various types of networks and economic constraints. We have implemented and evaluated the effectiveness of RuralCafe using queries from logs made to a large search engine, queries made by users in an intermittent setting, and live queries from a small testbed deployment. We have also deployed a prototype of RuralCafe in Kerala, India.
Keywords: intermittent network, low bandwidth, web search, world wide web
Using graphics processors for high performance IR query processing BIBAKFull-Text 421-430
  Shuai Ding; Jinru He; Hao Yan; Torsten Suel
Web search engines are facing formidable performance challenges due to data sizes and query loads. The major engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. The significant cost of operating these systems has motivated a lot of recent research into more efficient query processing mechanisms. We investigate a new way to build such high performance IR systems using graphical processing units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains such as databases and scientific computing. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, to develop suitable algorithms for subtasks such as inverted list compression, list intersection, and top-$k$ scoring, and to show how to achieve highly efficient query processing on GPU-based systems. Our experimental results for a prototype GPU-based system on $25.2$ million web pages indicate that significant gains in query processing performance can be obtained.
Keywords: GPU, index compression, ir query processing, search engines

Search/session: caching and indices

Improved techniques for result caching in web search engines BIBAKFull-Text 431-440
  Qingqing Gan; Torsten Suel
Query processing is a major cost factor in operating large web search engines. In this paper, we study query result caching, one of the main techniques used to optimize query processing performance. Our first contribution is a study of result caching as a weighted caching problem. Most previous work has focused on optimizing cache hit ratios, but given that processing costs of queries can vary very significantly we argue that total cost savings also need to be considered. We describe and evaluate several algorithms for weighted result caching, and study the impact of Zipf-based query distributions on result caching. Our second and main contribution is a new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method. Finally, using the same approach, we also obtain performance gains for the related problem of inverted list caching.
Keywords: index caching, result caching, search engines, weighted caching
Nearest-neighbor caching for content-match applications BIBAKFull-Text 441-450
  Sandeep Pandey; Andrei Broder; Flavio Chierichetti; Vanja Josifovski; Ravi Kumar; Sergei Vassilvitskii
Motivated by contextual advertising systems and other web applications involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency-accuracy tradeoff and provide our caching policies for these objectives. By conducting extensive experiments on real data we show similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving to advertising systems, namely, long-range dependences and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.
Keywords: caching, content-match, nearest-neighbor
Compressed web indexes BIBAKFull-Text 451-460
  Flavio Chierichetti; Ravi Kumar; Prabhakar Raghavan
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
Keywords: compression, double-pareto, index size, power law

Search/session: query categorization

Unsupervised query categorization using automatically-built concept graphs BIBAKFull-Text 461-470
  Eustache Diemert; Gilles Vandelle
Automatic categorization of user queries is an important component of general purpose (Web) search engines, particularly for triggering rich, query-specific content and sponsored links. We propose an unsupervised learning scheme that reduces dramatically the cost of setting up and maintaining such a categorizer, while retaining good categorization power. The model is stored as a graph of concepts where graph edges represent the cross-reference between the concepts. Concepts and relations are extracted from query logs by an offline Web mining process, which uses a search engine as a powerful summarizer for building a concept graph. Empirical evaluation indicates that the system compares favorably on publicly available data sets (such as KDD Cup 2005) as well as on portions of the current query stream of Yahoo! Search, where it is already changing the experience of millions of Web search users.
Keywords: concept networks, cross-reference, knowledge based search, query categorization, unsupervised learning, web mining
Understanding user's query intent with wikipedia BIBAKFull-Text 471-480
  Jian Hu; Gang Wang; Fred Lochovsky; Jian-tao Sun; Zheng Chen
Understanding the intent behind a user's query can help search engine to automatically route the query to some corresponding vertical search engines to obtain particularly relevant contents, thus, greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) Intent representation; (2) Domain coverage and (3) Semantic interpretation. Current approaches to predict the user's intent mainly utilize machine learning techniques. However, it is difficult and often requires many human efforts to meet all these challenges by the statistical machine learning approaches. In this paper, we propose a general methodology to the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge base. The Wikipedia concepts are used as the intent representation space, thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified through mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage to classify queries in an intent domain even through the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform the quantitative evaluations in comparison with two baseline methods, and the experimental results shows that our method significantly outperforms other methods in each intent domain.
Keywords: query classification, query intent, user intent, wikipedia
Discovering users' specific geo intention in web search BIBAKFull-Text 481-490
  Xing Yi; Hema Raghavan; Chris Leggetter
Discovering users' specific and implicit geographic intention in web search can greatly help satisfy users' information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users' implicit geo intent and pinpoint the city corresponding to this intent, (2) determine whether the geo-intent is localized around the users' current geographic location, (3) predict cities for queries that have a mention of an entity that is located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users' implicit city level geo intent (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighbor region geo queries or none-of-these (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
Keywords: city language model, geo intent, geographic search intent, implicit search intent, local search intent

Search/session: ads and query expansion

A search-based method for forecasting ad impression in contextual advertising BIBAKFull-Text 491-500
  Xuerui Wang; Andrei Broder; Marcus Fontoura; Vanja Josifovski
Contextual advertising (also called content match) refers to the placement of small textual ads within the content of a generic web page. It has become a significant source of revenue for publishers ranging from individual bloggers to major newspapers. At the same time it is an important way for advertisers to reach their intended audience. This reach depends on the total number of exposures of the ad (impressions) and its click-through-rate (CTR) that can be viewed as the probability of an end-user clicking on the ad when shown. These two orthogonal, critical factors are both difficult to estimate and even individually can still be very informative and useful in planning and budgeting advertising campaigns.
   In this paper, we address the problem of forecasting the number of impressions for new or changed ads in the system. Producing such forecasts, even within large margins of error, is quite challenging: 1) ad selection in contextual advertising is a complicated process based on tens or even hundreds of page and ad features; 2) the publishers' content and traffic vary over time; and 3) the scale of the problem is daunting: over a course of a week it involves billions of impressions, hundreds of millions of distinct pages, hundreds of millions of ads, and varying bids of other competing advertisers. We tackle these complexities by simulating the presence of a given ad with its associated bid over weeks of historical data. We obtain an impression estimate by counting how many times the ad would have been displayed if it were in the system over that period of time. We estimate this count by an efficient two-level search algorithm over the distinct pages in the data set. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time. We also show how this method can be used in tools for bid selection and ad evaluation.
Keywords: content match, contextual advertising, impression forecasting, online advertising, wand
Exploiting web search engines to search structured databases BIBAKFull-Text 501-510
  Sanjay Agrawal; Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti; Arnd Christian Konig; Dong Xin
Web search engines often federate many user queries to relevant structured databases. For example, a product related query might be federated to a product database containing their descriptions and specifications. The relevant structured data items are then returned to the user along with web search results. However, each structured database is searched in isolation. Hence, the search often produces empty or incomplete results as the database may not contain the required information to answer the query. In this paper, we propose a novel integrated search architecture. We establish and exploit the relationships between web search results and the items in structured databases to identify the relevant structured data items for a much wider range of queries.
   Our architecture leverages existing search engine components to implement this functionality at very low overhead. We demonstrate the quality and efficiency of our techniques through an extensive experimental study.
Keywords: entity extraction, entity ranking, entity search, structured database search
Online expansion of rare queries for sponsored search BIBAKFull-Text 511-520
  Andrei Broder; Peter Ciccolo; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler; Lance Riedel; Jeffrey Yuan
Sponsored search systems are tasked with matching queries
   to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.
Keywords: query expansion, sponsored search, tail queries

Security and privacy/session: web privacy

Collective privacy management in social networks BIBAKFull-Text 521-530
  Anna Cinzia Squicciarini; Mohamed Shehab; Federica Paci
Social Networking is one of the major technological phenomena of the Web 2.0, with hundreds of millions of people participating. Social networks enable a form of self expression for users, and help them to socialize and share content with other users. In spite of the fact that content sharing represents one of the prominent features of existing Social Network sites, Social Networks yet do not support any mechanism for collaborative management of privacy settings for shared content. In this paper, we model the problem of collaborative enforcement of privacy policies on shared data by using game theory. In particular, we propose a solution that offers automated ways to share images based on an extended notion of content ownership. Building upon the Clarke-Tax mechanism, we describe a simple mechanism that promotes truthfulness, and that rewards users who promote co-ownership. We integrate our design with inference techniques that free the users from the burden of manually selecting privacy preferences for each picture. To the best of our knowledge this is the first time such a protection mechanism for Social Networking has been proposed. In the paper, we also show a proof-of-concept application, which we implemented in the context of Facebook, one of today's most popular social networks. We show that supporting these type of solutions is not also feasible, but can be implemented through a minimal increase in overhead to end-users.
Keywords: game theory, privacy, social networks
To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles BIBAKFull-Text 531-540
  Elena Zheleva; Lise Getoor
In order to address privacy concerns, many social media websites allow users to hide their personal profiles from the public. In this work, we show how an adversary can exploit an online social network with a mixture of public and private user profiles to predict the private attributes of users. We map this problem to a relational classification problem and we propose practical models that use friendship and group membership information (which is often not hidden) to infer sensitive attributes. The key novel idea is that in addition to friendship links, groups can be carriers of significant information. We show that on several well-known social media sites, we can easily and accurately recover the information of private-profile users. To the best of our knowledge, this is the first work that uses link-based and group-based classification to study privacy implications in social networks with mixed public and private user profiles.
Keywords: attribute inference, groups, privacy, social networks
Privacy diffusion on the web: a longitudinal perspective BIBAKFull-Text 541-550
  Balachander Krishnamurthy; Craig Wills
For the last few years we have been studying the diffusion of private information for users as they visit various Web sites triggering data gathering aggregation by third parties. This paper reports on our longitudinal study consisting of multiple snapshots of our examination of such diffusion over four years. We examine the various technical ways by which third-party aggregators acquire data and the depth of user-related information acquired. We study techniques for protecting privacy diffusion as well as limitations of such techniques. We introduce the concept of secondary privacy damage. Our results show increasing aggregation of user-related data by a steadily decreasing number of entities. A handful of companies are able to track users' movement across almost all of the popular Web sites. Virtually all the protection techniques have significant limitations highlighting the seriousness of the problem and the need for alternate solutions.
Keywords: privacy, privacy enhancing technologies

Security and privacy/session: web security

All your contacts are belong to us: automated identity theft attacks on social networks BIBAKFull-Text 551-560
  Leyla Bilge; Thorsten Strufe; Davide Balzarotti; Engin Kirda
Social networking sites have been increasingly gaining popularity. Well-known sites such as Facebook have been reporting growth rates as high as 3% per week. Many social networking sites have millions of registered users who use these sites to share photographs, contact long-lost friends, establish new business contacts and to keep in touch. In this paper, we investigate how easy it would be for a potential attacker to launch automated crawling and identity theft attacks against a number of popular social networking sites in order to gain access to a large volume of personal user information. The first attack we present is the automated identity theft of existing user profiles and sending of friend requests to the contacts of the cloned victim. The hope, from the attacker's point of view, is that the contacted users simply trust and accept the friend request. By establishing a friendship relationship with the contacts of a victim, the attacker is able to access the sensitive personal information provided by them. In the second, more advanced attack we present, we show that it is effective and feasible to launch an automated, cross-site profile cloning attack. In this attack, we are able to automatically create a forged profile in a network where the victim is not registered yet and contact the victim's friends who are registered on both networks. Our experimental results with real users show that the automated attacks we present are effective and feasible in practice.
Keywords: identity theft, social network security
Using static analysis for Ajax intrusion detection BIBAKFull-Text 561-570
  Arjun Guha; Shriram Krishnamurthi; Trevor Jim
We present a static control-flow analysis for JavaScript programs running in a web browser. Our analysis tackles numerous challenges posed by modern web applications including asynchronous communication, frameworks, and dynamic code generation. We use our analysis to extract a model of expected client behavior as seen from the server, and build an intrusion-prevention proxy for the server: the proxy intercepts client requests and disables those that do not meet the expected behavior. We insert random asynchronous requests to foil mimicry attacks. Finally, we evaluate our technique against several real applications and show that it protects against an attack in a widely-used web application.
Keywords: Ajax, control-flow analysis, intrusion detection, javascript
A hybrid phish detection approach by identity discovery and keywords retrieval BIBAKFull-Text 571-580
  Guang Xiang; Jason I. Hong
Phishing is a significant security threat to the Internet, which causes tremendous economic loss every year. In this paper, we proposed a novel hybrid phish detection method based on information extraction (IE) and information retrieval (IR) techniques. The identity-based component of our method detects phishing webpages by directly discovering the inconsistency between their identity and the identity they are imitating. The keywords-retrieval component utilizes IR algorithms exploiting the power of search engines to identify phish. Our method requires no training data, no prior knowledge of phishing signatures and specific implementations, and thus is able to adapt quickly to constantly appearing new phishing patterns. Comprehensive experiments over a diverse spectrum of data sources with 11449 pages show that both components have a low false positive rate and the stacked approach achieves a true positive rate of 90.06% with a false positive rate of 1.95%.
Keywords: anti-phishing, information retrieval, named entity recognition

Semantic/data web/session: semantic data management

Rapid prototyping of semantic mash-ups through semantic web pipes BIBAKFull-Text 581-590
  Danh Le-Phuoc; Axel Polleres; Manfred Hauswirth; Giovanni Tummarello; Christian Morbidoni
The use of RDF data published on the Web for applications is still a cumbersome and resource-intensive task due to the limited software support and the lack of standard programming paradigms to deal with everyday problems such as combination of RDF data from different sources, object identifier consolidation, ontology alignment and mediation, or plain querying and filtering tasks. In this paper we present a framework, Semantic Web Pipes, that supports fast implementation of Semantic data mash-ups while preserving desirable properties such as abstraction, encapsulation, component-orientation, code re-usability and maintainability which are common and well supported in other application areas.
Keywords: RDF, mash-up, pipes, semantic web
idMesh: graph-based disambiguation of linked data BIBAKFull-Text 591-600
  Philippe Cudré-Mauroux; Parisa Haghani; Michael Jost; Karl Aberer; Hermann De Meer
We tackle the problem of disambiguating entities on the Web. We propose a user-driven scheme where graphs of entities -- represented by globally identifiable declarative artifacts -- self-organize in a dynamic and probabilistic manner. Our solution has the following two desirable properties: i) it lets end-users freely define associations between arbitrary entities and ii) it probabilistically infers entity relationships based on uncertain links using constraint-satisfaction mechanisms. We outline the interface between our scheme and the current data Web, and show how higher-layer applications can take advantage of our approach to enhance search and update of information relating to online entities. We describe a decentralized infrastructure supporting efficient and scalable entity disambiguation and demonstrate the practicability of our approach in a deployment over several hundreds of machines.
Keywords: emergent semantics, entity disambiguation, linked data, peer data management
OpenRuleBench: an analysis of the performance of rule engines BIBAKFull-Text 601-610
  Senlin Liang; Paul Fodor; Hui Wan; Michael Kifer
The Semantic Web initiative has led to an upsurge of the interest in rules as a general and powerful way of processing, combining, and analyzing semantic information. Since several of the technologies underlying rule-based systems are already quite mature, it is important to understand how such systems might perform on the Web scale. OpenRuleBench is a suite of benchmarks for analyzing the performance and scalability of different rule engines. Currently the study spans five different technologies and eleven systems, but OpenRuleBench is an open community resource, and contributions from the community are welcome. In this paper, we describe the tested systems and technologies, the methodology used in testing, and analyze the results.
Keywords: benchmark, openrulebench, rule systems, semantic web

Semantic/data web/session: linked data

Large scale integration of senses for the semantic web BIBAKFull-Text 611-620
  Jorge Gracia; Mathieu d'Aquin; Eduardo Mena
Nowadays, the increasing amount of semantic data available on the Web leads to a new stage in the potential of Semantic Web applications. However, it also introduces new issues due to the heterogeneity of the available semantic resources. One of the most remarkable is redundancy, that is, the excess of different semantic descriptions, coming from different sources, to describe the same intended meaning.
   In this paper, we propose a technique to perform a large scale integration of senses (expressed as ontology terms), in order to cluster the most similar ones, when indexing large amounts of online semantic information. It can dramatically reduce the redundancy problem on the current Semantic Web. In order to make this objective feasible, we have studied the adaptability and scalability of our previous work on sense integration, to be translated to the much larger scenario of the Semantic Web. Our evaluation shows a good behaviour of these techniques when used in large scale experiments, then making feasible the proposed approach.
Keywords: ontologies, scalable sense integration, semantic web
Triplify: light-weight linked data publication from relational databases BIBAKFull-Text 621-630
  Sören Auer; Sebastian Dietzold; Jens Lehmann; Sebastian Hellmann; David Aumueller
In this paper we present Triplify -- a simplistic but effective approach to publish Linked Data from relational databases. Triplify is based on mapping HTTP-URI requests onto relational database queries. Triplify transforms the resulting relations into RDF statements and publishes the data on the Web in various RDF serializations, in particular as Linked Data. The rationale for developing Triplify is that the largest part of information on the Web is already stored in structured form, often as data contained in relational databases, but usually published by Web applications only as HTML mixing structure, layout and content. In order to reveal the pure structured information behind the current Web, we have implemented Triplify as a light-weight software component, which can be easily integrated into and deployed by the numerous, widely installed Web applications. Our approach includes a method for publishing update logs to enable incremental crawling of linked data sources. Triplify is complemented by a library of configurations for common relational schemata and a REST-enabled data source registry. Triplify configurations containing mappings are provided for many popular Web applications, including osCommerce, WordPress, Drupal, Gallery, and phpBB. We will show that despite its light-weight architecture Triplify is usable to publish very large datasets, such as 160GB of geo data from the OpenStreetMap project.
Keywords: data web, databases, geo data, linked data, rdf, semantic web, sql, web application
SOFIE: a self-organizing framework for information extraction BIBAKFull-Text 631-640
  Fabian M. Suchanek; Mauro Sozio; Gerhard Weikum
This paper presents SOFIE, a system for automated ontology extension. SOFIE can parse natural language documents, extract ontological facts from them and link the facts into an ontology. SOFIE uses logical reasoning on the existing knowledge and on the new knowledge in order to disambiguate words to their most probable meaning, to reason on the meaning of text patterns and to take into account world knowledge axioms. This allows SOFIE to check the plausibility of hypotheses and to avoid inconsistencies with the ontology. The framework of SOFIE unites the paradigms of pattern matching, word sense disambiguation and ontological reasoning in one unified model. Our experiments show that SOFIE delivers high-quality output, even from unstructured Internet documents.
Keywords: automated reasoning, information extraction, ontology

Semantic/data web/session: mining for semantics

Evaluating similarity measures for emergent semantics of social tagging BIBAKFull-Text 641-650
  Benjamin Markines; Ciro Cattuto; Filippo Menczer; Dominik Benz; Andreas Hotho; Gerd Stumme
Social bookmarking systems are becoming increasingly important data sources for bootstrapping and maintaining Semantic Web applications. Their emergent information structures have become known as folksonomies. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as community detection, navigation support, semantic search, user profiling and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures, which are derived from several established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity between tags and between resources and consider different methods to aggregate annotations across users. After comparing the ability of several tag similarity measures to predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory Project. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.
Keywords: ontology learning, semantic grounding, social similarity, web 2.0
Measuring the similarity between implicit semantic relations from the web BIBAKFull-Text 651-660
  Danushka T. Bollegala; Yutaka Matsuo; Mitsuru Ishizuka
Measuring the similarity between semantic relations that hold among entities is an important and necessary step in various Web related tasks such as relation extraction, information retrieval and analogy detection. For example, consider the case in which a person knows a pair of entities (e.g. Google, YouTube), between which a particular relation holds (e.g. acquisition). The person is interested in retrieving other such pairs with similar relations (e.g. Microsoft, Powerset). Existing keyword-based search engines cannot be applied directly in this case because, in keyword-based search, the goal is to retrieve documents that are relevant to the words used in a query -- not necessarily to the relations implied by a pair of words. We propose a relational similarity measure, using a Web search engine, to compute the similarity between semantic relations implied by two pairs of words. Our method has three components: representing the various semantic relations that exist between a pair of words using automatically extracted lexical patterns, clustering the extracted lexical patterns to identify the different patterns that express a particular semantic relation, and measuring the similarity between semantic relations using a metric learning approach. We evaluate the proposed method in two tasks: classifying semantic relations between named entities, and solving word-analogy questions. The proposed method outperforms all baselines in a relation classification task with a statistically significant average precision score of 0.74. Moreover, it reduces the time taken by Latent Relational Analysis to process 374 word-analogy questions from 9 days to less than 6 hours, with an SAT score of 51%.
Keywords: natural language processing, relational similarity, web mining
Extracting key terms from noisy and multitheme documents BIBAKFull-Text 661-670
  Maria Grineva; Maxim Grinev; Dmitry Lizorkin
We present a novel method for key term extraction from text documents. In our method, document is modeled as a graph of semantic relationships between terms of that document. We exploit the following remarkable feature of the graph: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs or communities, while non-important terms fall into weakly interconnected communities, or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select groups that contain key terms discarding groups with unimportant terms. To weight terms and determine semantic relatedness between them we exploit information extracted from Wikipedia.
   Using such an approach gives us the following two advantages. First, it allows effectively processing multi-theme documents. Second, it is good at filtering out noise information in the document, such as, for example, navigational bars or headers in web pages.
   Evaluations of the method show that it outperforms existing methods producing key terms with higher precision and recall. Additional experiments on web pages prove that our method appears to be substantially more effective on noisy and multi-theme documents than existing methods.
Keywords: community detection, graph analysis, keywords extraction, semantic similarity, wikipedia

Social networks and web 2.0/session: recommender systems

Tagommenders: connecting users to items through tags BIBAKFull-Text 671-680
  Shilad Sen; Jesse Vig; John Riedl
Tagging has emerged as a powerful mechanism that enables users to find, organize, and understand online entities. Recommender systems similarly enable users to efficiently navigate vast collections of items. Algorithms combining tags with recommenders may deliver both the automation inherent in recommenders, and the flexibility and conceptual comprehensibility inherent in tagging systems. In this paper we explore tagommenders, recommender algorithms that predict users' preferences for items based on their inferred preferences for tags. We describe tag preference inference algorithms based on users' interactions with tags and movies, and evaluate these algorithms based on tag preference ratings collected from 995 MovieLens users. We design and evaluate algorithms that predict users' ratings for movies based on their inferred tag preferences. Our tag-based algorithms generate better recommendation rankings than state-of-the-art algorithms, and they may lead to flexible recommender systems that leverage the characteristics of items users find most important.
Keywords: collaborative filtering, recommender systems, tagging
Collaborative filtering for orkut communities: discovery of user latent behavior BIBAKFull-Text 681-690
  Wen-Yen Chen; Jon-Chyuan Chu; Junyi Luan; Hongjie Bai; Yi Wang; Edward Y. Chang
Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users have even greater need for effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. First is association rule mining (ARM), which discovers associations between sets of communities that are shared across many users. Second is latent Dirichlet allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling low-rank latent structure is more effective for recommendations than directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. Our empirical comparisons using the top-k recommendations metric show that LDA performs consistently better than ARM for the community recommendation task when recommending a list of 4 or more communities. However, for recommendation lists of up to 3 communities, ARM is still a bit better. We analyze examples of the latent information learned by LDA to explain this finding. To efficiently handle the large-scale data set, we parallelize LDA on distributed computers and demonstrate our parallel implementation's scalability with varying numbers of machines.
Keywords: association rule mining, collaborative filtering, data mining, latent topic models, recommender systems
Personalized recommendation on dynamic content using predictive bilinear models BIBAKFull-Text 691-700
  Wei Chu; Seung-Taek Park
In Web-based services of dynamic content (such as news articles), recommender systems face the difficulty of timely identifying new items of high-quality and providing recommendations for new users. We propose a feature-based machine learning approach to personalized recommendation that is capable of handling the cold-start issue effectively. We maintain profiles of content of interest, in which temporal characteristics of the content, e.g. popularity and freshness, are updated in real-time manner. We also maintain profiles of users including demographic information and a summary of user activities within Yahoo! properties. Based on all features in user and content profiles, we develop predictive bilinear regression models to provide accurate personalized recommendations of new items for both existing and new users. This approach results in an offline model with light computational overhead compared with other recommender systems that require online re-training. The proposed framework is general and flexible for other personalized tasks. The superior performance of our approach is verified on a large-scale data set collected from the Today-Module on Yahoo! Front Page, with comparison against six competitive approaches.
Keywords: bilinear models, dynamic features, personalization, ranking, recommender systems, regression, user and content profile

Social networks and web 2.0/session: diffusion and search in social networks

Social search in "Small-World" experiments BIBAKFull-Text 701-710
  Sharad Goel; Roby Muhamad; Duncan Watts
The "algorithmic small-world hypothesis" states that not only are pairs of individuals in a large social network connected by short paths, but that ordinary individuals can find these paths. Although theoretically plausible, empirical evidence for the hypothesis is limited, as most chains in "small-world" experiments fail to complete, thereby biasing estimates of "true" chain lengths. Using data from two recent small-world experiments, comprising a total of 162,328 message chains, and directed at one of 30 "targets" spread across 19 countries, we model heterogeneity in chain attrition rates as a function of individual attributes. We then introduce a rigorous way of estimating true chain lengths that is provably unbiased, and can account for empirically-observed variation in attrition rates. Our findings provide mixed support for the algorithmic hypothesis. On the one hand, it appears that roughly half of all chains can be completed in 6-7 steps -- thus supporting the "six degrees of separation" assertion -- but on the other hand, estimates of the mean are much longer, suggesting that for at least some of the population, the world is not "small" in the algorithmic sense. We conclude that search distances in social networks are fundamentally different from topological distances, for which the mean and median of the shortest path lengths between nodes tend to be similar.
Keywords: attrition, small-world experiment, social search
Behavioral profiles for advanced email features BIBAKFull-Text 711-720
  Thomas Karagiannis; Milan Vojnovic
We examine the behavioral patterns of email usage in a large-scale enterprise over a three-month period. In particular, we focus on two main questions: (Q1) what do replies depend on? and (Q2) what is the gain of augmenting contacts through the friends of friends from the email social graph? For Q1, we identify and evaluate the significance of several factors that affect the reply probability and the email response time. We find that all factors of our considered set are significant, provide their relative ordering, and identify the recipient list size, and the intensity of email communication between the correspondents as the dominant factors. We highlight various novel threshold behaviors and provide support for existing hypotheses such as that of the least-effort reply. For Q2, we find that the number of new contacts extracted from the friends-of-friends relationships amounts to a large number, but which is still a limited portion of the total enterprise size. We believe that our results provide significant insights towards informed design of advanced email features, including those of social-networking type.
Keywords: email profiles, reply probability, reply time
A measurement-driven analysis of information propagation in the flickr social network BIBAKFull-Text 721-730
  Meeyoung Cha; Alan Mislove; Krishna P. Gummadi
Online social networking sites like MySpace, Facebook, and Flickr have become a popular way to share and disseminate content. Their massive popularity has led to viral marketing techniques that attempt to spread content, products, and ideas on these sites. However, there is little data publicly available on viral propagation in the real world and few studies have characterized how information spreads over current online social networks.
   In this paper, we collect and analyze large-scale traces of information dissemination in the Flickr social network. Our analysis, based on crawls of the favorite markings of 2.5 million users on 11 million photos, aims at answering three key questions: (a) how widely does information propagate in the social network? (b) how quickly does information propagate? and (c) what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network? Contrary to viral marketing "intuition," we find that (a) even popular photos do not spread widely throughout the network, (b) even popular photos spread slowly through the network, and (c) information exchanged between friends is likely to account for over 50 of all favorite-markings, but with a significant delay at each hop.
Keywords: cascades, flickr, information dissemination, social networks, viral marketing

Social networks and web 2.0/session: interactions in social communities

Network analysis of collaboration structure in Wikipedia BIBAKFull-Text 731-740
  Ulrik Brandes; Patrick Kenis; Jürgen Lerner; Denise van Raaij
In this paper we give models and algorithms to describe and analyze the collaboration among authors of Wikipedia from a network analytical perspective. The edit network encodes who interacts how with whom when editing an article; it significantly extends previous network models that code author communities in Wikipedia. Several characteristics summarizing some aspects of the organization process and allowing the analyst to identify certain types of authors can be obtained from the edit network. Moreover, we propose several indicators characterizing the global network structure and methods to visualize edit networks. It is shown that the structural network indicators are correlated with quality labels of the associated Wikipedia articles.
Keywords: edit network, governance, network visualization, social network analysis, wikipedia
The slashdot zoo: mining a social network with negative edges BIBAKFull-Text 741-750
  Jérôme Kunegis; Andreas Lommatzsch; Christian Bauckhage
We analyse the corpus of user relationships of the Slashdot technology news site. The data was collected from the Slashdot Zoo feature where users of the website can tag other users as friends and foes, providing positive and negative endorsements. We adapt social network analysis techniques to the problem of negative edge weights. In particular, we consider signed variants of global network characteristics such as the clustering coefficient, node-level characteristics such as centrality and popularity measures, and link-level characteristics such as distances and similarity measures. We evaluate these measures on the task of identifying unpopular users, as well as on the task of predicting the sign of links and show that the network exhibits multiplicative transitivity which allows algebraic methods based on matrix multiplication to be used. We compare our methods to traditional methods which are only suitable for positively weighted edges.
Keywords: link prediction, negative edge, slashdot zoo, social network
Community gravity: measuring bidirectional effects by trust and rating on online social networks BIBAKFull-Text 751-760
  Yutaka Matsuo; Hikaru Yamamoto
Several attempts have been made to analyze customer behavior on online E-commerce sites. Some studies particularly emphasize the social networks of customers. Users' reviews and ratings of a product exert effects on other consumers' purchasing behavior. Whether a user refers to other users' ratings depends on the trust accorded by a user to the reviewer. On the other hand, the trust that is felt by a user for another user correlates with the similarity of two users' ratings. This bidirectional interaction that involves trust and rating is an important aspect of understanding consumer behavior in online communities because it suggests clustering of similar users and the evolution of strong communities. This paper presents a theoretical model along with analyses of an actual online E-commerce site. We analyzed a large community site in Japan: @cosme. The noteworthy characteristics of @cosme are that users can bookmark their trusted users; in addition, they can post their own ratings of products, which facilitates our analyses of the ratings' bidirectional effects on trust and ratings. We describe an overview of the data in @cosme, analyses of effects from trust to rating and vice versa, and our proposition of a measure of community gravity, which measures how strongly a user might be attracted to a community. Our study is based on the @cosme dataset in addition to the Epinions dataset. It elucidates important insights and proposes a potentially important measure for mining online social networks.
Keywords: online community, rating, social networks, trust

Social networks and web 2.0/session: photos and web 2.0

Mapping the world's photos BIBAKFull-Text 761-770
  David J. Crandall; Lars Backstrom; Daniel Huttenlocher; Jon Kleinberg
We investigate how to organize a large collection of geotagged photos, working with a dataset of about 35 million images collected from Flickr. Our approach combines content analysis based on text tags and image data with structural analysis based on geospatial data. We use the spatial distribution of where people take photos to define a relational structure between the photos that are taken at popular places. We then study the interplay between this structure and the content, using classification methods for predicting such locations from visual, textual and temporal features of the photos. We find that visual and temporal features improve the ability to estimate the location of a photo, compared to using just textual features. We illustrate using these techniques to organize a large photo collection, while also revealing various interesting properties about popular cities and landmarks at a global scale.
Keywords: geolocation, photo collections
Ranking and classifying attractiveness of photos in folksonomies BIBAKFull-Text 771-780
  Jose San Pedro; Stefan Siersdorfer
Web 2.0 applications like Flickr, YouTube, or Del.icio.us are increasingly popular online communities for creating, editing and sharing content. The growing size of these folksonomies poses new challenges in terms of search and data mining. In this paper we introduce a novel methodology for automatically ranking and classifying photos according to their attractiveness for folksonomy members. To this end, we exploit image features known for having significant effects on the visual quality perceived by humans (e.g. sharpness and colorfulness) as well as textual meta data, in what is a multi-modal approach. Using feedback and annotations available in the Web 2.0 photo sharing system Flickr, we assign relevance values to the photos and train classification and regression models based on these relevance assignments. With the resulting machine learning models we categorize and rank photos according to their attractiveness. Applications include enhanced ranking functions for search and recommender methods for attractive content. Large scale experiments on a collection of Flickr photos demonstrate the viability of our approach.
Keywords: attractiveness features, classification, folksonomy feedback, image analysis, photo appeal, ranking, web 2.0
Constructing folksonomies from user-specified relations on flickr BIBAKFull-Text 781-790
  Anon Plangprasopchok; Kristina Lerman
Automatic folksonomy construction from tags has attracted much attention recently. However, inferring hierarchical relations between concepts from tags has a drawback in that it is difficult to distinguish between more popular and more general concepts. Instead of tags we propose to use user-specified relations for learning folksonomy. We explore two statistical frameworks for aggregating many shallow individual hierarchies, expressed through the collection/set relations on the social photosharing site Flickr, into a common deeper folksonomy that reflects how a community organizes knowledge. Our approach addresses a number of challenges that arise while aggregating information from diverse users, namely noisy vocabulary, and variations in the granularity level of the concepts expressed. Our second contribution is a method for automatically evaluating learned folksonomy by comparing it to a reference taxonomy, e.g., the Web directory created by the Open Directory Project. Our empirical results suggest that user-specified relations are a good source of evidence for learning folksonomies.
Keywords: collective knowledge, data mining, folksonomies, social information processing, taxonomies

User interfaces and mobile web/session: mobile web

Mining interesting locations and travel sequences from GPS trajectories BIBAKFull-Text 791-800
  Yu Zheng; Lizhu Zhang; Xing Xie; Wei-Ying Ma
The increasing availability of GPS-enabled devices is changing the way people interact with the Web, and brings us a large amount of GPS trajectories representing people's location histories. In this paper, based on multiple users' GPS trajectories, we aim to mine interesting locations and classical travel sequences in a given geospatial region. Here, interesting locations mean the culturally important places, such as Tiananmen Square in Beijing, and frequented public areas, like shopping malls and restaurants, etc. Such information can help users understand surrounding locations, and would enable travel recommendation. In this work, we first model multiple individuals' location histories with a tree-based hierarchical graph (TBHG). Second, based on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based inference model, which regards an individual's access on a location as a directed link from the user to that location. This model infers the interest of a location by taking into account the following three factors. 1) The interest of a location depends on not only the number of users visiting this location but also these users' travel experiences. 2) Users' travel experiences and location interests have a mutual reinforcement relationship. 3) The interest of a location and the travel experience of a user are relative values and are region-related. Third, we mine the classical travel sequences among locations considering the interests of these locations and users' travel experiences. We evaluated our system using a large GPS dataset collected by 107 users over a period of one year in the real world. As a result, our HITS-based inference model outperformed baseline approaches like rank-by-count and rank-by-frequency. Meanwhile, when considering the users' travel experiences and location interests, we achieved a better performance beyond baselines, such as rank-by-count and rank-by-interest, etc.
Keywords: GPS trajectories, location recommendation, spatial data mining, user travel experience
Computers and iphones and mobile phones, oh my!: a logs-based comparison of search users on different devices BIBAKFull-Text 801-810
  Maryam Kamvar; Melanie Kellar; Rajan Patel; Ya Xu
We present a logs-based comparison of search patterns across three platforms: computers, iPhones and conventional mobile phones. Our goal is to understand how mobile search users differ from computer-based search users, and we focus heavily on the distribution and variability of tasks that users perform from each platform. The results suggest that search usage is much more focused for the average mobile user than for the average computer-based user. However, search behavior on high-end phones resembles computer-based search behavior more so than mobile search behavior. A wide variety of implications follow from these findings. First, there is no single search interface which is suitable for all mobile phones. We suggest that for the higher-end phones, a close integration with the standard computer-based interface (in terms of personalization and available feature set) would be beneficial for the user, since these phones seem to be treated as an extension of the users' computer. For all other phones, there is a huge opportunity for personalizing the search experience for the user's "mobile needs", as these users are likely to repeatedly search for a single type of information need on their phone.
Keywords: google, iphone, mobile search, search, user behavior
A game based approach to assign geographical relevance to web images BIBAKFull-Text 811-820
  Yuki Arase; Xing Xie; Manni Duan; Takahiro Hara; Shojiro Nishio
Geographical context is very important for images. Millions of images on the Web have been already assigned latitude and longitude information. Due to the rapid proliferation of such images with geographical context, it is still difficult to effectively search and browse them, since we do not have ways to decide their relevance. In this paper, we focus on the geographical relevance of images, which is defined as to what extent the main objects in an image match landmarks at the location where the image was taken. Recently, researchers have proposed to use game based approaches to label large scale data such as Web images. However, previous works have not shown the quality of collected game logs in detail and how the logs can improve existing applications. To answer these questions, we design and implement a Web-based and multi-player game to collect human knowledge while people are enjoying the game. Then we thoroughly analyze the game logs obtained during a three week study with 147 participants and propose methods to determine the image geographical relevance. In addition, we conduct an experiment to compare our methods with a commercial search engine. Experimental results show that our methods dramatically improve image search relevance. Furthermore, we show that we can derive geographically relevant objects and their salient portion in images, which is valuable for a number of applications such as image location recognition.
Keywords: geographical relevance, human computation, image annotation, image search

User interfaces and mobile web/session: user interfaces

Web 2.0: blind to an accessible new world BIBAKFull-Text 821-830
  Joshua Hailpern; Loretta Guarino-Reid; Richard Boardman; Srinivas Annam
With the advent of Web 2.0 technologies, websites have evolved from static pages to dynamic, interactive Web-based applications with the ability to replicate common desktop functionality. However, for blind and visually impaired individuals who rely upon screen readers, Web 2.0 applications force them to adapt to an inaccessible use model. Many technologies, including WAI-ARIA, AJAX, and improved screen reader support, are rapidly evolving to improve this situation. However, simply combining them does not solve the problems of screen reader users. The main contributions of this paper are two models of interaction for screen reader users, for both traditional websites and Web 2.0 applications. Further contributions are a discussion of accessibility difficulties screen reader users encounter when interacting with Web 2.0 applications, a user workflow design model for improving Web 2.0 accessibility, and a set of design requirements for developers to ease the user's burden and increase accessibility. These models, accessibility difficulties, and design implications are based directly on responses and lessons learned from usability research focusing on Web 2.0 usage and screen reader users. Without the conscious effort of Web engineers and designers, most blind and visually impaired users will shy away from using new Web 2.0 technology in favor of desktop based applications.
Keywords: blind, screen reader, user models, visually impaired, web 2.0
Scrolling behaviour with single- and multi-column layout BIBAKFull-Text 831-840
  Cameron Braganza; Kim Marriott; Peter Moulder; Michael Wybrow; Tim Dwyer
The standard layout model used by web browsers is to lay text out in a vertical scroll using a single column. The horizontal-scroll layout model -- in which text is laid out in columns whose height is set to that of the browser window and the viewer scrolls horizontally -- seems well-suited to multi-column layout on electronic devices. We describe a study that examines how people read and, in particular, the strategies they use for scrolling with these two models when reading large textual documents on a standard computer monitor. We compare usability of the models and evaluate both user preferences and the effect of the model on performance. Also interesting is the description of the browser and its user interface which we used for the study.
Keywords: multi-column layout, reading, scrolling, web-browser
What's up CAPTCHA?: a CAPTCHA based on image orientation BIBAKFull-Text 841-850
  Rich Gossweiler; Maryam Kamvar; Shumeet Baluja
We present a new CAPTCHA which is based on identifying an image's upright orientation. This task requires analysis of the often complex contents of an image, a task which humans usually perform well and machines generally do not. Given a large repository of images, such as those from a web search result, we use a suite of automated orientation detectors to prune those images that can be automatically set upright easily. We then apply a social feedback mechanism to verify that the remaining images have a human-recognizable upright orientation. The main advantages of our CAPTCHA technique over the traditional text recognition techniques are that it is language-independent, does not require text-entry (e.g. for a mobile device), and employs another domain for CAPTCHA generation beyond character obfuscation. This CAPTCHA lends itself to rapid implementation and has an almost limitless supply of images. We conducted extensive experiments to measure the viability of this technique.
Keywords: CAPTCHA, automated attacks, image processing, orientation detection, spam, visual processing

Web engineering/session: end user web engineering

Rapid development of spreadsheet-based web mashups BIBAKFull-Text 851-860
  Woralak Kongdenfha; Boualem Benatallah; Julien Vayssière; Régis Saint-Paul; Fabio Casati
The rapid growth of social networking sites and web communities have motivated web sites to expose their APIs to external developers who create mashups by assembling existing functionalities. Current APIs, however, aim toward developers with programming expertise; they are not directly usable by wider class of users who do not have programming background, but would nevertheless like to build their own mashups. To address this need, we propose a spreadsheet-based Web mashups development framework, which enables users to develop mashups in the popular spreadsheet environment. First, we provide a mechanism that makes structured data first class values of spreadsheet cells. Second, we propose a new component model that can be used to develop fairly sophisticated mashups, involving joining data sources and keeping spreadsheet data up to date. Third, to simplify mashup development, we provide a collection of spreadsheet-based mashup patterns that captures common Web data access and spreadsheet presentation functionalities. Users can reuse and customize these patterns to build spreadsheet-based Web mashups instead of developing them from scratch. Fourth, we enable users to manipulate structured data presented on spreadsheet in a drag-and-drop fashion. Finally, we have developed and tested a proof-of-concept prototype to demonstrate the utility of the proposed framework.
Keywords: component model, spreadsheet-based mashup patterns, spreadsheets, web data mashups
Mashroom: end-user mashup programming using nested tables BIBAKFull-Text 861-870
  Guiling Wang; Shaohua Yang; Yanbo Han
This paper presents an end-user-oriented programming environment called Mashroom. Major contributions herein include an end-user programming model with an expressive data structure as well as a set of formally-defined mashup operators. The data structure takes advantage of nested table, and maintains the intuitiveness while allowing users to express complex data objects. The mashup operators are visualized with contextual menu and formula bar and can be directly applied on the data. Experiments and case studies reveal that end users have little difficulty in effectively and efficiently using Mashroom to build mashup applications.
Keywords: end-user programming, mashup, nested table, spreadsheet
Automated construction of web accessibility models from transaction click-streams BIBAKFull-Text 871-880
  Jalal Mahmud; Yevgen Borodin; I. V. Ramakrishnan; C. R. Ramakrishnan
Screen readers, the dominant assistive technology used by visually impaired people to access the Web, function by speaking out the content of the screen serially. Using screen readers for conducting online transactions can cause considerable information overload, because transactions, such as shopping and paying bills, typically involve a number of steps spanning several web pages. One can combat this overload by using a transaction model for web accessibility that presents only fragments of web pages that are needed for doing transactions. We can realize such a model by coupling a process automaton, encoding states of a transaction, with concept classifiers that identify page fragments "relevant" to a particular state of the transaction. In this paper we present a fully automated process that synergistically combines several techniques for transforming unlabeled click-stream data generated by transactions into a transactionmodel. These techniques include web content analysis to partition a web page into segments consisting of semantically related content, contextual analysis of data surrounding clickable objects in a page, and machine learning methods, such as clustering of page segments based on contextual analysis, statistical classification, and automata learning. The use of unlabeled click streams in building transaction models has important benefits: (i) visually impaired users do not have to depend on sighted users for creating manually labeled training data to construct the models; (ii) it is possible to mine personalized models from unlabeled transaction click-streams associated with sites that visually impaired users visit regularly; (iii) since unlabeled data is relatively easy to obtain, it is feasible to scale up the construction of domain-specific transaction models (e.g., separate models for shopping, airline reservations, bill payments, etc.); (iv) adjusting the performance of deployed models over timtime with new training data is also doable. We provide preliminary experimental evidence of the practical effectiveness of both domain-specific, as well as personalized accessibility transaction models built using our approach. Finally, this approach is applicable for building transaction models for mobile devices with limited-size displays, as well as for creating wrappers for information extraction from web sites.
Keywords: context, machine learning, process models, web transaction

Web engineering/session: service oriented development

Combining global optimization with local selection for efficient QoS-aware service composition BIBAKFull-Text 881-890
  Mohammad Alrifai; Thomas Risse
The run-time binding of web services has been recently put forward in order to support rapid and dynamic web service compositions. With the growing number of alternative web services that provide the same functionality but differ in quality parameters, the service composition becomes a decision problem on which component services should be selected such that user's end-to-end QoS requirements (e.g. availability, response time) and preferences (e.g. price) are satisfied. Although very efficient, local selection strategy fails short in handling global QoS requirements. Solutions based on global optimization, on the other hand, can handle global constraints, but their poor performance renders them inappropriate for applications with dynamic and real-time requirements. In this paper we address this problem and propose a solution that combines global optimization with local selection techniques to benefit from the advantages of both worlds. The proposed solution consists of two steps: first, we use mixed integer programming (MIP) to find the optimal decomposition of global QoS constraints into local constraints. Second, we use distributed local selection to find the best web services that satisfy these local constraints. The results of experimental evaluation indicate that our approach significantly outperforms existing solutions in terms of computation time while achieving close-to-optimal results.
Keywords: QoS, optimization, service composition, web services
A trust management framework for service-oriented environments BIBAKFull-Text 891-900
  William Conner; Arun Iyengar; Thomas Mikalsen; Isabelle Rouvellou; Klara Nahrstedt
Many reputation management systems have been developed under the assumption that each entity in the system will use a variant of the same scoring function. Much of the previous work in reputation management has focused on providing robustness and improving performance for a given reputation scheme. In this paper, we present a reputation-based trust management framework that supports the synthesis of trust-related feedback from many different entities while also providing each entity with the flexibility to apply different scoring functions over the same feedback data for customized trust evaluations. We also propose a novel scheme to cache trust values based on recent client activity. To evaluate our approach, we implemented our trust management service and tested it on a realistic application scenario in both LAN and WAN distributed environments. Our results indicate that our trust management service can effectively support multiple scoring functions with low overhead and high availability.
Keywords: reputation, service-oriented architectures, trust management
Test case prioritization for regression testing of service-oriented business applications BIBAKFull-Text 901-910
  Lijun Mei; Zhenyu Zhang; W. K. Chan; T. H. Tse
Regression testing assures the quality of modified service-oriented business applications against unintended changes. However, a typical regression test suite is large in size. Earlier execution of those test cases that may detect failures is attractive. Many existing prioritization techniques order test cases according to their respective coverage of program statements in a previous version of the application. On the other hand, industrial service-oriented business applications are typically written in orchestration languages such as WS-BPEL and integrated with workflow steps and web services via XPath and WSDL. Faults in these artifacts may cause the application to extract wrong data from messages, leading to failures in service compositions. Surprisingly, current regression testing research hardly considers these artifacts. We propose a multilevel coverage model to capture the business process, XPath, and WSDL from the perspective of regression testing. We develop a family of test case prioritization techniques atop the model. Empirical results show that our techniques can achieve significantly higher rates of fault detection than existing techniques.
Keywords: WSDL, XPath, service orientation, test case prioritization

Web engineering/session: web architecture aspect

Why is the web loosely coupled?: a multi-faceted metric for service design BIBAKFull-Text 911-920
  Cesare Pautasso; Erik Wilde
Loose coupling is often quoted as a desirable property of systems architectures. One of the main goals of building systems using Web technologies is to achieve loose coupling. However, given the lack of a widely accepted definition of this term, it becomes hard to use coupling as a criterion to evaluate alternative Web technology choices, as all options may exhibit, and claim to provide, some kind of "loose" coupling effects. This paper presents a systematic study of the degree of coupling found in service-oriented systems based on a multi-faceted approach. Thanks to the metric introduced in this paper, coupling is no longer a one-dimensional concept with loose coupling found somewhere in between tight coupling and no coupling. The paper shows how the metric can be applied to real-world examples in order to support and improve the design process of service-oriented systems.
Keywords: HTTP, RPC, SOA, loose coupling, rest, tight coupling, web services, ws-*
Highly scalable web applications with zero-copy data transfer BIBAKFull-Text 921-930
  Toyotaro Suzumura; Michiaki Tatsubori; Scott Trent; Akihiko Tozawa; Tamiya Onodera
The performance of server-side applications is becoming increasingly important as more applications exploit the Web application model. Extensive work has been done to improve the performance of individual software components such as Web servers and programming language runtimes. This paper describes a novel approach to boost Web application performance by improving inter-process communication between a programming language runtime and Web server runtime. The approach reduces redundant processing for memory copying and the context switch overhead between user space and kernel space by exploiting the zero-copy data transfer methodology, such as the sendfile system call. In order to transparently utilize this optimization feature with existing Web applications, we propose enhancements of the PHP runtime, FastCGI protocol, and Web server. Our proposed approach achieves a 126% performance improvement with micro-benchmarks and a 44% performance improvement for a standard Web benchmark, SPECweb2005.
Keywords: PHP, fastcgi, scripting, sendfile, web server, zero copy
REST-based management of loosely coupled services BIBAKFull-Text 931-940
  Heiko Ludwig; Jim Laredo; Kamal Bhattacharya; Liliana Pasquale; Bruno Wassermann
Applications increasingly make use of the distributed platform that the World Wide Web provides -- be it as a Software-as-a-Service such as salesforce.com, an application infrastructure such as facebook.com, or a computing infrastructure such as a "cloud". A common characteristic of applications of this kind is that they are deployed on infrastructure or make use of components that reside in different management domains. Current service management approaches and systems, however, often rely on a centrally managed configuration management database (CMDB), which is the basis for centrally orchestrated service management processes, in particular change management and incident management. The distribution of management responsibility of WWW based applications requires a decentralized approach to service management. This paper proposes an approach of decentralized service management based on distributed configuration management and service process co-ordination, making use RESTful access to configuration information and ATOM-based distribution of updates as a novel foundation for service management processes.
Keywords: discovery, loosely coupled systems, rest, service management

Web engineering/session: client side web engineering

Co-browsing dynamic web pages BIBAKFull-Text 941-950
  Dietwig Lowet; Daniel Goergen
Collaborative browsing, or co-browsing, is the co-navigation of the web with other people at-a-distance, supported by software that takes care of synchronizing the browsers. Current state-of-the-art solutions are able to do co-browsing of "static web pages", and do not support the synchronization of JavaScript interactions. However, currently many web pages use JavaScript and Ajax techniques to create highly dynamic and interactive web applications. In this paper, we describe two approaches for co-browsing that both support the synchronization of the JavaScript and Ajax interactions of dynamic web pages. One approach is based on synchronizing the output of the JavaScript engine by sending over the changes made on the DOM tree. The other approach is based on synchronizing the input of the JavaScript engine by synchronizing UI events and incoming data. Since the latter solution offers a better user experience and is more scalable, it is elaborated in more detail. An important aspect of both approaches is that they operate at the DOM level. Therefore, the client-side can be implemented in JavaScript and no browser extensions are required. To the best of the authors' knowledge this is the first DOM-level co-browsing solution that also enables co-browsing of the dynamic interaction parts of web pages. The presented co-browsing solution has been implemented in a research demonstrator which allows users to do co-browsing of web-applications on browser-based networked televisions.
Keywords: co-browsing, collaboration, collaborative computing, shared browsing, web4ce
HTML templates that fly: a template engine approach to automated offloading from server to client BIBAKFull-Text 951-960
  Michiaki Tatsubori; Toyotaro Suzumura
Web applications often use HTML templates to separate the webpage presentation from its underlying business logic and objects. This is now the de facto standard programming model for Web application development. This paper proposes a novel implementation for existing server-side template engines, FlyingTemplate, for (a) reduced bandwidth consumption in Web application servers, and (b) off-loading HTML generation tasks to Web clients. Instead of producing a fully-generated HTML page, the proposed template engine produces a skeletal script which includes only the dynamic values of the template parameters and the bootstrap code that runs on a Web browser at the client side. It retrieves a client-side template engine and the payload templates separately. With the goals of efficiency, implementation transparency, security, and standards compliance in mind, we developed FlyingTemplate with two design principles: effective browser cache usage, and reasonable compromises which restrict the template usage patterns and relax the security policies slightly but in a controllable way. This approach allows typical template-based Web applications to run effectively with FlyingTemplate. As an experiment, we tested the SPECweb2005 banking application using FlyingTemplate without any other modifications and saw throughput improvements from 1.6x to 2.0x in its best mode. In addition, FlyingTemplate can enforce compliance with a simple security policy, thus addressing the security problems of client-server partitioning in the Web environment.
Keywords: client-server partitioning, template engines, web applications
Characterizing insecure javascript practices on the web BIBAKFull-Text 961-970
  Chuan Yue; Haining Wang
JavaScript is an interpreted programming language most often used for enhancing webpage interactivity and functionality. It has powerful capabilities to interact with webpage documents and browser windows, however, it has also opened the door for many browser-based security attacks. Insecure engineering practices of using JavaScript may not directly lead to security breaches, but they can create new attack vectors and greatly increase the risks of browser-based attacks. In this paper, we present the first measurement study on insecure practices of using JavaScript on the Web. Our focus is on the insecure practices of JavaScript inclusion and dynamic generation, and we examine their severity and nature on 6,805 unique websites. Our measurement results reveal that insecure JavaScript practices are common at various websites: (1) at least 66.4% of the measured websites manifest the insecure practices of including JavaScript files from external domains into the top-level documents of their webpages; (2) over 44.4% of the measured websites use the dangerous eval() function to dynamically generate and execute JavaScript code on their webpages; and (3) in JavaScript dynamic generation, using the document.write() method and the innerHTML property is much more popular than using the relatively secure technique of creating script elements via DOM methods. Our analysis indicates that safe alternatives to these insecure practices exist in common cases and ought to be adopted by website developers and administrators for reducing potential security risks.
Keywords: AST tree matching, execution-based measurement, javascript, same origin policy, security, web engineering

XML and web data/session: XML extraction and crawling

Extracting article text from the web with maximum subsequence segmentation BIBAKFull-Text 971-980
  Jeff Pasternack; Dan Roth
Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1-score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top "text-only" score (based on Levenshtein distance), 87.8% versus 84.1%.
Keywords: maximum subsequence, page segmentation, text extraction
Extracting data records from the web using tag path clustering BIBAKFull-Text 981-990
  Gengxin Miao; Junichi Tatemura; Wang-Pin Hsiung; Arsany Sawires; Louise E. Moser
Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation -- their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
Keywords: clustering, data record extraction, information extraction
Sitemaps: above and beyond the crawl of duty BIBAKFull-Text 991-1000
  Uri Schonfeld; Narayanan Shivakumar
Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.
Keywords: crawling, metrics, quality, search engines, sitemaps

XML and web data/session: XML querying

Performing grouping and aggregate functions in XML queries BIBAKFull-Text 1001-1010
  Huayu Wu; Tok Wang Ling; Liang Xu; Zhifeng Bao
Since more and more business data are represented in XML format, there is a compelling need of supporting analytical operations in XML queries. Particularly, the latest version of XQuery proposed by W3C, XQuery 1.1, introduces a new construct to explicitly express grouping operation in FLWOR expression. Existing works in XML query processing mainly focus on physically matching query structure over XML document. Given the explicit grouping operation in a query, how to efficiently compute grouping and aggregate functions over XML document is not well studied yet. In this paper, we extend our previous XML query processing algorithm, VERT, to efficiently perform grouping and aggregate function in queries. The main technique of our approach is introducing relational tables to index values. Query pattern matching and aggregation computing are both conducted with table indices. We also propose two semantic optimizations to further improve the query performance. Finally we present experimental results to validate the efficiency of our approach, over other existing approaches.
Keywords: aggregate function, grouping, query processing, xml
XQuery in the browser BIBAKFull-Text 1011-1020
  Ghislain Fourny; Markus Pilman; Daniela Florescu; Donald Kossmann; Tim Kraska; Darin McBeath
Since the invention of the Web, the browser has become more and more powerful. By now, it is a programming and execution environment in itself. The predominant language to program applications in the browser today is JavaScript. With browsers becoming more powerful, JavaScript has been extended and new layers have been added (e.g., DOM-Support and XPath). Today, JavaScript is very successful and applications and GUI features implemented in the browser have become increasingly complex. The purpose of this paper is to improve the programmability of Web browsers by enabling the execution of XQuery programs in the browser. Although it has the potential to ideally replace JavaScript, it is possible to run it in addition to JavaScript for more flexibility. Furthermore, it allows instant code migration from the server to the client and vice-versa. This enables a significant simplification of the technology stack. The intuition is that programming the browser involves mostly XML (i.e., DOM) navigation and manipulation, and the XQuery family of W3C standards were designed exactly for that purpose. The paper proposes extensions to XQuery for Web browsers and gives a number of examples that demonstrate the usefulness of XQuery for the development of AJAX-style applications. Furthermore, the paper presents the design of an XQuery plug-in for Microsoft's Internet Explorer. The paper also gives examples of applications which were developed with the help of this plug-in.
Keywords: CSS, HTML, XHTML, XML, XQuery, browser, client-side programming, dom, events, javascript, mash-up, script, scripting, stylesheets
Answering approximate queries over autonomous web databases BIBAKFull-Text 1021-1030
  Xiangfu Meng; Z. M. Ma; Li Yan
To deal with the problem of empty or too little answers returned from a Web database in response to a user query, this paper proposes a novel approach to provide relevant and ranked query results. Based on the user original query, we speculate how much the user cares about each specified attribute and assign a corresponding weight to it. This original query is then rewritten as an approximate query by relaxing the query criteria range. The relaxation order of all specified attributes and the relaxed degree on each specified attribute are varied with the attribute weights. For the approximate query results, we generate users' contextual preferences from database workload and use them to create a priori orders of tuples in an off-line preprocessing step. Only a few representative orders are saved, each corresponding to a set of contexts. Then, these orders and associated contexts are used at query time to expeditiously provide ranked answers. Results of a preliminary user study demonstrate that our query relaxation and results ranking methods can capture the user's preferences effectively. The efficiency and effectiveness of our approach is also demonstrated by experimental result.
Keywords: query relaxation, query results ranking, top-k., web database

WWW in ibero-america

Analyzing seller practices in a Brazilian marketplace BIBAKFull-Text 1031-1040
  Adriano Pereira; Diego Duarte; Wagner, Jr. Meira; Virgilio Almeida; Paulo Góes
E-commerce is growing at an exponential rate. In the last decade, there has been an explosion of online commercial activity enabled by World Wide Web (WWW). These days, many consumers are less attracted to online auctions, preferring to buy merchandise quickly using fixed-price negotiations. Sales at Amazon.com, the leader in online sales of fixed-price goods, rose 37% in the first quarter of 2008. At eBay, where auctions make up 58% of the site's sales, revenue rose 14%. In Brazil, probably by cultural influence, online auctions are not been popular. This work presents a characterization and analysis of fixed-price online negotiations. Using actual data from a Brazilian marketplace, we analyze seller practices, considering seller profiles and strategies. We show that different sellers adopt strategies according to their interests, abilities and experience. Moreover, we confirm that choosing a selling strategy is not simple, since it is important to consider the seller's characteristics to evaluate the applicability of a strategy. The work also provides a comparative analysis of some selling practices in Brazil with popular worldwide marketplaces.
Keywords: e-commerce, e-markets, marketplaces, selling practices
A geographical analysis of knowledge production in computer science BIBAKFull-Text 1041-1050
  Guilherme Vale Menezes; Nivio Ziviani; Alberto H. F. Laender; Virgílio Almeida
We analyze knowledge production in Computer Science by means of coauthorship networks. For this, we consider 30 graduate programs of different regions of the world, being 8 programs in Brazil, 16 in North America (3 in Canada and 13 in the United States), and 6 in Europe (2 in France, 1 in Switzerland and 3 in the United Kingdom). We use a dataset that consists of 176,537 authors and 352,766 publication entries distributed among 2,176 publication venues. The results obtained for different metrics of collaboration social networks indicate the process of knowledge creation has changed differently for each region. Research is increasingly done in teams across different fields of Computer Science. The size of the giant component indicates the existence of isolated collaboration groups in the European network, contrasting to the degree of connectivity found in the Brazilian and North-American counterparts. We also analyzed the temporal evolution of the social networks representing the three regions. The number of authors per paper experienced an increase in a time span of 12 years. We observe that the number of collaborations between authors grows faster than the number of authors, benefiting from the existing network structure. The temporal evolution shows differences between well-established fields, such as Databases and Computer Architecture, and emerging fields, like Bioinformatics and Geoinformatics. The patterns of collaboration analyzed in this paper contribute to an overall understanding of Computer Science research in different geographical regions that could not be achieved without the use of complex networks and a large publication database.
Keywords: coauthorship networks, collaboration social networks, computer science

Posters Wednesday, April 22, 2009

Competitive analysis from click-through log BIBAKFull-Text 1051-1052
  Gang Wang; Jian Hu; Yunzhang Zhu; Hua Li; Zheng Chen
Existing keyword suggestion tools from various search engine companies could automatically suggest keywords related to the advertisers' products or services, counting in simple statistics of the keywords, such as search volume, cost per click (CPC), etc. However, the nature of the generalized Second Price Auction suggests that better understanding the competitors' keyword selection and bidding strategies better helps to win the auction, other than only relying on general search statistics. In this paper, we propose a novel keyword suggestion strategy, called Competitive Analysis, to explore the keyword based competition relationships among advertisers and eventually help advertisers to build campaigns with better performance. The experimental results demonstrate that the proposed Competitive Analysis can both help advertisers to promote their product selling and generate more revenue to the search engine companies.
Keywords: competitive analysis, keyword advertising, keyword suggestion
Predicting click through rate for job listings BIBAKFull-Text 1053-1054
  Manish Gupta
Click Through Rate (CTR) is an important metric for ad systems, job portals, recommendation systems. CTR impacts publisher's revenue, advertiser's bid amounts in "pay for performance" business models. We learn regression models using features of the job, optional click history of job, features of "related" jobs. We show that our models predict CTR much better than predicting avg. CTR for all job listings, even in absence of the click history for the job listing.
Keywords: CPC, CTR, GBDT, click through rate, gradient boosted decision trees, jobs, linear regression, prediction, treenet
Query clustering using click-through graph BIBAKFull-Text 1055-1056
  Jeonghee Yi; Farzin Maghoul
In this paper we describe a problem of discovering query clusters from a click-through graph of web search logs. The graph consists of a set of web search queries, a set of pages selected for the queries, and a set of directed edges that connects a query node and a page node clicked by a user for the query. The proposed method extracts all maximal bipartite cliques (bicliques) from a click-through graph and compute an equivalence set of queries (i.e., a query cluster) from the maximal bicliques. A cluster of queries is formed from the queries in a biclique. We present a scalable algorithm that enumerates all maximal bicliques from the click-through graph. We have conducted experiments on Yahoo web search queries and the result is promising.
Keywords: biclique, click-through graph, equivalence set, maximal bipartite clique, query clustring, query intent
An effective semantic search technique using ontology BIBAKFull-Text 1057-1058
  Jihyun Lee; Jun-Ki Min; Chin-Wan Chung
In this paper, we present a semantic search technique considering the type of desired Web resources and the semantic relationships between the resources and the query keywords in the ontology. In order to effectively retrieve the most relevant top-k resources, we propose a novel ranking model. To do this, we devise a measure to determine the weight of the semantic relationship. In addition, we consider the number of meaningful semantic relationships between a resource and keywords, the coverage of keywords, and the distinguishability of keywords. Through experiments using real datasets, we observe that our ranking model provides more accurate semantic search results compared to existing ranking models.
Keywords: ontology, ranking, semantic relationship, semantic search, semantic web
Relationalizing RDF stores for tools reusability BIBAKFull-Text 1059-1060
  Sunitha Ramanujam; Anubha Gupta; Latifur Khan; Steven Seida; Bhavani Thuraisingham
The emergence of Semantic Web technologies and standards such as Resource Description Framework (RDF) has introduced novel data storage models such as the RDF Graph Model. In this paper, we present a research effort called R2D, which attempts to bridge the gap between RDF and RDBMS concepts by presenting a relational view of RDF data stores. Thus, R2D is essentially a relational wrapper around RDF stores that aims to make the variety of stable relational tools that are currently in the market available to RDF stores without data duplication and synchronization issues.
Keywords: RDBMs, data interoperability, query processing, resource description framework (rdf), visualization tools
C-SPARQL: SPARQL for continuous querying BIBAKFull-Text 1061-1062
  Davide Francesco Barbieri; Daniele Braga; Stefano Ceri; Emanuele Della Valle; Michael Grossniklaus
C-SPARQL is an extension of SPARQL to support continuous queries, registered and continuously executed over RDF data streams, considering windows of such streams. Supporting streams in RDF format guarantees interoperability and opens up important applications, in which reasoners can deal with knowledge that evolves over time. We present C-SPARQL by means of examples in Urban Computing.
Keywords: RDF, SPARQL, data streams
Interactive search in XML data BIBAKFull-Text 1063-1064
  Guoliang Li; Jianhua Feng; Lizhu Zhou
In a traditional keyword-search system over XML data, a user composes a keyword query, submits it to the system, and retrieves relevant subtrees. In the case where the user has limited knowledge about the data, often the user feels "left in the dark" when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study a new information-access paradigm for XML data, called "Inks," in which the system searches on the underlying data "on the fly" as the user types in query keywords. Inks extends existing XML keyword search methods by interactively answering queries. We propose effective indices, early-termination techniques, and efficient search algorithms to achieve a high interactive speed. We have implemented our algorithm, and the experimental results show that our method achieves high search efficiency and result quality.
Keywords: interactive search, keyword search, xml
Is there anything worth finding on the semantic web? BIBAKFull-Text 1065-1066
  Harry Halpin
There has recently been an upsurge of interest in the possibilities of combining structured data and ad-hoc information retrieval from traditional hypertext. In this experiment, we run queries extracted from a query log of a major search engine against the Semantic Web to discover if the Semantic Web has anything of interest to the average user. We show that there is indeed much information on the Semantic Web that could be relevant for many queries for people, places and even abstract concepts, although they are overwhelmingly clustered around a Semantic Web-enabled export of Wikipedia known as DBPedia.
Keywords: information retrieval, linked data, search, semantic web
Instance-based probabilistic reasoning in the semantic web BIBAKFull-Text 1067-1068
  Pedro Oliveira; Paulo Gomes
Most of the approaches for dealing with uncertainty in the Semantic Web rely on the principle that this uncertainty is already asserted. In this paper, we propose a new approach to learn and reason about uncertainty in the Semantic Web. Using instance data, we learn the uncertainty of an OWL ontology, and use that information to perform probabilistic reasoning on it. For this purpose, we use Markov logic, a new representation formalism that combines logic with probabilistic graphical models.
Keywords: markov logic, probabilistic reasoning, semantic web
A flight meta-search engine with metamorph BIBAKFull-Text 1069-1070
  Bernhard Kruepl; Wolfgang Holzinger; Yansen Darmaputra; Robert Baumgartner
We demonstrate a flight meta-search engine that is based on the Metamorph framework. Metamorph provides mechanisms to model web forms together with the interactions which are needed to fulfil a request, and can generate interaction sequences that pose queries using these web forms and collect the results. In this paper, we discuss an interesting new feature that makes use of the forms themselves as an information source. We show how data can be extracted from web forms (rather than the data behind web forms) to generate a graph of flight connections between cities.
   The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios; those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower price fares by using more than one airline for a journey. We describe a system which attains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests.
Keywords: hidden web, web data extraction, web form extraction, web form mapping
Thumbs-up: a game for playing to rank search results BIBAKFull-Text 1071-1072
  Ali Dasdan; Chris Drome; Santanu Kolay
Human computation is an effective way to channel human effort spent playing games to solving computational problems that are easy for humans but difficult for computers to automate. We propose Thumbs-Up, a new game for human computation with the purpose of playing to rank search result. Our experience from users shows that Thumbs-Up is not only fun to play, but produces more relevant rankings than both a major search engine and optimal rank aggregation using the Kemeny rule.
Keywords: games with a purpose, human computation, online games, rank aggregation, relevance, search engine
Search shortcuts: driving users towards their goals BIBAKFull-Text 1073-1074
  Ranieri Baraglia; Fidel Cacheda; Victor Carneiro; Vreixo Formoso; Raffaele Perego; Fabrizio Silvestri
Giving suggestions to users of Web-based services is a common practice aimed at enhancing their navigation experience. Major Web Search Engines usually provide "Suggestions" under the form of queries that are, to some extent, related to the current query typed by the user, and the knowledge learned from the past usage of the system. In this work we introduce "Search Shortcuts" as "Successful" queries allowed, in the past, users to satisfy their information needs. Differently from conventional suggestion techniques, our search shortcuts allows to evaluate effectiveness by exploiting a simple train-and-test approach. We have applied several Collaborative Filtering algorithms to this problem, evaluating them on a real query log data. We generate the shortcuts from all user sessions belonging to the testing set, and measure the quality of the shortcuts suggested by considering the similarity between them and the navigational user behavior.
Keywords: evaluation, model, search shortcut
A probabilistic model based approach for blended search BIBAKFull-Text 1075-1076
  Ning Liu; Jun Yan; Zheng Chen
In this paper, we propose to model the blended search problem by assuming conditional dependencies among queries, VSEs and search results. The probability distributions of this model are learned from search engine query log through unigram language model. Our experimental exploration shows that, (1) a large number of queries in generic Web search have vertical search intentions; and (2) our proposed algorithm can effectively blend vertical search results into generic Web search, which can improve the Mean Average Precision (MAP) by as much as 16% compared to traditional Web search without blending.
Keywords: blended search, language model, query log, vertical search
Bucefalo: a tool for intelligent search and filtering for web-based personal health records BIBAKFull-Text 1077-1078
  Francisco P. Romero; Jesus Serrano-Guerrero; Jose A. Olivas
In this poster, a tool named BUCEFALO is presented. This tool is specially designed to improve the information retrieval tasks in web-based Personal Health Records (PHR). This tool implements semantic and multilingual query expansion techniques and information filtering algorithms in order to help users find the most valuable information about a specific clinical case. The filtering model is based on fuzzy prototypes based filtering, data quality measures, user profiles and healthcare ontologies. The first experimental results illustrate the feasibility of this tool.
Keywords: information filtering, web-based personal health record
Dataplorer: a scalable search engine for the data web BIBAKFull-Text 1079-1080
  Haofen Wang; Qiaoling Liu; Gui-Rong Xue; Yong Yu; Lei Zhang; Yue Pan
More and more structured information in the form of semantic data is nowadays available. It offers a wide range of new possibilities especially for semantic search and Web data integration. However, their effective exploitation still brings about a number of challenges, e.g. usability, scalability and uncertainty. In this paper, we present Dataplorer, a solution designed to address these challenges. We consider the usability through the use of hybrid queries and faceted search, while still preserving the scalability thanks to an extension of inverted index to support this type of query. Moreover, Dataplorer deals with uncertainty by means of a powerful ranking scheme to find relevant results. Our experimental results show that our proposed approach is promising and it makes us believe that it is possible to extend the current IR infrastructure to query and search the Web of data.
Keywords: faceted search, hybrid query, inverted index, ranking
Threshold selection for web-page classification with highly skewed class distribution BIBAKFull-Text 1081-1082
  Xiaofeng He; Lei Duan; Yiping Zhou; Byron Dom
We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
Keywords: binary classifier, skewed class distribution, stratified sampling, threshold selection, web-page classification
Web-scale classification with naive bayes BIBAKFull-Text 1083-1084
  Congle Zhang; Gui-Rong Xue; Yong Yu; Hongyuan Zha
Traditional Naive Bayes Classifier performs miserably on web-scale taxonomies. In this paper, we investigate the reasons behind such bad performance. We discover that the low performance are not completely caused by the intrinsic limitations of Naive Bayes, but mainly comes from two largely ignored problems: contradiction pair problem and discriminative evidence cancelation problem. We propose modifications that can alleviate the two problems while preserving the advantages of Naive Bayes. The experimental results show our modified Naive Bayes can significantly improve the performance on real web-scale taxonomies.
Keywords: naive bayes, web-scale taxonomy
News article extraction with template-independent wrapper BIBAKFull-Text 1085-1086
  Junfeng Wang; Xiaofei He; Can Wang; Jian Pei; Jiajun Bu; Chun Chen; Ziyu Guan; Gang Lu
We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
Keywords: data extraction
Graffiti: node labeling in heterogeneous networks BIBAKFull-Text 1087-1088
  Ralitsa Angelova; Gjergji Kasneci; Fabian M. Suchanek; Gerhard Weikum
We introduce a multi-label classification model and algorithm for labeling heterogeneous networks, where nodes belong to different types and different types have different sets of classification labels. We present a graph-based approach which models the mutual influence between nodes in the network as a random walk. When viewing class labels as "colors", the random surfer is "spraying" different node types with different color palettes; hence the name Graffiti. We demonstrate the performance gains of our method by comparing it to three state-of-the-art techniques for graph-based classification.
Keywords: graph classification, web 2.0 ir
Graph based crawler seed selection BIBAKFull-Text 1089-1090
  Shuyi Zheng; Pavel Dmitriev; C. Lee Giles
This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is not a trivial but very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a collection with more "good" and less "bad" pages. Based on the analysis of the graph structure of the web, we propose several seed selection algorithms. Effectiveness of these algorithms is proved by our experimental results on real web data.
Keywords: crawler, graph analysis, pagerank, seed selection
Building term suggestion relational graphs from collective intelligence BIBAKFull-Text 1091-1092
  Jyh-Ren Shieh; Yung-Huan Hsieh; Yang-Ting Yeh; Tse-Chung Su; Ching-Yung Lin; Ja-Ling Wu
This paper proposes an effective approach to provide relevant search terms for conceptual Web search. 'Semantic Term Suggestion' function has been included so that users can find the most appropriate query term to what they really need. Conventional approaches for term suggestion involve extracting frequently occurring key terms from retrieved documents. They must deal with term extraction difficulties and interference from irrelevant documents. In this paper, we propose a semantic term suggestion function called Collective Intelligence based Term Suggestion (CITS). CITS provides a novel social-network based framework for relevant terms suggestion with a semantic graph of the search term without limiting to the specific query term. A visualization of semantic graph is presented to the users to help browsing search results from related terms in the semantic graph. The search results are ranked each time according to their relevance to the related terms in the entire query session. Comparing to two popular commercial search engines, a user study of 18 users on 50 search terms showed better user satisfactions and indicated the potential usefulness of proposed method in real-world search applications.
Keywords: keyword expansion, re-ranking, social network
Towards intent-driven bidterm suggestion BIBAKFull-Text 1093-1094
  William Chang; Patrick Pantel; Ana-Maria Popescu; Evgeniy Gabrilovich
In online advertising, pervasive in commercial search engines, advertisers typically bid on few terms, and the scarcity of data makes ad matching difficult. Suggesting additional bidterms can significantly improve ad clickability and conversion rates. In this paper, we present a large-scale bidterm suggestion system that models an advertiser's intent and finds new bidterms consistent with that intent. Preliminary experiments show that our system significantly increases the coverage of a state of the art production system used at Yahoo while maintaining comparable precision.
Keywords: sponsored search
Advertising keyword generation using active learning BIBAKFull-Text 1095-1096
  Hao Wu; Guang Qiu; Xiaofei He; Yuan Shi; Mingcheng Qu; Jing Shen; Jiajun Bu; Chun Chen
This paper proposes an efficient relevance feedback based interactive model for keyword generation in sponsored search advertising. We formulate the ranking of relevant terms as a supervised learning problem and suggest new terms for the seed by leveraging user relevance feedback information. Active learning is employed to select the most informative samples from a set of candidate terms for user labeling. Experiments show our approach improves the relevance of generated terms significantly with little user effort required.
Keywords: active learning, keyword generation, sponsored search
Gamesense BIBAKFull-Text 1097-1098
  Lusong Li; Tao Mei; Chris Liu; Xian-Sheng Hua
This paper presents a novel game-like advertising system called GameSense, which is driven by the compelling contents of online images. Given a Web page which typically contains images, GameSense is able to select suitable images to create online in-image games for advertising. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The ads are selected based on not only textual relevance but also visual content similarity. The game is able to provide viewers rich experience and thus promote the embedded ads to provide more effective advertising.
Keywords: image advertising, online game
Rare item detection in e-commerce site BIBAKFull-Text 1099-1100
  Dan Shen; Xiaoyuan Wu; Alvaro Bolivar
As the largest online marketplace in the world, eBay has a huge inventory where there are plenty of great rare items with potentially large, even rapturous buyers. These items are obscured in long tail of eBay item listing and hard to find through existing searching or browsing methods. It is observed that there are great rarity demands from users according to eBay query log. To keep up with the demands, the paper proposes a method to automatically detect rare items in eBay online listing. A large set of features relevant to the task are investigated to filter items and further measure item rareness. The experiments on the most rarity-demand-intensitive domains show that the method may effectively detect rare items (>90% precision).
Keywords: long tail theory, rare item detection, rareness measure
A declarative framework for semantic link discovery over relational data BIBAKFull-Text 1101-1102
  Oktie Hassanzadeh; Lipyeow Lim; Anastasios Kementsietsidis; Min Wang
In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, that allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using the semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers to easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of web.
Keywords: link discovery, linked data, record linkage, semantic web
Modeling semantics and structure of discussion threads BIBAKFull-Text 1103-1104
  Chen Lin; Jiang-Ming Yang; Rui Cai; Xin-Jing Wang; Wei Wang; Lei Zhang
The abundant knowledge in web communities has motivated the research interests in discussion threads. The dynamic nature of discussion threads poses interesting and challenging problems for computer scientists. Although techniques such as semantic models or structural models have been shown to be useful in a number of areas, they are inefficient in understanding discussion threads due to the temporal dependence among posts in a discussion thread. Such dependence causes that semantics and structure coupled with each other in discussion threads. In this paper, we propose a sparse coding-based model named SMSS to Simultaneously Model Semantic and Structure of discussion threads.
Keywords: reply reconstruction, sparse coding, threaded discussion
Combining anchor text categorization and graph analysis for paid link detection BIBAKFull-Text 1105-1106
  Kirill Nikolaev; Ekaterina Zudina; Andrey Gorshkov
In order to artificially boost the rank of commercial pages in search engine results, search engine optimizers pay for links to these pages on other websites. Identifying paid links is important for a web search engine to produce highly relevant results. In this paper we introduce a novel method of identifying such links. We start with training a classifier of anchor text topics and analyzing web pages for diversity of their outgoing commercial links. Then we use this information and analyze link graph of the Russian Web to find pages that sell links and sites that buy links and to identify the paid links. Testing on manually marked samples showed high efficiency of the algorithm.
Keywords: categorization, language model, link analysis, machine learning, search engines, web mining
Rethinking email message and people search BIBAKFull-Text 1107-1108
  Sebastian Michel; Ingmar Weber
We show how a number of novel email search features can be implemented without any kind of natural language processing (NLP) or advanced data mining. Our approach inspects the email headers of all messages a user has ever sent or received and it creates simple per-contact summaries, including simple information about the message exchange history, the domain of the sender or even the sender's gender. With these summaries advanced questions/tasks such as "Who do I still need to reply to?" or "Find 'fun' messages sent by friends." become possible. As a proof of concept, we implemented a Mozilla-Thunderbird extension, adding powerful people search to the popular email client.
Keywords: email, email search, inbox 2.0, people search
Purely URL-based topic classification BIBAKFull-Text 1109-1110
  Eda Baykan; Monika Henzinger; Ludmila Marian; Ingmar Weber
Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable content filtering before an (objection-able) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser, without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to download it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
Keywords: ODP, URL, topic classification

Posters Thursday, April 23, 2009

WPBench: a benchmark for evaluating the client-side performance of web 2.0 applications BIBAKFull-Text 1111-1112
  Kaimin Zhang; Lu Wang; Xiaolin Guo; Aimin Pan; Bin Benjamin Zhu
In this paper, a benchmark called WPBench is reported to evaluate the responsiveness of Web browsers for modern Web 2.0 applications. In WPBench, variations of servers and networks are removed and the benchmark result is the closest to what Web users would perceive. To achieve these, WPBench records users' interactions with typical Web 2.0 applications, and then replays Web navigations when benchmarking browsers. The replay mechanism can emulate the actual user interactions and the characteristics of the servers and the networks in a consistent way independent of browsers so that any browser compliant to the standards can be benchmarked fairly. In addition to describing the design and generation of WPBench, we also report the WPBench comparison results on the responsiveness performance for three popular Web browsers: Internet Explorer, Firefox and Chrome.
Keywords: benchmark, browser, javascript, replay, web
MASTH proxy: an extensible platform for web overload control BIBAKFull-Text 1113-1114
  Vipul Mathur; Sanket Dhopeshwarkar; Varsha Apte
Many overload control mechanisms for Web based applications aim to prevent overload by setting limits on factors such as admitted load, number of server threads, buffer size. For this they need online measurements of metrics such as response time, throughput, and resource utilization. This requires instrumentation of the server by modifying server code, which may not be feasible or desirable. An alternate approach is to use a proxy between the clients and servers.
   We have developed a proxy-based overload control platform called MASTH Proxy -- Multi-class Admission-controlled Self-Tuning HTTP Proxy. It records detailed measurements, supports multiple request classes, manages queues of HTTP requests, provides tunable parameters and enables easy implementation of dynamic overload control. This gives designers of overload control schemes a platform where they can concentrate on developing the core control logic, without the need to modify upstream server code.
Keywords: admission control, multi-class, overload, proxy, web server
A messaging API for inter-widgets communication BIBAKFull-Text 1115-1116
  Stéphane Sire; Micaël Paquier; Alain Vagner; Jérôme Bogaerts
Widget containers are used everywhere on the Web, for instance as customizable start pages to Web desktops. In this poster, we describe the extension of a widget container with an inter-widgets communication layer, as well as the subsequent application programming interfaces (APIs) added to the Widget object to support this feature. We present the benefits of a drag and drop facility within widgets and conclude by a call for standardization of inter-widgets communication on the Web.
Keywords: communication, drag and drop, portal, start page, widget
Why are moved web pages difficult to find?: the WISH approach BIBAKFull-Text 1117-1118
  Atsuyuki Morishima; Akiyoshi Nakamizo; Toshinari Iida; Shigeo Sugimoto; Hiroyuki Kitagawa
This paper addresses the problem of finding new locations of moved Web pages. We discuss why the content-based approach has a limitation in solving the problem and why it is important to exploit the knowledge on where to search for the pages.
Keywords: broken links, integrity management
Detecting soft errors by redirection classification BIBAKFull-Text 1119-1120
  Taehyung Lee; Jinil Kim; Jin Wook Kim; Sung-Ryul Kim; Kunsoo Park
A soft error redirection is a URL redirection to a page that returns the HTTP status code 200 (OK) but has actually no relevant content to the client request. Since such redirections degrade the performance of web search engines in many ways, it is highly desirable to remove as many of them as possible. We propose a novel approach to detect soft error redirections by analyzing redirection logs collected during crawling operation. Experimental results on huge crawl data show that our measure can classify soft error redirections effectively.
Keywords: search engine, soft 404, spam, url redirection
Automatic web service composition with abstraction and refinement BIBAKFull-Text 1121-1122
  Hyunyoung Kil; Wonhong Nam; Dongwon Lee
The behavioral description based Web Service Composition (WSC) problem aims at the automatic construction of a coordinator web service that controls a set of web services to reach a goal state. However, solving the WSC problem exactly with a realistic model is doubly-exponential in the number of variables in web service descriptions. In this paper, we propose a novel efficient approximation-based algorithm using automatic abstraction and refinement to dramatically reduce the number of variables needed to solve the problem.
Keywords: abstraction, refinement, service composition
Where to adapt dynamic service compositions BIBAKFull-Text 1123-1124
  Bo Jiang; W. K. Chan; Zhenyu Zhang; T. H. Tse
Peer services depend on one another to accomplish their tasks, and their structures may evolve. A service composition may be designed to replace its member services whenever the quality of the composite service fails to meet certain quality-of-service (QoS) requirements. Finding services and service invocation endpoints having the greatest impact on the quality are important to guide subsequent service adaptations. This paper proposes a technique that samples the QoS of composite services and continually analyzes them to identify artifacts for service adaptation. The preliminary results show that our technique has the potential to effectively find such artifacts in services.
Keywords: service adaptation, service composition
SGPS: a semantic scheme for web service similarity BIBAKFull-Text 1125-1126
  Sourish Dasgupta; Satish Bhat; Yugyung Lee
Today's Web becomes a platform for services to be dynamically interconnected to produce a desired outcome. It is important to formalize the semantics of the contextual elements of web services. In this paper, we propose a novel technique called Semantic Genome Propagation Scheme (SGPS) for measuring similarity between semantic concepts. We show how SGPS is used to compute a multi-dimensional similarity between two services. We evaluate the SGPS similarity measurement in terms of the similarity performance and scalability.
Keywords: context, semantics, similarity, web services
Automated synthesis of composite services with correctness guarantee BIBAKFull-Text 1127-1128
  Ting Deng; Jinpeng Huai; Xianxian Li; Zongxia Du; Huipeng Guo
In this paper, we propose a novel approach for composing existing web services to satisfy the correctness constraints to the design, including freeness of deadlock and unspecified reception, and temporal constraints in Computation Tree Logic formula. An automated synthesis algorithm based on learning algorithm is introduced, which guarantees that the composite service is the most general way of coordinating services so that the correctness is ensured. We have implemented a prototype system evaluating the effectiveness and efficiency of our synthesis approach through an experimental study.
Keywords: composition synthesis, correctness constraints, learning algorithm
User-centric content freshness metrics for search engines BIBAKFull-Text 1129-1130
  Ali Dasdan; Xinh Huynh
In order to return relevant search results, a search engine must keep its local repository synchronized to the Web, but it is usually impossible to attain perfect freshness. Hence, it is vital for a production search engine continually to monitor and improve repository freshness. Most previous freshness metrics, formulated in the context of developing better synchronization policies, focused on the web crawler while ignoring other parts of a search engine. But, the freshness of documents in a web crawler does not necessarily translate directly into the freshness of search results as seen by users. We propose metrics for measuring freshness from a user's perspective, which take into account the latency between when documents are crawled and when they are viewed by users, as well as the variation in user click and view frequency among different documents. We also describe a practical implementation of these metrics that were used in a production search engine.
Keywords: crawling, document age, freshness, latency, metrics, monitoring, search engine
Reliability analysis using weighted combinational models for web-based software BIBAKFull-Text 1131-1132
  Chao-Jung Hsu; Chin-Yu Huang
In the past, some researches suggested that engineers can use combined software reliability growth models (SRGMs) to obtain more accurate reliability prediction during testing. In this paper, three weighted combinational models, namely, equal, linear, and nonlinear weight, are proposed for reliability estimation of web-based software. We further investigate the estimation accuracy of using genetic algorithm to determine the weight assignment for the proposed models. Preliminary result shows that the linearly and nonlinearly weighted combinational models have better prediction capability than single SRGM and equally weighted combinational model for web-based software.
Keywords: genetic algorithm., software reliability growth model, weighted combination
sMash: semantic-based mashup navigation for data API network BIBAKFull-Text 1133-1134
  Bin Lu; Zhaohui Wu; Yuan Ni; Guotong Xie; Chunying Zhou; Huajun Chen
With the proliferation of data APIs, it is not uncommon that users who have no clear ideas about data APIs will encounter difficulties to build Mashups to satisfy their requirements. In this paper, we present a semantic-based mashup navigation system, sMash that makes mashup building easy by constructing and visualizing a real-life data API network. We build a sample network by gathering more than 300 popular APIs and find that the relationships between them are so complex that our system will play an important role in navigating users and give them inspiration to build interesting mashups easily. The system is accessible at: http://www.dart.zju.edu.cn/mashup.
Keywords: data api network, mashup navigation, semantic, social
Semantic wiki aided business process specification BIBAKFull-Text 1135-1136
  Toufeeq Hussain; Rajesh Balakrishnan; Amar Viswanathan
This paper formulates a collaborative system for modeling business application. The system uses a Semantic Wiki to enable collaboration between the various stakeholders involved in the design of the system and translates the captured intelligence into business models which are used for designing a business system.
Keywords: RDF, business modeling, business process modeling language, semantic web, software
Raise semantics at the user level for dynamic and interactive SOA-based portals BIBAKFull-Text 1137-1138
  Jean-Sebastien Brunner; Patrick Gatellier
In this paper, we describe the fully dynamic semantic portal we implemented, integrating Semantic Web technologies and Service Oriented Architecture (SOA). The goals of the portal are twofold: first it helps administrators to easily propose new features in the portal using semantics to ease the orchestration process; secondly it automatically generates a customized user interface for these scenarios. This user interface takes into account different devices and assists end-users in the use of the portal taking benefit of context awareness. All the added-value of this portal is based on a core semantics defined by an ontology. We present here the main features of this portal and how it was implemented using state-of-the-art technologies and frameworks.
Keywords: SOA, context, semantic portal, semantic web services
Towards lightweight and efficient DDOS attacks detection for web server BIBAKFull-Text 1139-1140
  Yang Li; Tian-Bo Lu; Li Guo; Zhi-Hong Tian; Qin-Wu Nie
In this poster, based on our previous work in building a lightweight DDoS (Distributed Denial-of-Services) attacks detection mechanism for web server using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and genetic algorithm based instance selection methods, we further propose a more efficient and effective instance selection method, named E-FCM (Extend Fuzzy C-Means). By using this method, we can obtain much cheaper training time for TCM-KNN while ensuring high detection performance. Therefore, the optimized mechanism is more suitable for lightweight DDoS attacks detection in real network environment.
Keywords: E-FCM algorithm, TCM-KNN algorithm, instance selection, web server anomaly detection
A general framework for adaptive and online detection of web attacks BIBAKFull-Text 1141-1142
  Wei Wang; Florent Masseglia; Thomas Guyet; Rene Quiniou; Marie-Odile Cordier
Detection of web attacks is an important issue in current defense-in-depth security framework. In this paper, we propose a novel general framework for adaptive and online detection of web attacks. The general framework can be based on any online clustering methods. A detection model based on the framework is able to learn online and deal with "concept drift" in web audit data streams. Str-DBSCAN that we extended DBSCAN to streaming data as well as StrAP are both used to validate the framework. The detection model based on the framework automatically labels the web audit data and adapts to normal behavior changes while identifies attacks through dynamical clustering of the streaming data. A very large size of real HTTP Log data collected in our institute is used to validate the framework and the model. The preliminary testing results demonstrated its effectiveness.
Keywords: anomaly detection, clustering, intrusion detection
PAKE-based mutual HTTP authentication for preventing phishing attacks BIBAKFull-Text 1143-1144
  Yutaka Oiwa; Hiromitsu Takagi; Hajime Watanabe; Hirofumi Suzuki
We developed a new Web authentication protocol with password-based mutual authentication which prevents various kinds of phishing attacks. This protocol provides a protection of user's passwords against any phishers even if a dictionary attack is employed, and prevents phishers from imitating a false sense of successful authentication to users. The protocol is designed considering interoperability with many recent Web applications which requires many features which current HTTP authentication does not provide. The protocol is proposed as an Internet Draft submitted to IETF, and implemented in both server side (as an Apache extension) and client side (as a Mozilla-based browser and an IE-based one).
Keywords: http, mutual authentication, web systems
Inferring private information using social network data BIBAKFull-Text 1145-1146
  Jack Lindamood; Raymond Heatherly; Murat Kantarcioglu; Bhavani Thuraisingham
On-line social networks, such as Facebook, are increasingly utilized by many users. These networks allow people to publish details about themselves and connect to their friends. Some of the information revealed inside these networks is private and it is possible that corporations could use learning algorithms on the released data to predict undisclosed private information. In this paper, we explore how to launch inference attacks using released social networking data to predict undisclosed private information about individuals. We then explore the effectiveness of possible sanitization techniques that can be used to combat such inference attacks under different scenarios.
Keywords: inference, privacy, social networks
Privacy preserving frequency capping in internet banner advertising BIBAKFull-Text 1147-1148
  Ayman Farahat
We describe an optimize-and-dispatch approach for delivering pay-per-impression advertisements in online advertising. The platform provider for an advertising network commits to showing advertisers' banner ads while capping the number of advertising message shown to a unique user as the user transitions through the network. The traditional approach for enforcing frequency caps has been to use cross-site cookies to track users. However, cross-site cookies and other tracking mechanisms can infringe on the user privacy. In this paper, we propose a novel linear programming approach that decides when to show an ad to the user based solely on the page currently viewed by the users. We show that the frequency caps are fulfilled in expectation. We show the efficacy of that approach using simulation results.
Keywords: privacy, user model
Crosslanguage blog mining and trend visualisation BIBAKFull-Text 1149-1150
  Andreas Juffinger; Elisabeth Lex
People use weblogs to express thoughts, present ideas and share knowledge, therefore weblogs are extraordinarily valuable resources, among others, for trend analysis. Trends are derived from the chronological sequence of blog post count per topic. The comparison with a reference corpus allows qualitative statements over identified trends. We propose a crosslanguage blog mining and trend visualisation system to analyse blogs across languages and topics. The trend visualisation facilitates the identification of trends and the comparison with the reference news article corpus. To prove the correctness of our system we computed the correlation between trends in blogs and news articles for a subset of blogs and topics. The evaluation corroborated our hypothesis of a high correlation coefficient for these subsets and therefore the correctness of our system for different languages and topics is proven.
Keywords: blog mining, crosslanguage, trend visualisation
Crawling English-Japanese person-name transliterations from the web BIBAKFull-Text 1151-1152
  Satoshi Sato
Automatic compilation of lexicon is a dream of lexicon compilers as well as lexicon users. This paper proposes a system that crawls English-Japanese person-name transliterations from the Web, which works a back-end collector for automatic compilation of bilingual person-name lexicon. Our crawler collected 561K transliterations in five months. From them, an English-Japanese person-name lexicon with 406K entries has been compiled by an automatic post processing. This lexicon is much larger than other similar resources including English-Japanese lexicon of HeiNER obtained from Wikipedia.
Keywords: automatic lexicon compilation, mining transliteration pairs, person name
Near real time information mining in multilingual news BIBAKFull-Text 1153-1154
  Martin Atkinson; Erik Van der Goot
This paper presents a near real-time multilingual news monitoring and analysis system that forms the backbone of our research work. The system integrates technologies to address the problems related to information extraction and analysis of open source intelligence on the World Wide Web. By chaining together different techniques in text mining, automated machine learning and statistical analysis, we can automatically determine who, where and, to a certain extent, what is being reported in news articles.
Keywords: automated media monitoring, information mining and analysis, multi-linguality, open source text
Mining multilingual topics from wikipedia BIBAKFull-Text 1155-1156
  Xiaochuan Ni; Jian-Tao Sun; Jian Hu; Zheng Chen
In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted 'universal' topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.
Keywords: multilingual, topic modeling, universal-topics, wikipedia
Towards language-independent web genre detection BIBAKFull-Text 1157-1158
  Philipp Scholl; Renato Domínguez García; Doreen Böhnstedt; Christoph Rensing; Ralf Steinmetz
The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page's HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page's content. Our results show that it is possible to achieve a very good accuracy for a fully language independent detection of structured web genres.
Keywords: machine learning, structural features, web genres
The web of nations BIBAKFull-Text 1159-1160
  Sukwon Chung; Dungjit Shiowattana; Pavel Dmitriev; Su Chan
In this paper, we report on a large-scale study of structural differences among the national webs. The study is based on a web-scale crawl conducted in the summer 2008. More specifically, we study two graphs derived from this crawl, the nation graph, with nodes corresponding to nations and edges -- to links among nations, and the host graph, with nodes corresponding to hosts and edges -- to hyperlinks among pages on the hosts. Contrary to some of the previous work [2], our results show that webs of different nations are often very different from each other, both in terms of their internal structure, and in terms of their connectivity with other nations.
Keywords: host graph, nation graph, web graph, web structure
Cascading style sheets: a novel approach towards productive styling with today's standards BIBAKFull-Text 1161-1162
  Matthias Keller; Martin Nussbaumer
In this paper we present an approach of generating Cascading Style Sheet documents automatically if the desired effect on the content elements is specified. While a Web user agent resolves the CSS rules and computes their effect, our approach handles the way back. We argue, that this can remarkably improve CSS productivity, since the process of CSS authoring always involves this direction implicitly. Our approach claims a new and innovative way to reuse chunks of markup together with its presentation. It furthermore bears potential for the optimization and reorganization of CSS documents. We describe criteria for CSS code quality we oriented on, including a quantitative indicator for the abstractness of a CSS presentation specification. An evaluation and recomputation of the CSS for 25.000 HTML documents shows that concerning these criteria the automatically generated code comes close to manually authored code.
Keywords: CSS, HTML, abstractness factor, presentation authoring
Automatically filling form-based web interfaces with free text inputs BIBAKFull-Text 1163-1164
  Guilherme A. Toda; Eli Cortez; Filipe Mesquita; Altigran S. da Silva; Edleno Moura; Marden Neubert
On the web of today the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed by several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases, free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data rich free text to interact with form-based interfaces. Our solution takes a free text as input, extracts implicitly data values from it and fills appropriate fields using them. For this task, we rely on values of previous submissions for each field, which are freely obtained from the usage of form-based interfaces.
Keywords: data extraction, form filling, web applications
A densitometric analysis of web template content BIBAKFull-Text 1165-1166
  Christian Kohlschütter
What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statistical laws from the field of Quantitative Linguistics. I analyze the idiosyncrasy of template content compared to regular "full text" content and derive a simple yet suitable quantitative model.
Keywords: content analysis, noise removal, template detection, template removal, web page segmentation
A flexible dialogue system for enhancing web usability BIBAKFull-Text 1167-1168
  Marta Gatius; Meritxell González
In this paper, we study how the performance and usability of web dialogue systems could be enhanced by using an appropriate representation of the different types of knowledge involved in communication: general dialogue mechanisms, specific domain-restricted linguistic and conceptual knowledge and information on how well the communication process is doing. We describe the experiments carried out to analyze how to improve this knowledge representation in the web dialogue system we developed.
Keywords: adaptive dialogue systems, evaluation, mixed-initiative dialogues, web dialogue systems
Estimating web site readability using content extraction BIBAKFull-Text 1169-1170
  Thomas Gottron; Ludger Martin
Nowadays, information is primarily searched on the WWW. From a user perspective, the readability is an important criterion for measuring the accessibility and thereby the quality of an information. We show that modern content extraction algorithms help to estimate the readability of a web document quite accurate.
Keywords: content extraction, readability, usability
Web content accessibility guidelines: from 1.0 to 2.0 BIBAKFull-Text 1171-1172
  Miquel Termens; Mireia Ribera; Mercè Porras; Marc Boldú; Andreu Sulé; Pilar Paris
This poster explains the changes introduced in the Web Content Accessibility Guidelines (WCAG) 2.0 from WCAG 1.0 and proposes a checklist for adapting existing websites. Finally, it describes the most common criticisms of the WCAG and places them in the context of its origin and initial aims.
Keywords: WCAG 1.0, WCAG 2.0, accessibility regulations

Posters Friday, April 24, 2009

Mining cultural differences from a large number of geotagged photos BIBAKFull-Text 1173-1174
  Keiji Yanai; Bingyu Qiu
We propose a novel method to detect cultural differences over the world automatically by using a large amount of geotagged images on the photo sharingWeb sites such as Flickr. We employ the state-of-the-art object recognition technique developed in the research community of computer vision to mine representative photos of the given concept for representative local regions from a large-scale unorganized collection of consumer-generated geotagged photos. The results help us understand how objects, scenes or events corresponding to the same given concept are visually different depending on local regions over the world.
Keywords: flickr, geotag, object recognition, representative images
An experimental study of large-scale mobile social network BIBAKFull-Text 1175-1176
  Zheng-Bin Dong; Guo-Jie Song; Kun-Qing Xie; Jing-Yao Wang
Mobile social network is a typical social network where one or more individuals of similar interests or commonalities, conversing and connecting with one another using the mobile phone. Our works in this paper focus on the experimental study for this kind of social network with the support of large-scale real mobile call data. The main contributions can be summarized as three-fold: firstly, a large-scale real mobile phone call log of one city has been extracted from a mobile phone carrier in China to construct mobile social network; secondly, common features of traditional social networks, such as power law distribution and small diameter etc, have been experimented, with which we confirm that the mobile social network is a typical scale-free network and has small-world phenomenon; lastly, different from traditional analytical methods, important properties of the actors, such as gender and age, have been introduced into our experiments with some interesting findings about human behavior, for example, the middle-age people are more active than the young and old people, and the female is unusual more active than the male while in the old age.
Keywords: betweenness centrality, clustering coefficient, degree distribution, diameter, shortest path
A P2P based distributed services network for next generation mobile internet communications BIBAKFull-Text 1177-1178
  Yang Li; Yi-Chuan Wu; Jian-Ying Zhang; Jin Peng; Hong-Luan Liao; Yun-Fei Zhang
In this poster, we present a novel P2P (Peer to Peer) based distributed services network (DSN), which is a next generation operable and manageable distributed core network architecture and functional structure, proposed by China Mobile for telecommunication services and wireless Internet. Our preliminary implementations of P2P VoIP (Voice over Internet Protocol) system over DSN platform demonstrate its effectiveness and promising future.
Keywords: DSN, P2P, distributed computing, mobile communication
Securely implementing open geospatial consortium web service interface standards in oracle spatial BIBAKFull-Text 1179-1180
  Ning An; Raja Chatterjee; Mike Horhammer; Siva Ravada
In this paper, we briefly describe the implementation of various Open Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We highlight how we utilize Oracle's implementation of OASIS Web Services Security (WSS) to provide a robust security framework for these OGC Web Services. We also discuss our future direction in supporting OGC Web Service Interface Standards.
Keywords: geospatial, ogc web service interface standards, oracle spatial, security
Visualization of Geo-annotated pictures in mobile phones BIBAKFull-Text 1181-1182
  Roberto Manca; Francesco Massidda; Davide Carboni
In this work, a novel mobile browser for geo-referenced pictures is introduced and described. We use the term browser to denote a system aimed at browsing pictures selected from a large set like Internet photo sharing services. The criteria to filter a subset of pictures to browse are three: the user's actual position, the user's actual heading, and the user's preferences. In this work we only focus on the first two criteria leaving the integration of user's preferences for future developments.
Keywords: GPS, arduino, compass, geo-browsing, maps, mobile photo browsing
Deducing trip related information from flickr BIBAKFull-Text 1183-1184
  Adrian Popescu; Gregory Grefenstette
Uploading tourist photos is a popular activity on photo sharing platforms. These photographs and their associated metadata (tags, geo-tags, and temporal information) should be useful for mining information about the sites visited. However, user-supplied metadata are often noisy and efficient filtering methods are needed before extracting useful knowledge. We focus here on exploiting temporal information, associated with tourist sites that appear in Flickr. From automatically filtered sets of geo-tagged photos, we deduce answers to questions like "how long does it take to visit a tourist attraction?" or "what can I visit in one day in this city?" Our method is evaluated and validated by comparing the automatically obtained visit duration times to manual estimations.
Keywords: flickr, geographical gazetteer, georeferencing, image mining, text mining, tourist sites, visit times
Link based small sample learning for web spam detection BIBAKFull-Text 1185-1186
  Guang-Gang Geng; Qiudan Li; Xinchang Zhang
Robust statistical learning based web spam detection system often requires large amounts of labeled training data. However, labeled samples are more difficult, expensive and time consuming to obtain than unlabeled ones. This paper proposed link based semi-supervised learning algorithms to boost the performance of a classifier, which integrates the traditional Self-training with the topological dependency based link learning. The experiments with a few labeled samples on standard WEBSPAM-UK2006 benchmark showed that the algorithms are effective.
Keywords: content spam, link spam, machine learning, web spam
Detecting image spam using local invariant features and pyramid match kernel BIBAKFull-Text 1187-1188
  Haiqiang Zuo; Weiming Hu; Ou Wu; Yunfei Chen; Guan Luo
Image spam is a new obfuscating method which spammers invented to more effectively bypass conventional text based spam filters. In this paper, we extract local invariant features of images and run a one-class SVM classifier which uses the pyramid match kernel as the kernel function to detect image spam. Experimental results demonstrate that our algorithm is effective for fighting image spam.
Keywords: image spam, local invariant features, pyramid match kernel
Web image retrieval reranking with multi-view clustering BIBAKFull-Text 1189-1190
  Mingmin Chi; Peiwu Zhang; Yingbin Zhao; Rui Feng; Xiangyang Xue
General image retrieval is often carried out by a text-based search engine, such as Google Image Search. In this case, natural language queries are used as input to the search engine. Usually, the user queries are quite ambiguous and the returned results are not well-organized as the ranking often done by the popularity of an image. In order to address these problems, we propose to use both textual and visual contents of retrieved images to reRank web retrieved results. In particular, a machine learning technique, a multi-view clustering algorithm is proposed to reorganize the original results provided by the text-based search engine. Preliminary results validate the effectiveness of the proposed framework.
Keywords: multi-view clustering, reranking, web image retrieval
Characterizing web-based video sharing workloads BIBKFull-Text 1191-1192
  Siddharth Mitra; Mayank Agrawal; Amit Yadav; Niklas Carlsson; Derek Eager; Anirban Mahanti
Keywords: ugc, video sharing, workload characterization
Deriving music theme annotations from user tags BIBAKFull-Text 1193-1194
  Kerstin Bischoff; Claudiu S. Firan; Raluca Paiu
Music theme annotations would be really beneficial for supporting retrieval, but are often neglected by users while annotating. Thus, in order to support users in tagging and to fill the gaps in the tag space, in this paper we develop algorithms for recommending theme annotations. Our methods exploit already existing user tags, the lyrics of music tracks, as well as combinations of both. We compare the results for our recommended theme annotations against genre and style recommendations -- a much easier and already studied task. We evaluate the quality of our recommended tags against an expert ground truth data set. Our results are promising and provide interesting insights into possible extensions for music tagging systems to support music search.
Keywords: collaborative tagging, high-level music descriptors, metadata enrichment, theme tag recommendations
Tag-oriented document summarization BIBAKFull-Text 1195-1196
  Junyan Zhu; Can Wang; Xiaofei He; Jiajun Bu; Chun Chen; Shujie Shang; Mingcheng Qu; Gang Lu
Social annotations on a Web document are highly generalized description of topics contained in that page. Their tagged frequency indicates the user attentions with various degrees. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach by using both document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags. Meanwhile, association mining technique is employed to expand tag set to tackle the sparsity problem. Experimental results show our tag-oriented summarization has a significant improvement over those not using tags.
Keywords: document summarization, ranking, tag
Search result re-ranking based on gap between search queries and social tags BIBAKFull-Text 1197-1198
  Jun Yan; Ning Liu; Elaine Qing Chang; Lei Ji; Zheng Chen
Both search engine click-through log and social annotation have been utilized as user feedback for search result re-ranking. However, to our best knowledge, no previous study has explored the correlation between these two factors for the task of search result ranking. In this paper, we show that the gap between search queries and social tags of the same Web page can well reflect its user preference score. Motivated by this observation, we propose a novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for better user satisfaction. Intuitively, on one hand, the search users' intentions are generally described by their queries before they read the search results. On the other hand, the Web annotators semantically tag Web pages after they read the content of the pages. The difference between users' recognition of the same page before and after they read it is a good reflection of user satisfaction. In this extended abstract, we formally define the query set and tag set of the same page as users' pre- and post- knowledge respectively. We empirically show the strong correlation between user satisfaction and user's knowledge gap before and after reading the page. Based on this gap, experiments have shown outstanding performance of our proposed QTG algorithm in search result re-ranking.
Keywords: query log, search result ranking, social tagging
Signaling emotion in tagclouds BIBAKFull-Text 1199-1200
  Takeharu Eda; Toshio Uchiyama; Tadasu Uchiyama; Masatoshi Yoshikawa
In order to create more attractive tagclouds that get people interested in tagged content, we propose a simple but novel tagcloud where font size is determined by tag's entropy value, not the popularity to its content. Our method raises users' emotional interest in the content by emphasizing more emotional tags. Our initial experiments show that emotional tagclouds attract more attention than normal tagclouds at first look; thus they will enhance the role of tagcloud as a social signaller.
Keywords: classification, folksonomy, tagcloud, tagging
Two birds with one stone: a graph-based framework for disambiguating and tagging people names in web search BIBAKFull-Text 1201-1202
  Lili Jiang; Jianyong Wang; Ning An; Shengyuan Wang; Jian Zhan; Lian Li
The ever growing volume of Web data makes it increasingly challenging to accurately find relevant information about a specific person on the Web. To address the challenge caused by name ambiguity in Web people search, this paper explores a novel graph-based framework to both disambiguate and tag people entities in Web search results. Experimental results demonstrate the effectiveness of the proposed framework in tag discovery and name disambiguation.
Keywords: clustering, name disambiguation, tagging
The value of socially tagged urls for a search engine BIBAKFull-Text 1203-1204
  Santanu Kolay; Ali Dasdan
Social bookmarking has emerged as a growing source of human generated content on the web. In essence, bookmarking involves URLs and tags on them. In this paper, we perform a large scale study of the usefulness of bookmarked URLs from the top social bookmarking site Delicious. Instead of focusing on the dimension of tags, which has been covered in the previous work, we explore social bookmarking from the dimension of URLs. More specifically, we investigate the Delicious URLs and their content to quantify their value to a search engine. For their value in leading to good content, we show that the Delicious URLs have higher quality content and more external outlinks. For their value in satisfying users, we show that the Delicious URLs have more clicked URLs as well as get more clicks. We suggest that based on their value, the Delicious URLs should be used as another source of seed URLs for crawlers.
Keywords: content quality, delicious, social bookmarking
The recurrence dynamics of social tagging BIBAKFull-Text 1205-1206
  Dell Zhang; Robert Mao; Wei Li
How often do tags recur? How hard is predicting tag recurrence? What tags are likely to recur? We try to answer these questions by analysing the RSDC08 dataset, in both individual and collective settings. Our findings provide useful insights for the development of tag suggestion techniques etc.
Keywords: folksonomy, social tagging, web 2.0
Playful tagging: folksonomy generation using online games BIBAKFull-Text 1207-1208
  Markus Krause; Hidir Aras
Collaborative Tagging is a powerful method to create folksonomies that can be used to grasp/filter user preferences or enhance web search. Recent research has shown that depending on the number of users and the quality of user-provided tags powerful community-driven semantics or "ontologies" can emerge -- as it was evident analyzing user data from social web applications such as del.icio.us or Flickr. Unfortunately, most web pages do not contain tags and, thus, no vocabulary that describes the information provided. A common problem in web page annotation is to motivate users for constant participation, i.e. tagging. In this paper we describe our approach of a binary verification game that embeds collaborative tagging into on-line games in order to produce domain specific folksonomies.
Keywords: folksonomies, games with a purpose, human-based computation, social tagging
Identifying vertical search intention of query through social tagging propagation BIBAKFull-Text 1209-1210
  Ning Liu; Jun Yan; Weiguo Fan; Qiang Yang; Zheng Chen
A pressing task during the unification process is to identify a user's vertical search intention based on the user's query. In this paper, we propose a novel method to propagate social annotation, which includes user-supplied tag data, to both queries and VSEs for semantically bridging them. Our proposed algorithm consists of three key steps: query annotation, vertical annotation and query intention identification. Our algorithm, referred to as TagQV, verifies that the social tagging can be propagated to represent Web objects such as queries and VSEs besides Web pages. Experiments on real Web search queries demonstrate the effectiveness of TagQV in query intention identification.
Keywords: metadata, social annotation, vertical search engine (vse).
Social search and discovery using a unified approach BIBAKFull-Text 1211-1212
  Einat Amitay; David Carmel; Nadav Har'El; Shila Ofek-Koifman; Aya Soffer; Sivan Yogev; Nadav Golbandi
We explore new ways of improving a search engine using data from Web 2.0 applications such as blogs and social bookmarks. This data contains entities such as documents, people and tags, and relationships between them. We propose a simple yet effective method, based on faceted search, that treats all entities in a unified manner: returning all of them (documents, people and tags) on every search, and allowing all of them to be used as search terms. We describe an implementation of such a social search engine on the intranet of a large enterprise, and present large-scale experiments which verify the validity of our approach.
Keywords: enterprise search, faceted search, social search
Extracting community structure through relational hypergraphs BIBAKFull-Text 1213-1214
  Yu-Ru Lin; Jimeng Sun; Paul Castro; Ravi Konuru; Hari Sundaram; Aisling Kelliher
Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structure in rich media social networks, and observe how it evolves over time, through analysis of multi-relational data. The problem is important in the enterprise domain where extracting emergent community structure on enterprise social media, can help in forming new collaborative teams, aid in expertise discovery, and guide long term enterprise reorganization. Our approach consists of three main parts: (1) a relational hypergraph model for modeling various social context and interactions; (2) a novel hypergraph factorization method for community extraction on multi-relational social data; (3) an on-line method to handle temporal evolution through incremental hypergraph factorization. Extensive experiments on real-world enterprise data suggest that our technique is scalable and can extract meaningful communities. To evaluate the quality of our mining results, we use our method to predict users' future interests. Our prediction outperforms baseline methods (frequency counts, pLSA) by 36-250% on the average, indicating the utility of leveraging multi-relational social context by using our method.
Keywords: community evolution, dynamic social network analysis, non-negative tensor factorization, relational hypergraph
Ranking user-created contents by search user's inclination in online communities BIBAKFull-Text 1215-1216
  Hyoseop Shin; Jeehoon Lee
Searching posts effectively has become an important issue in large-scale online communities. Especially, if search users have different inclinations when they search posts, they have different kinds of posts in their minds. To address this problem, in this paper, we propose a scheme of ranking posts based on search users' inclination. User ranking score is employed to capture posts that are relevant to a specific user inclination. Specifically, we present a scheme to rank posts in terms of user expertise and popularity. Experimental results show that different user inclinations can produce quite different search results and the proposed scheme achieves about 70% accuracy.
Keywords: expertise, online community, popularity, post ranking, search user inclination, user ranking
Retaining personal expression for social search BIBAKFull-Text 1217-1218
  Praphul Chandra; Ajay Gupta
Web is being extensively used for personal expression, which includes ratings, reviews, recommendations, blogs. This user created content, e.g. book review on Amazon.com, becomes the property of the website, and the user often does not have easy access to it. In some cases, user's feedback may get averaged with feedback from other users e.g. ratings of a video. We argue that the creator of such content needs to be able to retain (a link to) her created content. We introduce the concept of MEB which is a user controlled store of such retained links. A MEB allows a user to access/share all the reviews she has given on different websites. With this capability users can allow their friends to search through their feedback. Searching through one's social network allows harnessing the power of social networks where known relationships provide the context & trust necessary to interpret feedback.
Keywords: 2.0, search, social, social networks, user content, web2.0
Discovering the staring people from social networks BIBAKFull-Text 1219-1220
  Dewei Chen; Jie Tang; Juanzi Li; Lizhu Zhou
In this paper, we study a novel problem of staring people discovery from social networks, which is concerned with finding people who are not only authoritative but also sociable in the social network. We formalize this problem as an optimization programming problem. Taking the co-author network as a case study, we define three objective functions and propose two methods to combine these objective functions. A genetic algorithm based method is further presented to solve this problem. Experimental results show that the proposed solution can effectively find the staring people from social networks.
Keywords: social network, staring people discovery
Analysis of community structure in Wikipedia BIBAKFull-Text 1221-1222
  Dmitry Lizorkin; Olena Medelyan; Maria Grineva
We present the results of a community detection analysis of the Wikipedia graph. Distinct communities in Wikipedia contain semantically closely related articles. The central topic of a community can be identified using PageRank. Extracted communities can be organized hierarchically similar to manually created Wikipedia category structure.
Keywords: community detection, graph analysis, wikipedia
Content hole search in community-type content BIBAKFull-Text 1223-1224
  Akiyo Nadamoto; Eiji Aramaki; Takeshi Abekawa; Yohei Murakami
In community-type content such as blogs and SNSs, we call the user's unawareness of information as a "content hole" and the search for this information as a "content hole search." A content hole search differs from similarity searching and has a variety of types. In this paper, we propose different types of content holes and define each type. We also propose an analysis of dialogue related to community-type content and introduce content hole search by using Wikipedia as an example.
Keywords: SNS, blog, community, content hole search
Searching for events in the blogosphere BIBAKFull-Text 1225-1226
  Manolis Platakis; Dimitrios Kotsakos; Dimitrios Gunopulos
Over the last few years, blogs (web logs) have gained massive popularity and have become one of the most influential web social media in our times. Every blog post in the Blogosphere has a well defined timestamp, which is not taken into account by search engines. By conducting research regarding this feature of the Blogosphere, we can attempt to discover bursty terms and correlations between them during a time interval. We apply Kleinberg's automaton on extracted titles of blog posts to discover bursty terms, we introduce a novel representation of a term's burstiness evolution called State Series and we employ a Euclidean-based distance metric to discover potential correlations between terms without taking into account their context. We evaluate the results trying to match them with real life events. Finally, we propose some ideas for further evaluation techniques and future research in the field.
Keywords: blogs, burst analysis, hot topics, information retrieval, keyword correlation, social media, text mining
Ranking community answers via analogical reasoning BIBAKFull-Text 1227-1228
  Xudong Tu; Xin-Jing Wang; Dan Feng; Lei Zhang
Due to the lexical gap between questions and answers, automatically detecting right answers becomes very challenging for community question-answering sites. In this paper, we propose an analogical reasoning-based method. It treats questions and answers as relational data and ranks an answer by measuring the analogy of its link to a query with the links embedded in previous relevant knowledge; the answer that links in the most analogous way to the new question is assumed to be the best answer. We based our experiments on 29.8 million Yahoo!Answer question-answer threads and showed the effectiveness of the approach.
Keywords: analogical reasoning, community question answering
Probabilistic question recommendation for question answering communities BIBAKFull-Text 1229-1230
  Mingcheng Qu; Guang Qiu; Xiaofei He; Cheng Zhang; Hao Wu; Jiajun Bu; Chun Chen
User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Keywords: PLSA, question answering, question recommendation
Buzz-based recommender system BIBAKFull-Text 1231-1232
  Nish Parikh; Neel Sundaresan
In this paper, we describe a buzz-based recommender system based on a large source of queries in an eCommerce application. The system detects bursts in query trends. These bursts are linked to external entities like news and inventory information to find the queries currently in-demand which we refer to as buzz queries. The system follows the paradigm of limited quantity merchandising, in the sense that on a per-day basis the system shows recommendations around a single buzz query with the intent of increasing user curiosity, and improving activity and stickiness on the site. A semantic neighborhood of the chosen buzz query is selected and appropriate recommendations are made on products that relate to this neighborhood.
Keywords: buzz, novelty, recommenders, serendipity
Discovering user profiles BIBAKFull-Text 1233-1234
  Riddhiman Ghosh; Mohamed Dekhil
In this paper we describe techniques for the discovery and construction of user profiles. Leveraging from the emergent data web, our system addresses the problem of sparseness of user profile information currently faced by both asserted and inferred profile systems. A profile mediator, that dynamically builds the most suitable user profile for a particular service or interaction in real-time, is employed in our prototype implementation.
Keywords: information management, personalization, semantic web, user profiles
Bootstrapped extraction of class attributes BIBAKFull-Text 1235-1236
  Joseph Reisinger; Marius Pasca
As an alternative to previous studies on extracting class attributes from unstructured text, which consider either Web documents or query logs as the source of textual data, A bootstrapped method extracts class attributes simultaneously from both sources, using a small set of seed attributes. The method improves extraction precision and also improves attribute relevance across 40 test classes.
Keywords: attribute extraction