| A dynamic bayesian network click model for web search ranking | | BIBAK | Full-Text | 1-10 | |
| Olivier Chapelle; Ya Zhang | |||
| As with any application of machine learning, web search ranking requires
labeled data. The labels usually come in the form of relevance assessments made
by editors. Click logs can also provide an important source of implicit
feedback and can be used as a cheap proxy for editorial labels. The main
difficulty, however, comes from the so-called position bias -- URLs appearing in
lower positions are less likely to be clicked even if they are relevant. In
this paper, we propose a Dynamic Bayesian Network which aims to provide an
unbiased estimate of relevance from the click logs. Experiments show
that the proposed click model outperforms other existing click models in
predicting both click-through rate and relevance. Keywords: click modeling, click-through rate, dynamic bayesian network, ranking, web
search | |||
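The paper's DBN is not reproduced here, but the core idea it builds on -- separating whether a user examined a result from whether the result is relevant -- can be illustrated with a much simpler position-based examination model. In the sketch below, the examination probabilities per rank and the toy click log are assumptions for illustration only, not the paper's model or data.

```python
from collections import defaultdict

# Hypothetical examination probabilities per rank: the chance a user even
# looks at a result shown at that position (assumed, not from the paper).
EXAM_PROB = {1: 1.00, 2: 0.80, 3: 0.60, 4: 0.45, 5: 0.35}

# Toy click log: (url, position_shown, was_clicked)
log = [
    ("u1", 1, True), ("u1", 1, False), ("u2", 3, True),
    ("u2", 3, False), ("u2", 3, False), ("u3", 5, True),
]

clicks = defaultdict(float)
examinations = defaultdict(float)
for url, pos, clicked in log:
    examinations[url] += EXAM_PROB[pos]   # expected number of examinations
    clicks[url] += 1.0 if clicked else 0.0

# Position-debiased relevance estimate: clicks per expected examination.
for url in examinations:
    print(url, clicks[url] / examinations[url])
```

Dividing by expected examinations rather than impressions is what removes the penalty for being shown low on the page; the DBN in the paper learns the examination behavior instead of assuming it.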
| Click chain model in web search | | BIBAK | Full-Text | 11-20 | |
| Fan Guo; Chao Liu; Anitha Kannan; Tom Minka; Michael Taylor; Yi-Min Wang; Christos Faloutsos | |||
| Given a terabyte click log, can we build an efficient and effective click
model? It is commonly believed that web search click logs are a gold mine for
search business, because they reflect users' preference over web documents
presented by the search engine. Click models provide a principled approach to
inferring user-perceived relevance of web documents, which can be leveraged in
numerous applications in search businesses. Due to the huge volume of click
data, scalability is a must.
We present the click chain model (CCM), which is based on a solid, Bayesian framework. It is both scalable and incremental, perfectly meeting the computational challenges imposed by the voluminous click logs that constantly grow. We conduct an extensive experimental study on a data set containing 8.8 million query sessions obtained in July 2008 from a commercial search engine. CCM consistently outperforms two state-of-the-art competitors in a number of metrics, with over 9.7% better log-likelihood, over 6.2% better click perplexity and much more robust (up to 30%) prediction of the first and the last clicked position. Keywords: bayesian models, click log analysis, web search | |||
| Spatio-temporal models for estimating click-through rate | | BIBAK | Full-Text | 21-30 | |
| Deepak Agarwal; Bee-Chung Chen; Pradheep Elango | |||
| We propose novel spatio-temporal models to estimate click-through rates in
the context of content recommendation. We track article CTR at a fixed location
over time through a dynamic Gamma-Poisson model and combine information from
correlated locations through dynamic linear regressions, significantly
improving on per-location models. Our models adjust for user fatigue through an
exponential tilt to the first-view CTR (probability of click on first article
exposure) that is based only on user-specific repeat-exposure features. We
illustrate our approach on data obtained from a module (Today Module) published
regularly on Yahoo! Front Page and demonstrate significant improvement over
commonly used baseline methods. Large scale simulation experiments to study the
performance of our models under different scenarios provide encouraging
results. Throughout, all modeling assumptions are validated via rigorous
exploratory data analysis. Keywords: content recommendation, ctr positional correlation | |||
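The dynamic Gamma-Poisson component can be pictured as a conjugate update with a forgetting factor: clicks in an interval are Poisson given the CTR, and the Gamma posterior is discounted before each update so old evidence decays. The prior, discount value, and click/view stream below are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch of dynamic Gamma-Poisson CTR tracking (assumed parameters).
# CTR ~ Gamma(alpha, beta); clicks in an interval ~ Poisson(CTR * views).
alpha, beta = 1.0, 1000.0      # prior pseudo-clicks / pseudo-views (assumption)
discount = 0.95                # forgetting factor so old data decays

observations = [(12, 10000), (30, 15000), (8, 9000)]  # (clicks, views) per interval

for clicks, views in observations:
    # Discount old evidence, then apply the conjugate Gamma-Poisson update.
    alpha = discount * alpha + clicks
    beta = discount * beta + views
    print(f"estimated CTR = {alpha / beta:.4f}")
```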
| Fast dynamic reranking in large graphs | | BIBAK | Full-Text | 31-40 | |
| Purnamrita Sarkar; Andrew W. Moore | |||
| In this paper we consider the problem of re-ranking search results by
incorporating user feedback. We present a graph theoretic measure for
discriminating irrelevant results from relevant results using a few labeled
examples provided by the user. The key intuition is that nodes that are
closer (in graph topology) to the relevant nodes than to the irrelevant nodes are
more likely to be relevant. We present a simple sampling algorithm to evaluate
this measure at specific nodes of interest, and an efficient branch and bound
algorithm to compute the top k nodes from the entire graph under this measure.
On quantifiable prediction tasks the introduced measure outperforms other
diffusion-based proximity measures which take only the positive relevance
feedback into account. On the Entity-Relation graph built from the authors and
papers of the entire DBLP citation corpus (1.4 million nodes and 2.2 million
edges) our branch and bound algorithm takes about 1.5 seconds to retrieve the
top 10 nodes w.r.t. this measure with 10 labeled nodes. Keywords: harmonic function, random walk, search, semi-supervised learning | |||
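To give a rough feel for the sampling idea, the sketch below estimates, by short truncated random walks, how often a node reaches a labeled relevant node before a labeled irrelevant one. The toy graph, labels, walk length, and sample count are all illustrative assumptions; this is not the paper's measure or its branch and bound algorithm.

```python
import random

# Toy undirected graph as adjacency lists (assumed example).
graph = {
    "a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
    "d": ["b", "c", "e"], "e": ["d"],
}
relevant, irrelevant = {"a"}, {"e"}

def hits_relevant_first(start, max_steps=5):
    """One truncated walk; True if a relevant node is hit before an irrelevant one."""
    node = start
    for _ in range(max_steps):
        node = random.choice(graph[node])
        if node in relevant:
            return True
        if node in irrelevant:
            return False
    return False  # walk truncated without hitting either label

def proximity_score(start, samples=2000):
    return sum(hits_relevant_first(start) for _ in range(samples)) / samples

print(proximity_score("b"), proximity_score("d"))
```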
| Estimating the impressionrank of web pages | | BIBAK | Full-Text | 41-50 | |
| Ziv Bar-Yossef; Maxim Gurevich | |||
| The ImpressionRank of a web page (or, more generally, of a web site) is the
number of times users viewed the page while browsing search results.
ImpressionRank captures the visibility of pages and sites in search engines and
is thus an important measure, which is of interest to web site owners,
competitors, market analysts, and end users.
All previous approaches to estimating the ImpressionRank of a page rely on privileged access to private data sources, like the search engine's query log. In this paper we present the first external algorithm for estimating the ImpressionRank of a web page. This algorithm relies on access to three public data sources: the search engine, the query suggestion service of the search engine, and the web. In addition, the algorithm is local and uses modest resources. It can therefore be used by almost any party to estimate the ImpressionRank of any page on any search engine. En route to estimating the ImpressionRank of a page, our algorithm solves a novel variant of the keyword extraction problem: it finds the most popular search keywords that drive impressions of a page. Empirical analysis of the algorithm on the Google and Yahoo! search engines indicates that it is accurate and provides interesting insights about sites and search queries. Keywords: auto-completions, data mining, estimation, impressionrank, popular keyword
extraction, search engines, suggestions | |||
| Learning to recognize reliable users and content in social media with coupled mutual reinforcement | | BIBAK | Full-Text | 51-60 | |
| Jiang Bian; Yandong Liu; Ding Zhou; Eugene Agichtein; Hongyuan Zha | |||
| Community Question Answering (CQA) has emerged as a popular forum for users
to pose questions for other users to answer. Over the last few years, CQA
portals such as Naver and Yahoo! Answers have exploded in popularity, and now
provide a viable alternative to general purpose Web search. At the same time,
the answers to past questions submitted in CQA sites comprise a valuable
knowledge repository which could be a gold mine for information retrieval and
automatic question answering. Unfortunately, the quality of the submitted
questions and answers varies widely -- to the extent that a large fraction of
the content is not usable for answering queries. Approaches for
retrieving relevant and high-quality content have been proposed, but they
require large amounts of manually labeled data, which limits the
applicability of these supervised approaches to new sites and domains. In this
paper we address this problem by developing a semi-supervised coupled mutual
reinforcement framework for simultaneously calculating content quality and user
reputation that requires relatively few labeled examples to initialize the
training process. Results of a large scale evaluation demonstrate that our
methods are more effective than previous approaches for finding high-quality
answers, questions, and users. More importantly, our quality estimation
significantly improves the accuracy of search over CQA archives over the
state-of-the-art methods. Keywords: authority and expertise in online communities, community question answering | |||
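The coupling between answer quality and answerer reputation can be pictured with a HITS-like alternation: reputation follows from the quality of a user's answers, and answer quality follows from its author's reputation, with a few labeled seeds pinned. The toy data and update rules below are a simplified illustration under those assumptions, not the paper's semi-supervised framework.

```python
# Toy data: which user wrote which answers, plus a few labeled quality seeds.
answers_by_user = {"u1": ["a1", "a2"], "u2": ["a3"], "u3": ["a4", "a5"]}
seed_quality = {"a1": 1.0, "a4": 0.2}   # small set of labeled examples (assumption)

quality = {a: seed_quality.get(a, 0.5)
           for ans in answers_by_user.values() for a in ans}
reputation = {u: 0.5 for u in answers_by_user}

for _ in range(20):
    # Reputation of a user <- average quality of their answers.
    for u, ans in answers_by_user.items():
        reputation[u] = sum(quality[a] for a in ans) / len(ans)
    # Quality of an answer <- author's reputation, with labeled seeds pinned.
    for u, ans in answers_by_user.items():
        for a in ans:
            quality[a] = seed_quality.get(a, reputation[u])

print(reputation)
print(quality)
```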
| Detecting the origin of text segments efficiently | | BIBAK | Full-Text | 61-70 | |
| Ossama Abdel Hamid; Behshad Behzadi; Stefan Christoph; Monika Henzinger | |||
| In the origin detection problem an algorithm is given a set S of documents,
ordered by creation time, and a query document D. It needs to output for every
consecutive sequence of k alphanumeric terms in D the earliest document in S
in which the sequence appeared (if such a document exists). Algorithms for the
origin detection problem can, for example, be used to detect the "origin" of
text segments in D and thus to detect novel content in D. They can also find
the document from which the author of D has copied the most (or show that D is
mostly original). We concentrate on solutions that use only a fixed amount of
memory. We propose novel algorithms for this problem and evaluate them together
with a large number of previously published algorithms. Our results show that
(1) detecting the origin of text segments efficiently can be done with very
high accuracy even when the space used is less than 1% of the size of the
documents in S, (2) the precision degrades smoothly with the amount of
available space, (3) various estimation techniques can be used to increase the
performance of the algorithms. Keywords: document overlap, shingling | |||
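One standard way to stay within a fixed memory budget for this kind of problem is to hash k-term shingles and keep only a sampled subset of them (for example, hashes divisible by some modulus), each mapped to the earliest document that contained it. The sketch below illustrates that general idea with assumed parameters and toy documents; it is not one of the paper's specific algorithms.

```python
import hashlib

K = 3          # shingle length in terms (assumed)
MOD = 2        # keep only ~1/MOD of the shingles to bound memory (assumed)

def shingles(text, k=K):
    terms = text.lower().split()
    return [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Documents in creation-time order.
corpus = [("doc1", "the quick brown fox jumps"),
          ("doc2", "a quick brown fox appears here")]

earliest = {}                       # sampled shingle hash -> earliest doc id
for doc_id, text in corpus:
    for sh in shingles(text):
        hv = h(sh)
        if hv % MOD == 0 and hv not in earliest:
            earliest[hv] = doc_id

# Query document: report the earliest known origin of each sampled shingle.
query = "the quick brown fox appears"
for sh in shingles(query):
    hv = h(sh)
    if hv % MOD == 0:
        print(sh, "->", earliest.get(hv, "novel"))
```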
| Enhancing diversity, coverage and balance for summarization through structure learning | | BIBAK | Full-Text | 71-80 | |
| Liangda Li; Ke Zhou; Gui-Rong Xue; Hongyuan Zha; Yong Yu | |||
| Document summarization plays an increasingly important role with the
exponential growth of documents on the Web. Many supervised and unsupervised
approaches have been proposed to generate summaries from documents. However,
these approaches seldom simultaneously consider summary diversity, coverage,
and balance issues which to a large extent determine the quality of summaries.
In this paper, we consider extract-based summarization emphasizing the
following three requirements: 1) diversity in summarization, which seeks to
reduce redundancy among sentences in the summary; 2) sufficient coverage, which
focuses on avoiding the loss of the document's main information when generating
the summary; and 3) balance, which demands that different aspects of the
document need to have about the same relative importance in the summary. We
formulate the extract-based summarization problem as learning a mapping from a
set of sentences of a given document to a subset of the sentences that
satisfies the above three requirements. The mapping is learned by incorporating
several constraints in a structure learning framework, and we explore the graph
structure of the output variables and employ structural SVM for solving the
resulting optimization problem. Experiments on the DUC2001 data sets demonstrate
significant performance improvements in terms of F1 and ROUGE metrics. Keywords: balance, coverage, diversity, structural svm, summarization | |||
| Efficient overlap and content reuse detection in blogs and online news articles | | BIBAK | Full-Text | 81-90 | |
| Jong Wook Kim; K. Selçuk Candan; Junichi Tatemura | |||
| The use of blogs to track and comment on real world (political, news,
entertainment) events is growing. Similarly, as more individuals start relying
on the Web as their primary information source and as more traditional media
outlets try reaching consumers through alternative venues, the number of news
sites on the Web is also continuously increasing. Content-reuse, whether in the
form of extensive quotations or content borrowing across media outlets, is very
common in blog and news entries tracking the same real-world event.
Knowledge about which web entries re-use content from which others can be an
effective asset when organizing these entries for presentation. On the other
hand, this knowledge is not cheap to acquire: considering the size of the
space of related web entries, it is essential that the techniques developed for
identifying re-use are fast and scalable. Furthermore, the dynamic nature of
blog and news entries necessitates incremental processing for reuse detection.
In this paper, we develop a novel qSign algorithm that efficiently and
effectively analyzes the blogosphere for quotation and reuse identification.
Experimental results show that with qSign, processing time gains from 10X to 100X
are possible while maintaining reuse detection rates of up to 90%. Furthermore,
processing time gains can be pushed multiple orders of magnitude (from 100X to
1000X) for 70% recall. Keywords: reuse detection, weblogs | |||
| Latent space domain transfer between high dimensional overlapping distributions | | BIBAK | Full-Text | 91-100 | |
| Sihong Xie; Wei Fan; Jing Peng; Olivier Verscheure; Jiangtao Ren | |||
Transferring knowledge from one domain to another is challenging for a
number of reasons. Since both the conditional and marginal distributions of the
training data and test data are non-identical, a model trained in one domain,
when directly applied to a different domain, is usually low in accuracy. For
many applications with large feature sets, such as text documents, sequence
data, medical data, and image data of different resolutions, the two domains
usually do not contain exactly the same features, thus introducing large
numbers of "missing values" when considered over the union of features from
both domains. In other words, the marginal distributions are at most
overlapping. At the same time, these problems are usually high dimensional,
with several thousands of features. Thus, the combination of high
dimensionality and missing values makes the relationship in conditional
probabilities between the two domains hard to measure and model. To address these
challenges, we propose a framework that first brings the marginal distributions
of the two domains closer by "filling up" the missing values of disjoint
features. Afterwards, it looks for comparable sub-structures in the
"latent space" as mapped from the expanded feature vector, where both the marginal
and conditional distributions are similar. With these sub-structures in latent
space, the proposed approach then finds common concepts that are transferable
across domains with high probability. During prediction, unlabeled instances
are treated as "queries", the most closely related labeled instances from the
out-domain are retrieved, and the classification is made by weighted voting
using the retrieved out-domain examples. We formally show that importing feature
values across domains and latent semantic indexing can jointly make the
distributions of the two related domains easier to measure than in the original
feature space, and that the nearest-neighbor method employed to retrieve related
out-domain examples has bounded error when predicting in-domain examples. Software and datasets are
available for download. Keywords: high dimensional, latent, missing value, text mining, transfer learning | |||
| StatSnowball: a statistical approach to extracting entity relationships | | BIBAK | Full-Text | 101-110 | |
| Jun Zhu; Zaiqing Nie; Xiaojiang Liu; Bo Zhang; Ji-Rong Wen | |||
| Traditional relation extraction methods require pre-specified relations and
relation-specific human-tagged examples. Bootstrapping systems significantly
reduce the number of training examples, but they usually apply heuristic-based
methods to combine a set of strict hard rules, which limit the ability to
generalize and thus yield low recall. Furthermore, existing bootstrapping
methods do not perform open information extraction (Open IE), which can
identify various types of relations without requiring pre-specifications. In
this paper, we propose a statistical extraction framework called Statistical
Snowball (StatSnowball), which is a bootstrapping system and can perform both
traditional relation extraction and Open IE.
StatSnowball uses discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimation sense. MLN is a general model, and can be configured to perform different levels of relation extraction. In StatSnowball, pattern selection is performed by solving an l1-norm penalized maximum likelihood estimation, which enjoys well-founded theories and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve a significantly higher recall without sacrificing the high precision during iterations with a small number of seeds, and the joint inference of MLN can improve the performance. Finally, StatSnowball is efficient and we have developed a working entity relation search engine called Renlifang based on it. Keywords: Markov logic networks, relationship extraction, statistical models | |||
| Matchbox: large scale online bayesian recommendations | | BIBAK | Full-Text | 111-120 | |
| David H. Stern; Ralf Herbrich; Thore Graepel | |||
| We present a probabilistic model for generating personalised recommendations
of items to users of a web service. The Matchbox system makes use of content
information in the form of user and item meta data in combination with
collaborative filtering information from previous user behavior in order to
predict the value of an item for a user. Users and items are represented by
feature vectors which are mapped into a low-dimensional 'trait space' in which
similarity is measured in terms of inner products. The model can be trained
from different types of feedback in order to learn user-item preferences. Here
we present three alternatives: direct observation of an absolute rating each
user gives to some items, observation of a binary preference (like/don't like)
and observation of a set of ordinal ratings on a user-specific scale. Efficient
inference is achieved by approximate message passing involving a combination of
Expectation Propagation (EP) and Variational Message Passing. We also include a
dynamics model which allows an item's popularity, a user's taste or a user's
personal rating scale to drift over time. By using Assumed-Density Filtering
(ADF) for training, the model requires only a single pass through the training
data. This is an on-line learning algorithm capable of incrementally taking
account of new data so the system can immediately reflect the latest user
preferences. We evaluate the performance of the algorithm on the MovieLens and
Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings
respectively. This demonstrates that training the model using the on-line ADF
approach yields state-of-the-art performance with the option of improving
performance further if computational resources are available by performing
multiple EP passes over the training data. Keywords: advertising, bayesian inference, collaborative filtering, machine learning,
online services, recommender system | |||
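The trait-space idea -- map user and item feature vectors into a low-dimensional space and score by inner product -- can be written in a few lines. The dimensions and randomly initialized matrices below are illustrative assumptions, and the Bayesian message-passing inference and dynamics model are not shown; this is only a sketch of the scoring form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_feats, n_item_feats, k = 6, 8, 3   # k = trait-space dimensionality (assumed)

# Linear maps from (sparse) feature vectors into the shared k-dimensional trait space.
U = rng.normal(scale=0.1, size=(k, n_user_feats))
V = rng.normal(scale=0.1, size=(k, n_item_feats))

def predict(user_x, item_y):
    """Predicted affinity: inner product of the user and item trait vectors."""
    s = U @ user_x          # user trait vector
    t = V @ item_y          # item trait vector
    return float(s @ t)

user_x = np.zeros(n_user_feats); user_x[[0, 3]] = 1.0   # e.g. user-id indicator + a metadata flag
item_y = np.zeros(n_item_feats); item_y[[2, 5]] = 1.0   # e.g. item-id indicator + a metadata flag
print(predict(user_x, item_y))
```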
| Learning consensus opinion: mining data from a labeling game | | BIBAK | Full-Text | 121-130 | |
| Paul N. Bennett; David Maxwell Chickering; Anton Mityagin | |||
| We consider the problem of identifying the consensus ranking for the results
of a query, given preferences among those results from a set of individual
users. Once consensus rankings are identified for a set of queries, these
rankings can serve for both evaluation and training of retrieval and learning
systems. We present a novel approach to collecting the individual user
preferences over image-search results: we use a collaborative game in which
players are rewarded for agreeing on which image result is best for a query.
Our approach is distinct from other labeling games because we are able to
elicit directly the preferences of interest with respect to image queries
extracted from query logs. As a source of relevance judgments, this data
provides a useful complement to click data. Furthermore, the data is free of
positional biases and is collected by the game without the risk of frustrating
users with non-relevant results; this risk is prevalent in standard mechanisms
for debiasing clicks. We describe data collected over 34 days from a deployed
version of this game that amounts to about 18 million expressed preferences
between pairs. Finally, we present several approaches to modeling this data in
order to extract the consensus rankings from the preferences and better sort
the search results for targeted queries. Keywords: learning preferences, preference judgments | |||
| Rated aspect summarization of short comments | | BIBAK | Full-Text | 131-140 | |
| Yue Lu; ChengXiang Zhai; Neel Sundaresan | |||
| Web 2.0 technologies have enabled more and more people to freely comment on
different kinds of entities (e.g. sellers, products, services). The large scale
of information poses the need and challenge of automatic summarization. In many
cases, each of the user-generated short comments comes with an overall rating.
In this paper, we study the problem of generating a "rated aspect summary" of
short comments, which is a decomposed view of the overall ratings for the major
aspects so that a user could gain different perspectives towards the target
entity. We formally define the problem and decompose the solution into three
steps. We demonstrate the effectiveness of our methods by using eBay sellers'
feedback comments. We also quantitatively evaluate each step of our methods and
study how well humans agree on such a summarization task. The proposed methods
are quite general and can be used to generate rated aspect summary
automatically given any collection of short comments each associated with an
overall rating. Keywords: rated aspect summarization, rating prediction, short comments | |||
| How opinions are received by online communities: a case study on amazon.com helpfulness votes | | BIBAK | Full-Text | 141-150 | |
| Cristian Danescu-Niculescu-Mizil; Gueorgi Kossinets; Jon Kleinberg; Lillian Lee | |||
| There are many on-line settings in which users publicly express opinions. A
number of these offer mechanisms for other users to evaluate these opinions; a
canonical example is Amazon.com, where reviews come with annotations like "26
of 32 people found the following review helpful." Opinion evaluation appears in
many off-line settings as well, including market research and political
campaigns. Reasoning about the evaluation of an opinion is fundamentally
different from reasoning about the opinion itself: rather than asking, "What
did Y think of X?", we are asking, "What did Z think of Y's opinion of X?" Here
we develop a framework for analyzing and modeling opinion evaluation, using a
large-scale collection of Amazon book reviews as a dataset. We find that the
perceived helpfulness of a review depends not just on its content but also,
in subtle ways, on how the expressed evaluation relates to other
evaluations of the same product. As part of our approach, we develop novel
methods that take advantage of the phenomenon of review "plagiarism" to control
for the effects of text in opinion evaluation, and we provide a simple and
natural mathematical model consistent with our findings. Our analysis also
allows us to distinguish among the predictions of competing theories from
sociology and social psychology, and to discover unexpected differences in the
collective opinion-evaluation behavior of user populations from different
countries. Keywords: online communities, opinion mining, plagiarism, review helpfulness, review
utility, sentiment analysis, social influence | |||
| Exploiting web search to generate synonyms for entities | | BIBAK | Full-Text | 151-160 | |
| Surajit Chaudhuri; Venkatesh Ganti; Dong Xin | |||
Tasks recognizing named entities such as products, people's names, or
locations from documents have recently received significant attention in the
literature. Many solutions to these tasks assume the existence of reference
entity tables. An important challenge that needs to be addressed in the entity
extraction task is that of ascertaining whether or not a candidate string
approximately matches a named entity in a given reference table.
Prior approaches have relied on string-based similarity functions, which only compare a candidate string with the entity it matches. In this paper, we exploit web search engines in order to define new similarity functions. We then develop efficient techniques to facilitate approximate matching in the context of our proposed similarity functions. In an extensive experimental evaluation, we demonstrate the accuracy and efficiency of our techniques. Keywords: entity extraction, similarity measure, synonym generation, web search | |||
| Smart Miner: a new framework for mining large scale web usage data | | BIBAK | Full-Text | 161-170 | |
| Murat Ali Bayir; Ismail Hakki Toroslu; Ahmet Cosar; Guven Fidan | |||
In this paper, we propose a novel framework called Smart-Miner for the web usage
mining problem, which uses link information to produce accurate user sessions
and frequent navigation patterns. Unlike the simple session concepts in
time- and navigation-based approaches, where sessions are sequences of web pages
requested from the server or viewed in the browser, Smart Miner sessions are
sets of paths traversed in the web graph that correspond to users' navigation
among web pages. We have modeled session construction as a new graph problem
and utilized a new algorithm, Smart-SRA, to solve this problem efficiently. For
the pattern discovery phase, we have developed an efficient version of the
Apriori-All technique which uses the structure of web graph to increase the
performance. From the experiments that we have performed on both real and
simulated data, we have observed that Smart-Miner produces at least 30% more
accurate web usage patterns than other approaches including previous session
construction methods. We have also studied the effect of having the referrer
information in the web server logs to show that different versions of Smart-SRA
produce similar results. Another contribution is a distributed version of the
Smart Miner framework implemented using the Map/Reduce paradigm. We conclude
that our scalable framework can efficiently process terabytes of web server
logs belonging to multiple web sites. Keywords: graph mining, map/reduce, parallel data mining, web usage mining, web user
modeling | |||
| Releasing search queries and clicks privately | | BIBAK | Full-Text | 171-180 | |
| Aleksandra Korolova; Krishnaram Kenthapadi; Nina Mishra; Alexandros Ntoulas | |||
| The question of how to publish an anonymized search log was brought to the
forefront by a well-intentioned, but privacy-unaware AOL search log release.
Since then a series of ad-hoc techniques have been proposed in the literature,
though none are known to be provably private. In this paper, we take a major
step towards a solution: we show how queries, clicks and their associated
perturbed counts can be published in a manner that rigorously preserves
privacy. Our algorithm is decidedly simple to state, but non-trivial to
analyze. On the opposite side of privacy is the question of whether the data we
can safely publish is of any use. Our findings offer a glimmer of hope: we
demonstrate that a non-negligible fraction of queries and clicks can indeed be
safely published via a collection of experiments on a real search log. In
addition, we select an application, keyword generation, and show that the
keyword suggestions generated from the perturbed data resemble those generated
from the original data. Keywords: data release, differential privacy, query click graph, search logs | |||
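The flavor of releasing perturbed counts can be seen in a few lines: add Laplace noise to each query's count and publish only those whose noisy count clears a threshold, so rare (potentially identifying) queries are suppressed. The epsilon, threshold, noise scale, and counts below are illustrative assumptions, not the paper's calibrated parameters or its full algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
epsilon = 1.0          # privacy budget for this release (assumed)
threshold = 50         # only publish queries with large noisy counts (assumed)

query_counts = {"weather": 1200, "rare disease xyz": 3, "news": 800}

released = {}
for query, count in query_counts.items():
    # Laplace noise with scale 1/epsilon (assumes each user contributes at most one count).
    noisy = count + rng.laplace(scale=1.0 / epsilon)
    if noisy >= threshold:          # low-count queries are dropped entirely
        released[query] = round(noisy)

print(released)
```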
| Incorporating site-level knowledge to extract structured data from web forums | | BIBAK | Full-Text | 181-190 | |
| Jiang-Ming Yang; Rui Cai; Yida Wang; Jun Zhu; Lei Zhang; Wei-Ying Ma | |||
| Web forums have become an important data resource for many web applications,
but extracting structured data from unstructured web forum pages is still a
challenging task due to both complex page layout designs and unrestricted user
created posts. In this paper, we study the problem of structured data
extraction from various web forum sites. Our target is to find a solution as
general as possible to extract structured data, such as post title, post
author, post time, and post content from any forum site. In contrast to most
existing information extraction methods, which only leverage the knowledge
inside an individual page, we incorporate both page-level and site-level
knowledge and employ Markov logic networks (MLNs) to effectively integrate all
useful evidence by learning their importance automatically. Site-level
knowledge includes (1) the linkages among different object pages, such as list
pages and post pages, and (2) the interrelationships of pages belonging to the
same object. The experimental results on 20 forums show a very encouraging
information extraction performance, and demonstrate the ability of the proposed
approach on various forums. We also show that the performance is limited if
only page-level knowledge is used, whereas incorporating site-level
knowledge significantly improves both precision and recall. Keywords: Markov logic networks (MLNs), information extraction, site-level knowledge,
structured data, web forums | |||
| Towards context-aware search by learning a very large variable length hidden markov model from search logs | | BIBAK | Full-Text | 191-200 | |
| Huanhuan Cao; Daxin Jiang; Jian Pei; Enhong Chen; Hang Li | |||
| Capturing the context of a user's query from the previous queries and clicks
in the same session may help understand the user's information need. A
context-aware approach to document re-ranking, query suggestion, and URL
recommendation may improve users' search experience substantially. In this
paper, we propose a general approach to context-aware search. To capture
contexts of queries, we learn a variable length Hidden Markov Model (vlHMM)
from search sessions extracted from log data. Although the mathematical model
is intuitive, how to learn a large vlHMM with millions of states from hundreds
of millions of search sessions poses a grand challenge. We develop a strategy
for parameter initialization in vlHMM learning which can greatly reduce the
number of parameters to be estimated in practice. We also devise a method for
distributed vlHMM learning under the map-reduce model. We test our approach on
a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840
million search sessions, and evaluate the effectiveness of the vlHMM learned
from the real data on three search applications: document re-ranking, query
suggestion, and URL recommendation. The experimental results show that our
approach is both effective and efficient. Keywords: context-aware search, variable length hidden Markov model | |||
| A class-feature-centroid classifier for text categorization | | BIBAK | Full-Text | 201-210 | |
| Hu Guan; Jingyu Zhou; Minyi Guo | |||
| Automated text categorization is an important technique for many web
applications, such as document indexing, document filtering, and cataloging web
resources. Many different approaches have been proposed for the automated text
categorization problem. Among them, centroid-based approaches have the
advantages of short training and testing time due to their computational
efficiency. As a result, centroid-based classifiers have been widely used in
many web applications. However, the accuracy of centroid-based classifiers is
inferior to SVM, mainly because centroids found during construction are far
from perfect locations.
We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: inter-class term index and inner-class term index. CFC proposes a novel combination of these indices and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and 20-newsgroup email collection show that CFC consistently outperforms the state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. Particularly, CFC is more effective and robust than SVM when data is sparse. Keywords: centroid, denormalized cosine measure, inner-class, inter-class, text
classification | |||
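A bare-bones centroid classifier with a denormalized cosine score looks like the following. The class centroids here are plain averages of term-frequency vectors rather than the paper's inter-class/inner-class construction, and the data is a toy assumption; the point is only how the denormalized similarity is computed.

```python
import numpy as np

def build_centroids(X, y):
    """One centroid per class: the mean of that class's document vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(doc, centroids):
    # Denormalized cosine: normalize the document vector but NOT the centroid,
    # so the magnitude of a class profile still influences the score.
    d = doc / (np.linalg.norm(doc) + 1e-12)
    return max(centroids, key=lambda c: float(d @ centroids[c]))

# Toy term-frequency vectors for two classes.
X = np.array([[3, 0, 1], [4, 1, 0], [0, 2, 5], [1, 3, 4]], dtype=float)
y = np.array(["sports", "sports", "tech", "tech"])

centroids = build_centroids(X, y)
print(classify(np.array([0.0, 1.0, 3.0]), centroids))   # -> "tech"
```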
| Large scale multi-label classification via metalabeler | | BIBAK | Full-Text | 211-220 | |
| Lei Tang; Suju Rajan; Vijay K. Narayanan | |||
| The explosion of online content has made the management of such content
non-trivial. Web-related tasks such as web page categorization, news filtering,
query categorization, tag recommendation, etc. often involve the construction
of multi-label categorization systems on a large scale. Existing multi-label
classification methods either do not scale or have unsatisfactory performance.
In this work, we propose MetaLabeler to automatically determine the relevant
set of labels for each instance without intensive human involvement or
expensive cross-validation. Extensive experiments conducted on benchmark data
show that the MetaLabeler tends to outperform existing methods. Moreover,
MetaLabeler scales to millions of multi-labeled instances and can be deployed
easily. This enables us to apply the MetaLabeler to a large scale query
categorization problem in Yahoo!, yielding a significant improvement in
performance. Keywords: hierarchical classification, large scale, meta model, metalabeler,
multi-label classification, query categorization | |||
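The MetaLabeler idea -- score labels with a one-vs-rest model and let a separate meta-model predict how many labels to keep per instance -- can be sketched with scikit-learn. The toy data and the choice of base learners below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data: 4 features, 3 possible labels (binary indicator matrix).
X = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1],
              [1, 0, 0, 0], [1, 1, 0, 1]], dtype=float)
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]])

# Base model: one-vs-rest scorer over all labels.
base = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Meta model: predict the NUMBER of labels each instance should receive.
meta = LinearRegression().fit(X, Y.sum(axis=1))

def predict_labels(x):
    scores = base.decision_function(x.reshape(1, -1))[0]
    k = max(1, int(round(float(meta.predict(x.reshape(1, -1))[0]))))
    return np.argsort(scores)[::-1][:k]          # indices of the top-k labels

print(predict_labels(np.array([1.0, 0.0, 1.0, 0.0])))
```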
| Hybrid keyword search auctions | | BIBAK | Full-Text | 221-230 | |
| Ashish Goel; Kamesh Munagala | |||
| Search auctions have become a dominant source of revenue generation on the
Internet. Such auctions have typically used per-click bidding and pricing. We
propose the use of hybrid auctions where an advertiser can make a
per-impression as well as a per-click bid, and the auctioneer then chooses one
of the two as the pricing mechanism. We assume that the advertiser and the
auctioneer both have separate beliefs (called priors) on the click-probability
of an advertisement. We first prove that the hybrid auction is truthful,
assuming that the advertisers are risk-neutral. We then show that this auction
is superior to the existing per-click auction in multiple ways: We show that
risk-seeking advertisers will choose only a per-impression bid whereas
risk-averse advertisers will choose only a per-click bid, and argue that both
kinds of advertisers arise naturally. Hence, the ability to bid in a hybrid
fashion is important to account for the risk characteristics of the
advertisers. For obscure keywords, the auctioneer is unlikely to have a very
sharp prior on the click-probabilities. In such situations, we show that having
the extra information from the advertisers in the form of a per-impression bid
can result in significantly higher revenue. An advertiser who believes that its
click-probability is much higher than the auctioneer's estimate can use
per-impression bids to correct the auctioneer's prior without incurring any
extra cost. The hybrid auction can allow the advertiser and auctioneer to
implement complex dynamic programming strategies to deal with the uncertainty
in the click-probability using the same basic auction. The per-click and
per-impression bidding schemes can only be used to implement two extreme cases
of these strategies. As Internet commerce matures, we need more sophisticated
pricing models to exploit all the information held by each of the participants.
We believe that hybrid auctions could be an important step in this direction.
The hybrid auction easily extends to multiple slots, and is also applicable to
scenarios where the hybrid bidding is per-impression and per-action (i.e. CPM
and CPA), or per-click and per-action (i.e. CPC and CPA). Keywords: internet, keyword auctions, mechanism design | |||
| Bid optimization for broad match ad auctions | | BIBAK | Full-Text | 231-240 | |
| Eyal Even Dar; Vahab S. Mirrokni; S. Muthukrishnan; Yishay Mansour; Uri Nadav | |||
| Ad auctions in sponsored search support "broad match" that allows an
advertiser to target a large number of queries while bidding only on a limited
number. While giving more expressiveness to advertisers, this feature makes it
challenging to optimize bids to maximize their returns: choosing to bid on a
query as a broad match because it provides high profit results in bidding
for related queries which may yield low or even negative profits.
We abstract and study the complexity of the bid optimization problem, which is to determine an advertiser's bids on a subset of keywords (possibly using broad match) so that her profit is maximized. In the query language model, when the advertiser is allowed to bid on all queries as broad match, we present a linear programming (LP)-based polynomial-time algorithm that obtains the optimal profit. In the model in which an advertiser can only bid on keywords, i.e., a subset of keywords as an exact or broad match, we show that this problem is not approximable within any reasonable approximation factor unless P=NP. To deal with this hardness result, we present a constant-factor approximation when the optimal profit significantly exceeds the cost. This algorithm is based on rounding a natural LP formulation of the problem. Finally, we study a budgeted variant of the problem, and show that in the query language model, one can find two budget-constrained ad campaigns in polynomial time that implement the optimal bidding strategy. Our results are the first to address bid optimization under the broad match feature, which is common in ad auctions. Keywords: ad auctions, bid optimization, optimal bidding, sponsored search | |||
| General auction mechanism for search advertising | | BIBAK | Full-Text | 241-250 | |
| Gagan Aggarwal; S. Muthukrishnan; Dávid Pál; Martin Pál | |||
In sponsored search, a number of advertising slots are available on a search
results page and have to be allocated among a set of advertisers competing to
display an ad on the page. This gives rise to a bipartite matching market that
is typically cleared by the way of an automated auction. Several auction
mechanisms have been proposed, with variants of the Generalized Second Price
(GSP) being widely used in practice. There is a rich body of work on bipartite
matching markets that builds upon the stable marriage model of Gale and Shapley
and the assignment model of Shapley and Shubik. This line of research offers
deep insights into the structure of stable outcomes in such markets and their
incentive properties. In this paper, we model advertising auctions in terms of
an assignment model with linear utilities, extended with bidder and item
specific maximum and minimum prices. Auction mechanisms like the commonly used
GSP or the well-known Vickrey-Clarke-Groves (VCG) can be interpreted as simply
computing a bidder-optimal stable matching in this model, for a suitably
defined set of bidder preferences, but our model includes much richer bidders
and preferences. We prove that in our model the existence of a stable matching
is guaranteed, and under a non-degeneracy assumption a bidder-optimal stable
matching exists as well. We give an algorithm to find such a matching in
polynomial time, and use it to design a truthful mechanism that generalizes GSP,
is truthful for profit-maximizing bidders, correctly implements features like
bidder-specific minimum prices and position-specific bids, and works for rich
mixtures of bidders and preferences. Our main technical contributions are the
existence of bidder-optimal matchings and strategyproofness of the resulting
mechanism, and are proved by induction on the progress of the matching
algorithm. Keywords: game theory, sponsored search auctions, stable matchings | |||
| Adaptive bidding for display advertising | | BIBAK | Full-Text | 251-260 | |
| Arpita Ghosh; Benjamin I. P. Rubinstein; Sergei Vassilvitskii; Martin Zinkevich | |||
| Motivated by the emergence of auction-based marketplaces for display ads
such as the Right Media Exchange, we study the design of a bidding agent that
implements a display advertising campaign by bidding in such a marketplace. The
bidding agent must acquire a given number of impressions with a given target
spend, when the highest external bid in the marketplace is drawn from an
unknown distribution P. The quantity and spend constraints arise from the fact
that display ads are usually sold on a CPM basis. We consider both the full
information setting, where the winning price in each auction is announced
publicly, and the partially observable setting where only the winner obtains
information about the distribution; these differ in the penalty incurred by the
agent while attempting to learn the distribution. We provide algorithms for
both settings, and prove performance guarantees using bounds on uniform
closeness from statistics, and techniques from online learning. We
experimentally evaluate these algorithms: both algorithms perform very well
with respect to both target quantity and spend; further, our algorithm for the
partially observable case performs nearly as well as that for the fully
observable setting despite the higher penalty incurred during learning. Keywords: adaptive bidding, concentration bounds, display advertising, guaranteed
delivery, guess-then-double algorithms | |||
| How much can behavioral targeting help online advertising? | | BIBAK | Full-Text | 261-270 | |
| Jun Yan; Ning Liu; Gang Wang; Wen Zhang; Yun Jiang; Zheng Chen | |||
| Behavioral Targeting (BT) is a technique used by online advertisers to
increase the effectiveness of their campaigns, and is playing an increasingly
important role in the online advertising market. However, it is underexplored
in academia when looking at how much BT can truly help online advertising in
commercial search engines. To answer this question, in this paper we provide an
empirical study on the click-through log of advertisements collected from a
commercial search engine. From the comprehensively experiment results on the
sponsored search log of the commercial search engine over a period of seven
days, we can draw three important conclusions: (1) Users who clicked the same
ad will truly have similar behaviors on the Web; (2) Click-Through Rate (CTR)
of an ad can be averagely improved as high as 670% by properly segmenting users
for behavioral targeted advertising in a sponsored search; (3) Using the short
term user behaviors to represent users is more effective than using the long
term user behaviors for BT. The statistical t-test verifies that all
conclusions drawn in the paper are statistically significant. To the best of
our knowledge, this work is the first empirical study for BT on the
click-through log of real-world ads. Keywords: behavioral targeting (bt), click-through rate (ctr), online advertising,
user segmentation | |||
| Web service derivatives | | BIBAK | Full-Text | 271-280 | |
| Thomas Meinl; Benjamin Blau | |||
| Web service development and usage has shifted from simple information
processing services to high-value business services that are crucial to
productivity and success. In order to deal with an increasing risk of
unavailability or failure of mission-critical Web services we argue the need
for advanced reservation of services in the form of derivatives.
The contribution of this paper is twofold: First, we provide an abstract model of a market design that enables the trade of derivatives for mission-critical Web services. Our model satisfies requirements that result from service characteristics such as intangibility and the impossibility of inventorying services in order to meet fluctuating demand. It incorporates principles from models of incomplete markets, such as the absence of a tradeable underlying and consistent arbitrage-free derivative pricing. Furthermore, we provide an architecture for a Web service market that implements our model and describes the strategy space and interaction of market participants in the trading process of service derivatives. We compare the underlying pricing processes to existing derivative models in energy exchanges, discuss possible shortcomings, and apply wavelets to analyze actual data and extract long- and short-term trends. Keywords: derivatives, incomplete markets, services mashups, wavelets, web services | |||
| Efficient application placement in a dynamic hosting platform | | BIBAK | Full-Text | 281-290 | |
| Zakaria Al-Qudah; Hussein A. Alzoubi; Mark Allman; Michael Rabinovich; Vincenzo Liberatore | |||
| Web hosting providers are increasingly looking into dynamic hosting to
reduce costs and improve the performance of their platforms. Instead of
provisioning fixed resources to each customer, dynamic hosting maintains a
variable number of application instances to satisfy current demand. While
existing research in this area has mostly focused on the algorithms that decide
on the number and location of application instances, we address the problem of
efficient enactment of these decisions once they are made. We propose a new
approach to application placement and experimentally show that it dramatically
reduces the cost of application placement, which in turn improves the
end-to-end agility of the hosting platform in reacting to demand changes. Keywords: application servers, dynamic placement, startup performance, web hosting | |||
| Network-aware forward caching | | BIBAK | Full-Text | 291-300 | |
| Jeffrey Erman; Alexandre Gerber; Mohammad T. Hajiaghayi; Dan Pei; Oliver Spatscheck | |||
| This paper proposes and evaluates a Network Aware Forward Caching approach
for determining the optimal deployment strategy of forward caches to a network.
A key advantage of this approach is that we can reduce the network costs
associated with forward caching to maximize the benefit obtained from their
deployment. We show in our simulation that a 37% increase in net benefits could
be achieved over the standard method of full cache deployment to cache all POPs'
traffic. In addition, we show that this maximal point occurs when only 68% of
the total traffic is cached.
Another contribution of this paper is the analysis we use to motivate and evaluate this problem. We characterize the Internet traffic of 100K subscribers of a US residential broadband provider. We use both layer 4 and layer 7 analysis to investigate the traffic volumes of the flows as well as study the general characteristics of the applications used. We show that HTTP is a dominant protocol and accounts for 68% of the total downstream traffic and that 34% of that traffic is multimedia. In addition, we show that multimedia content using HTTP exhibits an 83% annualized growth rate and other HTTP traffic has a 53% growth rate versus the 26% overall annual growth rate of broadband traffic. This shows that HTTP traffic will become ever more dominant and increase the potential caching opportunities. Furthermore, we characterize the core backbone traffic of this broadband provider to measure the distance travelled by content and traffic. We find that CDN traffic is much more efficient than P2P content and that there is large skew in the Air Miles between POPs in a typical network. Our findings show that there are many opportunities in broadband provider networks to optimize how traffic is delivered and cached. Keywords: web caching | |||
| Anycast-aware transport for content delivery networks | | BIBAK | Full-Text | 301-310 | |
| Zakaria Al-Qudah; Seungjoon Lee; Michael Rabinovich; Oliver Spatscheck; Jacobus Van der Merwe | |||
| Anycast-based content delivery networks (CDNs) have many properties that
make them ideal for the large scale distribution of content on the Internet.
However, because routing changes can result in a change of the endpoint that
terminates the TCP session, TCP session disruption remains a concern for
anycast CDNs, especially for large file downloads. In this paper we demonstrate
that this problem does not require any complex solutions. In particular, we
present the design of a simple, yet efficient, mechanism to handle session
disruptions due to endpoint changes. With our mechanism, a client can continue
the download of the content from the point at which it was before the endpoint
change. Furthermore, CDN servers purge the TCP connection state quickly to
handle frequent switching with low system overhead.
We demonstrate experimentally the effectiveness of our proposed mechanism and show that more complex mechanisms are not required. Specifically, we find that our mechanism maintains high download throughput even with a reasonably high rate of endpoint switching, which is attractive for load balancing scenarios. Moreover, our results show that edge servers can purge TCP connection state after a single timeout-triggered retransmission without any tangible impact on ongoing connections. Besides improving server performance, this behavior improves the resiliency of the CDN to certain denial of service attacks. Keywords: anycast, connection disruption, content delivery networks | |||
| Less talk, more rock: automated organization of community-contributed collections of concert videos | | BIBAK | Full-Text | 311-320 | |
| Lyndon Kennedy; Mor Naaman | |||
| We describe a system for synchronization and organization of
user-contributed content from live music events. We start with a set of short
video clips taken at a single event by multiple contributors, who were using a
varied set of capture devices. Using audio fingerprints, we synchronize these
clips such that overlapping clips can be displayed simultaneously. Furthermore,
we use the timing and link structure generated by the synchronization algorithm
to improve the findability and representation of the event content, including
identifying key moments of interest and descriptive text for important captured
segments of the show. We also identify the preferred audio track when multiple
clips overlap. We thus create a much improved representation of the event that
builds on the automatic content match. Our work demonstrates important
principles in the use of content analysis techniques for social media content
on the Web, and applies those principles in the domain of live music capture. Keywords: audio fingerprinting, social media, synchronization, video | |||
| A generalised cross-modal clustering method applied to multimedia news semantic indexing and retrieval | | BIBAK | Full-Text | 321-330 | |
| Alberto Messina; Maurizio Montagnuolo | |||
| Current Web technology has enabled the distribution of informative content
through dynamic media platforms.
In addition, the availability of the same content in the form of digital multimedia data has dramatically increased. Content-based, cross-media retrieval applications are needed to efficiently access desired information from this variety of data sources. This paper presents a novel approach for cross-media information aggregation, and describes a prototype system implementing this approach. The prototype adopts online newspaper articles and TV newscasts as information sources, to deliver a service made up of items including both contributions. Extensive experiments prove the effectiveness of the proposed approach in a real-world business context. Keywords: cross-modal clustering, multimedia mashup, news retrieval | |||
| What makes conversations interesting?: themes, participants and consequences of conversations in online social media | | BIBAK | Full-Text | 331-340 | |
| Munmun De Choudhury; Hari Sundaram; Ajita John; Dorée Duncan Seligmann | |||
| Rich media social networks promote not only creation and consumption of
media, but also communication about the posted media item. What makes a
conversation interesting enough to prompt a user to participate in the
discussion on a posted video? We conjecture that people participate in
conversations when they find the conversation theme interesting, see comments
by people whom they are familiar with, or observe an engaging dialogue between
two or more people (absorbing back and forth exchange of comments).
Importantly, a conversation that is interesting must be consequential -- i.e.
it must impact the social network itself.
Our framework has three parts: characterizing themes, characterizing participants for determining interestingness and measures of consequences of a conversation deemed to be interesting. First, we detect conversational themes using a mixture model approach. Second, we determine interestingness of participants and interestingness of conversations based on a random walk model. Third, we measure the consequence of a conversation by measuring how interestingness affects the following three variables -- participation in related themes, participant cohesiveness and theme diffusion. We have conducted extensive experiments using dataset from the popular video sharing site, YouTube. Our results show that our method of interestingness maximizes the mutual information, and is significantly better (twice as large) than three other baseline methods (number of comments, number of new participants and PageRank based assessment). Keywords: conversations, interestingness, social media, themes, youtube | |||
| Visual diversification of image search results | | BIBAK | Full-Text | 341-350 | |
| Reinier H. van Leuken; Lluis Garcia; Ximena Olivares; Roelof van Zwol | |||
| Due to the reliance on the textual information associated with an image,
image search engines on the Web lack the discriminative power to deliver
visually diverse search results. The textual descriptions are key to retrieve
relevant results for a given user query, but at the same time provide little
information about the rich image content.
In this paper we investigate three methods for visual diversification of image search results. The methods deploy lightweight clustering techniques in combination with a dynamic weighting function of the visual features, to best capture the discriminative aspects of the resulting set of images that is retrieved. A representative image is selected from each cluster, which together form a diverse result set. Based on a performance evaluation we find that the outcome of the methods closely resembles human perception of diversity, which was established in an extensive clustering experiment carried out by human assessors. Keywords: flickr, image clustering, visual diversity | |||
| Tag ranking | | BIBAK | Full-Text | 351-360 | |
| Dong Liu; Xian-Sheng Hua; Linjun Yang; Meng Wang; Hong-Jiang Zhang | |||
| Social media sharing web sites like Flickr allow users to annotate images
with free tags, which significantly facilitate Web image search and
organization. However, the tags associated with an image generally are in a
random order without any importance or relevance information, which limits the
effectiveness of these tags in search and other applications. In this paper, we
propose a tag ranking scheme, aiming to automatically rank the tags associated
with a given image according to their relevance to the image content. We first
estimate initial relevance scores for the tags based on probability density
estimation, and then perform a random walk over a tag similarity graph to
refine the relevance scores. Experimental results on a 50,000 Flickr photo
collection show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking to three applications: (1) tag-based image search, (2) tag recommendation, and (3) group recommendation, which demonstrates that the proposed tag ranking approach boosts the performance of social-tagging-related applications. Keywords: flickr, random walk, recommendation, search, tag ranking | |||
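The random-walk refinement step can be sketched as iterating scores over a row-normalized tag similarity matrix with a restart on the initial relevance estimates. The similarity values, damping factor, and initial scores below are toy assumptions; the paper estimates these from image content and probability density estimation.

```python
import numpy as np

tags = ["beach", "sea", "sunset", "dog"]
# Assumed pairwise tag similarities (symmetric, zero diagonal).
S = np.array([[0.0, 0.9, 0.6, 0.1],
              [0.9, 0.0, 0.5, 0.1],
              [0.6, 0.5, 0.0, 0.2],
              [0.1, 0.1, 0.2, 0.0]])
P = S / S.sum(axis=1, keepdims=True)        # row-normalized transition matrix

v = np.array([0.5, 0.3, 0.15, 0.05])        # initial relevance estimates (assumed)
alpha = 0.85                                # weight on the graph walk vs. the prior

r = v.copy()
for _ in range(50):                         # iterate to (approximate) convergence
    r = alpha * P.T @ r + (1 - alpha) * v

print(sorted(zip(tags, r), key=lambda t: -t[1]))   # tags re-ranked by refined score
```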
| Learning to tag | | BIBAK | Full-Text | 361-370 | |
| Lei Wu; Linjun Yang; Nenghai Yu; Xian-Sheng Hua | |||
| Social tagging provides valuable and crucial information for large-scale web
image retrieval. It is ontology-free and easy to obtain; however, irrelevant
tags frequently appear, and users typically will not tag all semantic objects
in the image, which is also called semantic loss. To reduce noise and
compensate for the semantic loss, tag recommendation has been proposed in the literature.
However, current recommendation methods simply rank the related tags based on the
single modality of tag co-occurrence on the whole dataset, which ignores other
modalities, such as visual correlation. This paper proposes a multi-modality
recommendation based on both tag and visual correlation, and formulates the tag
recommendation as a learning problem. Each modality is used to generate a
ranking feature, and Rankboost algorithm is applied to learn an optimal
combination of these ranking features from different modalities. Experiments on
Flickr data demonstrate the effectiveness of this learning-based multi-modality
recommendation strategy. Keywords: learning to tag, multi-modality rankboost, social tagging, tag
recommendation | |||
| Efficient interactive fuzzy keyword search | | BIBAK | Full-Text | 371-380 | |
| Shengyue Ji; Guoliang Li; Chen Li; Jianhua Feng | |||
| Traditional information systems return answers after a user submits a
complete query. Users often feel "left in the dark" when they have limited
knowledge about the underlying data, and have to use a try-and-see approach for
finding information. A recent trend of supporting autocomplete in these systems
is a first step towards solving this problem. In this paper, we study a new
information-access paradigm, called "interactive, fuzzy search," in which the
system searches the underlying data "on the fly" as the user types in query
keywords. It extends autocomplete interfaces by (1) allowing keywords to appear
in multiple attributes (in an arbitrary order) of the underlying data; and (2)
finding relevant records that have keywords matching query keywords
approximately. This framework allows users to explore data as they type, even
in the presence of minor errors. We study research challenges in this framework
for large amounts of data. Since each keystroke of the user could invoke a
query on the backend, we need efficient algorithms to process each query within
milliseconds. We develop various incremental-search algorithms using previously
computed and cached results in order to achieve an interactive speed. We have
deployed several real prototypes using these techniques. One of them has been
deployed to support interactive search on the UC Irvine people directory, which
has been used regularly and is well received by users due to its friendly interface and high efficiency. Keywords: autocomplete, fuzzy search, interactive search | |||
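The core operation behind such an interface is fuzzy prefix matching: as each keystroke arrives, keywords whose prefixes are within a small edit distance of the partial query are retrieved. The sketch below is a minimal, non-incremental version of that idea; the paper's cached incremental algorithms and index structures are not reproduced, and the record set and error threshold are made up.

```python
def prefix_edit_distance(query, keyword):
    """Minimum edit distance between `query` and any prefix of `keyword`."""
    m = len(query)
    col = list(range(m + 1))            # distances of query prefixes to ""
    best = col[m]
    for ch in keyword:
        new = [col[0] + 1]
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == ch else 1
            new.append(min(new[i - 1] + 1, col[i] + 1, col[i - 1] + cost))
        col = new
        best = min(best, col[m])
    return best

def fuzzy_search(query, records, max_errors=1):
    """Return records with a keyword whose prefix is close to the typed query."""
    query = query.lower()
    return [rec for rec in records
            if any(prefix_edit_distance(query, kw) <= max_errors
                   for kw in rec.lower().split())]

# Illustrative data: a tiny people directory.
people = ["Jianhua Feng", "Chen Li", "Guoliang Li", "Shengyue Ji"]
print(fuzzy_search("chn", people))      # tolerates the missing "e" in "chen"
```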
| An axiomatic approach for result diversification | | BIBAK | Full-Text | 381-390 | |
| Sreenivas Gollapudi; Aneesh Sharma | |||
| Understanding user intent is key to designing an effective ranking system in
a search engine. In the absence of any explicit knowledge of user intent,
search engines want to diversify results to improve user satisfaction. In such
a setting, the probability ranking principle-based approach of presenting the
most relevant results on top can be sub-optimal, and hence the search engine
would like to trade off relevance for diversity in the results.
In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database. Keywords: approximation algorithms, axiomatic framework, diversification, facility
dispersion, search engine, wikipedia | |||
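The abstract does not fix a single objective, so the sketch below only illustrates the general flavor of dispersion-based diversification: a greedy selection that trades off a result's relevance against its dissimilarity to already-chosen results. The relevance scores and distance function are invented for illustration and are not the paper's axioms or objectives.

```python
def diversify(candidates, relevance, distance, k, trade_off=0.5):
    """Greedily pick k items maximizing a relevance + dispersion trade-off.

    relevance: dict item -> score; distance: dict (item, item) -> dissimilarity.
    """
    selected = []
    while len(selected) < min(k, len(candidates)):
        def gain(c):
            dispersion = sum(distance[(c, s)] for s in selected)
            return (1 - trade_off) * relevance[c] + trade_off * dispersion
        best = max((c for c in candidates if c not in selected), key=gain)
        selected.append(best)
    return selected

# Hypothetical results for the ambiguous query "jaguar".
items = ["car review", "car pricing", "animal facts", "os release notes"]
rel = {"car review": 0.9, "car pricing": 0.85,
       "animal facts": 0.6, "os release notes": 0.5}
dist = {(a, b): 0.0 if a == b else
        (0.2 if a.startswith("car") == b.startswith("car") else 1.0)
        for a in items for b in items}
print(diversify(items, rel, dist, k=3))
```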
| Quicklink selection for navigational query results | | BIBAK | Full-Text | 391-400 | |
| Deepayan Chakrabarti; Ravi Kumar; Kunal Punera | |||
Quicklinks for a website are navigational shortcuts displayed below the
website homepage on a search results page that let users jump directly
to selected points inside the website. Since the real estate on a search
results page is constrained and valuable, picking the best set of quicklinks to
maximize the benefits for a majority of the users becomes an important problem
for search engines. Using user browsing trails obtained from browser toolbars,
and a simple probabilistic model, we formulate the quicklink selection problem
as a combinatorial optimization problem. We first demonstrate the hardness of
the objective, and then propose an algorithm that is provably within a factor
of 1-1/e of the optimal. We also propose a different algorithm that works on
trees and that can find the optimal solution; unlike the previous algorithm,
this algorithm can incorporate natural constraints on the set of chosen
quicklinks. The efficacy of our methods is demonstrated via empirical results
on both a manually labeled set of websites and a set for which quicklink
click-through rates for several webpages were obtained from a real-world search
engine. Keywords: navigational queries, quicklinks, toolbar data, trails | |||
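As a rough illustration of why a greedy strategy gives a 1-1/e guarantee here, the sketch below treats quicklink selection as maximizing a monotone submodular, coverage-style benefit over observed browsing trails (each trail gets the benefit of the single best quicklink chosen for it). The trail data and benefit model are invented placeholders, not the paper's probabilistic model.

```python
def greedy_quicklinks(candidates, trails, k):
    """Greedily choose k quicklinks maximizing total benefit over trails.

    trails: list of dicts mapping candidate page -> benefit (e.g. clicks saved)
            for one observed trail; a trail counts only its best chosen link,
            which makes the objective monotone submodular.
    """
    chosen = set()

    def total(links):
        return sum(max((t.get(l, 0.0) for l in links), default=0.0) for t in trails)

    for _ in range(min(k, len(candidates))):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: total(chosen | {c}) - total(chosen))
        chosen.add(best)
    return chosen

# Hypothetical trails on a retailer site (benefit = steps a quicklink would save).
trails = [{"/deals": 3, "/login": 1}, {"/login": 2}, {"/support": 4, "/deals": 1}]
print(greedy_quicklinks(["/deals", "/login", "/support", "/careers"], trails, k=2))
```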
| Inverted index compression and query processing with optimized document ordering | | BIBAK | Full-Text | 401-410 | |
| Hao Yan; Shuai Ding; Torsten Suel | |||
| Web search engines use highly optimized compression schemes to decrease
inverted index size and improve query throughput, and many index compression
techniques have been studied in the literature. One approach taken by several
recent studies first performs a renumbering of the document IDs in the
collection that groups similar documents together, and then applies standard
compression techniques. It is known that this can significantly improve index
compression compared to a random document ordering. We study index compression
and query processing techniques for such reordered indexes. Previous work has
focused on determining the best possible ordering of documents. In contrast, we
assume that such an ordering is already given, and focus on how to optimize
compression methods and query processing for this case. We perform an extensive
study of compression techniques for document IDs and present new optimizations
of existing techniques that can achieve significant improvements in both
compression and decompression performance. We also propose and evaluate
techniques for compressing frequency values for this case. Finally, we study
the effect of this approach on query processing performance. Our experiments
show very significant improvements in index size and query processing speed on
the TREC GOV2 collection of 25.2 million web pages. Keywords: IR query processing, document ordering, index compression, inverted index,
search engines | |||
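The baseline pipeline the paper builds on can be summarized in a few lines: sort an inverted list's document IDs, take gaps, and compress the gaps with a variable-byte code; a clustered document ordering yields smaller gaps and therefore fewer bytes. The sketch below shows only this baseline, with made-up posting lists, not the paper's optimized codes.

```python
def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)        # 7 payload bits, continuation implied
            n >>= 7
        out.append(n | 0x80)            # last byte carries the stop flag
    return bytes(out)

def compress_postings(doc_ids):
    """Delta-encode sorted docIDs, then apply variable-byte compression."""
    doc_ids = sorted(doc_ids)
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

# A clustered ordering yields small gaps and a shorter encoding.
clustered = [100, 101, 103, 104, 110]
scattered = [100, 4096, 9000, 70000, 131072]
print(len(compress_postings(clustered)), len(compress_postings(scattered)))
```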
| RuralCafe: web search in the rural developing world | | BIBAK | Full-Text | 411-420 | |
| Jay Chen; Lakshminarayanan Subramanian; Jinyang Li | |||
| The majority of people in rural developing regions do not have access to the
World Wide Web. Traditional network connectivity technologies have proven to be
prohibitively expensive in these areas. The emergence of new long-range
wireless technologies provides hope for connecting these rural regions to the
Internet. However, the network connectivity provided by these new solutions is
by nature intermittent, due to high network usage rates, frequent power cuts and
the use of delay tolerant links. Typical applications, especially interactive
applications like web search, do not tolerate intermittent connectivity. In
this paper, we present the design and implementation of RuralCafe, a system
intended to support efficient web search over intermittent networks. RuralCafe
enables users to perform web search asynchronously and find what they are
looking for in one round of intermittency as opposed to multiple rounds of
search/downloads. RuralCafe does this by providing an expanded search query
interface which allows a user to specify additional query terms to maximize the
utility of the results returned by a search query. Given knowledge of the
limited available network resources, RuralCafe performs optimizations to
prefetch pages to best satisfy a search query based on a user's search
preferences. In addition, RuralCafe does not require modifications to the web
browser, and can provide single round search results tailored to various types
of networks and economic constraints. We have implemented and evaluated the
effectiveness of RuralCafe using queries from logs made to a large search
engine, queries made by users in an intermittent setting, and live queries from
a small testbed deployment. We have also deployed a prototype of RuralCafe in
Kerala, India. Keywords: intermittent network, low bandwidth, web search, world wide web | |||
| Using graphics processors for high performance IR query processing | | BIBAK | Full-Text | 421-430 | |
| Shuai Ding; Jinru He; Hao Yan; Torsten Suel | |||
| Web search engines are facing formidable performance challenges due to data
sizes and query loads. The major engines have to process tens of thousands of
queries per second over tens of billions of documents. To deal with this heavy
workload, such engines employ massively parallel systems consisting of
thousands of machines. The significant cost of operating these systems has
motivated a lot of recent research into more efficient query processing
mechanisms. We investigate a new way to build such high performance IR systems
using graphical processing units (GPUs). GPUs were originally designed to
accelerate computer graphics applications through massive on-chip parallelism.
Recently a number of researchers have studied how to use GPUs for other problem
domains such as databases and scientific computing. Our contribution here is to
design a basic system architecture for GPU-based high-performance IR, to
develop suitable algorithms for subtasks such as inverted list compression,
list intersection, and top-k scoring, and to show how to achieve highly
efficient query processing on GPU-based systems. Our experimental results for a
prototype GPU-based system on 25.2 million web pages indicate that
significant gains in query processing performance can be obtained. Keywords: GPU, index compression, ir query processing, search engines | |||
| Improved techniques for result caching in web search engines | | BIBAK | Full-Text | 431-440 | |
| Qingqing Gan; Torsten Suel | |||
| Query processing is a major cost factor in operating large web search
engines. In this paper, we study query result caching, one of the main
techniques used to optimize query processing performance. Our first
contribution is a study of result caching as a weighted caching problem. Most
previous work has focused on optimizing cache hit ratios, but given that
processing costs of queries can vary very significantly, we argue that total
cost savings also need to be considered. We describe and evaluate several
algorithms for weighted result caching, and study the impact of Zipf-based
query distributions on result caching. Our second and main contribution is a
new set of feature-based cache eviction policies that achieve significant
improvements over all previous methods, substantially narrowing the existing
performance gap to the theoretically optimal (clairvoyant) method. Finally,
using the same approach, we also obtain performance gains for the related
problem of inverted list caching. Keywords: index caching, result caching, search engines, weighted caching | |||
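To make the weighted-caching idea concrete, the toy cache below evicts based on an estimated query processing cost weighted by staleness rather than pure recency. The cost model, capacity, and eviction score are illustrative assumptions; the paper's feature-based eviction policies are not reproduced here.

```python
class WeightedResultCache:
    """Toy result cache that evicts the entry with the lowest
    (estimated processing cost / staleness) value instead of plain LRU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}               # query -> (results, cost, last_access)
        self.clock = 0

    def get(self, query):
        self.clock += 1
        if query in self.entries:
            results, cost, _ = self.entries[query]
            self.entries[query] = (results, cost, self.clock)
            return results              # cache hit: processing cost saved
        return None

    def put(self, query, results, cost):
        self.clock += 1
        if query not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry whose expected saving, discounted by staleness, is smallest.
            victim = min(self.entries,
                         key=lambda q: self.entries[q][1] / (self.clock - self.entries[q][2]))
            del self.entries[victim]
        self.entries[query] = (results, cost, self.clock)

cache = WeightedResultCache(capacity=2)
cache.put("cheap query", ["r1"], cost=1.0)
cache.put("expensive query", ["r2"], cost=50.0)
cache.put("new query", ["r3"], cost=5.0)   # evicts "cheap query", not the expensive one
print(sorted(cache.entries))
```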
| Nearest-neighbor caching for content-match applications | | BIBAK | Full-Text | 441-450 | |
| Sandeep Pandey; Andrei Broder; Flavio Chierichetti; Vanja Josifovski; Ravi Kumar; Sergei Vassilvitskii | |||
| Motivated by contextual advertising systems and other web applications
involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a
cache hit is said to occur if the requested item is similar but not necessarily
equal to some cached item. We study two objectives that dictate the
efficiency-accuracy tradeoff and provide our caching policies for these
objectives. By conducting extensive experiments on real data we show similarity
caching can significantly improve the efficiency of contextual advertising
systems, with minimal impact on accuracy. Inspired by the above, we propose a
simple generative model that embodies two fundamental characteristics of page
requests arriving to advertising systems, namely, long-range dependences and
similarities. We provide theoretical bounds on the gains of similarity caching
in this model and demonstrate these gains empirically by fitting the actual
data to the model. Keywords: caching, content-match, nearest-neighbor | |||
| Compressed web indexes | | BIBAK | Full-Text | 451-460 | |
| Flavio Chierichetti; Ravi Kumar; Prabhakar Raghavan | |||
| Web search engines use indexes to efficiently retrieve pages containing
specified query terms, as well as pages linking to specified pages. The problem
of compressed indexes that permit such fast retrieval has a long history. We
consider the problem: assuming that the terms in (or links to) a page are
generated from a probability distribution, how well compactly can we build such
indexes that allow fast retrieval? Of particular interest is the case when the
probability distribution is Zipfian (or a similar power law), since these are
the distributions that arise on the web. We obtain sharp bounds on the space
requirement of Boolean indexes for text documents that follow Zipf's law. In
the process we develop a general technique that applies to any probability
distribution, not necessarily a power law; this is the first analysis of
compression in indexes under arbitrary distributions. Our bounds lead to
quantitative versions of rules of thumb that are folklore in indexing. Our
experiments on several document collections show that the distribution of terms
appears to follow a double-Pareto law rather than Zipf's law. Despite widely
varying sets of documents, the index sizes observed in the experiments conform
well to our theoretical predictions. Keywords: compression, double-pareto, index size, power law | |||
| Unsupervised query categorization using automatically-built concept graphs | | BIBAK | Full-Text | 461-470 | |
| Eustache Diemert; Gilles Vandelle | |||
| Automatic categorization of user queries is an important component of
general purpose (Web) search engines, particularly for triggering rich,
query-specific content and sponsored links. We propose an unsupervised learning
scheme that dramatically reduces the cost of setting up and maintaining such a
categorizer, while retaining good categorization power. The model is stored as
a graph of concepts where graph edges represent the cross-reference between the
concepts. Concepts and relations are extracted from query logs by an offline
Web mining process, which uses a search engine as a powerful summarizer for
building a concept graph. Empirical evaluation indicates that the system
compares favorably on publicly available data sets (such as KDD Cup 2005) as
well as on portions of the current query stream of Yahoo! Search, where it is
already changing the experience of millions of Web search users. Keywords: concept networks, cross-reference, knowledge based search, query
categorization, unsupervised learning, web mining | |||
| Understanding user's query intent with wikipedia | | BIBAK | Full-Text | 471-480 | |
| Jian Hu; Gang Wang; Fred Lochovsky; Jian-tao Sun; Zheng Chen | |||
Understanding the intent behind a user's query can help a search engine
automatically route the query to corresponding vertical search engines to
obtain particularly relevant content, thus greatly improving user
satisfaction. There are three major challenges to the query intent
classification problem: (1) intent representation; (2) domain coverage; and (3)
semantic interpretation. Current approaches to predicting the user's intent mainly
utilize machine learning techniques. However, meeting all these challenges with
statistical machine learning approaches is difficult and often requires much human
effort. In this paper, we propose a general methodology for
the problem of query intent classification. With very little human effort, our
method can discover large quantities of intent concepts by leveraging
Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are
used as the intent representation space; thus, each intent domain is
represented as a set of Wikipedia articles and categories. The intent of any
input query is identified by mapping the query into the Wikipedia
representation space. Compared with previous approaches, our proposed method
achieves much better coverage when classifying queries in an intent domain even
though the number of seed intent examples is very small. Moreover, the method
is very general and can be easily applied to various intent domains. We
demonstrate the effectiveness of this method in three different applications,
i.e., travel, job, and person name. In each of the three cases, only a couple
of seed intent queries are provided. We perform quantitative evaluations in
comparison with two baseline methods, and the experimental results show that
our method significantly outperforms the other methods in each intent domain. Keywords: query classification, query intent, user intent, wikipedia | |||
| Discovering users' specific geo intention in web search | | BIBAK | Full-Text | 481-490 | |
| Xing Yi; Hema Raghavan; Chris Leggetter | |||
| Discovering users' specific and implicit geographic intention in web search
can greatly help satisfy users' information needs. We build a geo intent
analysis system that uses minimal supervision to learn a model from large
amounts of web-search logs for this discovery. We build a city language model,
which is a probabilistic representation of the language surrounding the mention
of a city in web queries. We use several features derived from these language
models to: (1) identify users' implicit geo intent and pinpoint the city
corresponding to this intent, (2) determine whether the geo-intent is localized
around the users' current geographic location, (3) predict cities for queries
that have a mention of an entity that is located in a specific place.
Experimental results demonstrate the effectiveness of using features derived
from the city language model. We find that (1) the system has over 90%
precision and more than 74% accuracy for the task of detecting users' implicit
city-level geo intent; (2) the system achieves more than 96% accuracy in
determining whether implicit geo queries are local geo queries, neighbor-region
geo queries or none of these; and (3) the city language model can effectively
retrieve cities in location-specific queries with high precision (88%) and
recall (74%); human evaluation shows that the language model predicts city
labels for location-specific queries with high accuracy (84.5%). Keywords: city language model, geo intent, geographic search intent, implicit search
intent, local search intent | |||
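A minimal sketch of the city-language-model idea follows: build a unigram model from the words surrounding each city in past queries, then score a new query against every city's model to surface implicit geo intent. The toy logs and smoothing constant are assumptions, not the paper's data or estimator.

```python
import math
from collections import Counter

def build_city_models(geo_queries):
    """geo_queries: dict city -> list of past queries mentioning that city.
    Returns a unigram language model (word counts) per city."""
    return {city: Counter(w for q in queries for w in q.lower().split())
            for city, queries in geo_queries.items()}

def score(query, model, alpha=0.1):
    """Additively smoothed log-likelihood of the query under a city's model."""
    total = sum(model.values())
    vocab = len(model) or 1
    return sum(math.log((model[w] + alpha) / (total + alpha * vocab))
               for w in query.lower().split())

logs = {
    "boston": ["boston red sox tickets", "boston harbor cruise", "cheap hotels boston"],
    "austin": ["austin live music", "austin bbq", "austin tech jobs"],
}
models = build_city_models(logs)
query = "live music tonight"            # no explicit city: implicit geo intent
print(max(models, key=lambda c: score(query, models[c])))
```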
| A search-based method for forecasting ad impression in contextual advertising | | BIBAK | Full-Text | 491-500 | |
| Xuerui Wang; Andrei Broder; Marcus Fontoura; Vanja Josifovski | |||
| Contextual advertising (also called content match) refers to the placement
of small textual ads within the content of a generic web page. It has become a
significant source of revenue for publishers ranging from individual bloggers
to major newspapers. At the same time it is an important way for advertisers to
reach their intended audience. This reach depends on the total number of
exposures of the ad (impressions) and its click-through-rate (CTR) that can be
viewed as the probability of an end-user clicking on the ad when shown. These
two orthogonal, critical factors are both difficult to estimate and even
individually can still be very informative and useful in planning and budgeting
advertising campaigns.
In this paper, we address the problem of forecasting the number of impressions for new or changed ads in the system. Producing such forecasts, even within large margins of error, is quite challenging: 1) ad selection in contextual advertising is a complicated process based on tens or even hundreds of page and ad features; 2) the publishers' content and traffic vary over time; and 3) the scale of the problem is daunting: over the course of a week it involves billions of impressions, hundreds of millions of distinct pages, hundreds of millions of ads, and varying bids of other competing advertisers. We tackle these complexities by simulating the presence of a given ad with its associated bid over weeks of historical data. We obtain an impression estimate by counting how many times the ad would have been displayed if it were in the system over that period of time. We estimate this count by an efficient two-level search algorithm over the distinct pages in the data set. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time. We also show how this method can be used in tools for bid selection and ad evaluation. Keywords: content match, contextual advertising, impression forecasting, online
advertising, wand | |||
| Exploiting web search engines to search structured databases | | BIBAK | Full-Text | 501-510 | |
| Sanjay Agrawal; Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti; Arnd Christian Konig; Dong Xin | |||
| Web search engines often federate many user queries to relevant structured
databases. For example, a product related query might be federated to a product
database containing their descriptions and specifications. The relevant
structured data items are then returned to the user along with web search
results. However, each structured database is searched in isolation. Hence, the
search often produces empty or incomplete results as the database may not
contain the required information to answer the query. In this paper, we propose
a novel integrated search architecture. We establish and exploit the
relationships between web search results and the items in structured databases
to identify the relevant structured data items for a much wider range of
queries.
Our architecture leverages existing search engine components to implement this functionality at very low overhead. We demonstrate the quality and efficiency of our techniques through an extensive experimental study. Keywords: entity extraction, entity ranking, entity search, structured database search | |||
| Online expansion of rare queries for sponsored search | | BIBAK | Full-Text | 511-520 | |
| Andrei Broder; Peter Ciccolo; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler; Lance Riedel; Jeffrey Yuan | |||
| Sponsored search systems are tasked with matching queries
to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost. Keywords: query expansion, sponsored search, tail queries | |||
| Collective privacy management in social networks | | BIBAK | Full-Text | 521-530 | |
| Anna Cinzia Squicciarini; Mohamed Shehab; Federica Paci | |||
| Social Networking is one of the major technological phenomena of the Web
2.0, with hundreds of millions of people participating. Social networks enable
a form of self expression for users, and help them to socialize and share
content with other users. Although content sharing is one of the prominent
features of existing Social Network sites, Social Networks do not yet
support any mechanism for collaborative management of privacy
settings for shared content. In this paper, we model the problem of
collaborative enforcement of privacy policies on shared data by using game
theory. In particular, we propose a solution that offers automated ways to
share images based on an extended notion of content ownership. Building upon
the Clarke-Tax mechanism, we describe a simple mechanism that promotes
truthfulness, and that rewards users who promote co-ownership. We integrate our
design with inference techniques that free the users from the burden of
manually selecting privacy preferences for each picture. To the best of our
knowledge this is the first time such a protection mechanism for Social
Networking has been proposed. In the paper, we also show a proof-of-concept
application, which we implemented in the context of Facebook, one of today's
most popular social networks. We show that supporting this type of solution
is not only feasible, but can be implemented with a minimal increase in
overhead to end-users. Keywords: game theory, privacy, social networks | |||
| To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles | | BIBAK | Full-Text | 531-540 | |
| Elena Zheleva; Lise Getoor | |||
| In order to address privacy concerns, many social media websites allow users
to hide their personal profiles from the public. In this work, we show how an
adversary can exploit an online social network with a mixture of public and
private user profiles to predict the private attributes of users. We map this
problem to a relational classification problem and we propose practical models
that use friendship and group membership information (which is often not
hidden) to infer sensitive attributes. The key novel idea is that in addition
to friendship links, groups can be carriers of significant information. We show
that on several well-known social media sites, we can easily and accurately
recover the information of private-profile users. To the best of our knowledge,
this is the first work that uses link-based and group-based classification to
study privacy implications in social networks with mixed public and private
user profiles. Keywords: attribute inference, groups, privacy, social networks | |||
| Privacy diffusion on the web: a longitudinal perspective | | BIBAK | Full-Text | 541-550 | |
| Balachander Krishnamurthy; Craig Wills | |||
| For the last few years we have been studying the diffusion of private
information about users as they visit various Web sites, triggering data gathering
and aggregation by third parties. This paper reports on our longitudinal study
consisting of multiple snapshots of our examination of such diffusion over four
years. We examine the various technical ways by which third-party aggregators
acquire data and the depth of user-related information acquired. We study
techniques for protecting against such privacy diffusion as well as the limitations of such
techniques. We introduce the concept of secondary privacy damage. Our results
show increasing aggregation of user-related data by a steadily decreasing
number of entities. A handful of companies are able to track users' movement
across almost all of the popular Web sites. Virtually all the protection
techniques have significant limitations highlighting the seriousness of the
problem and the need for alternate solutions. Keywords: privacy, privacy enhancing technologies | |||
| All your contacts are belong to us: automated identity theft attacks on social networks | | BIBAK | Full-Text | 551-560 | |
| Leyla Bilge; Thorsten Strufe; Davide Balzarotti; Engin Kirda | |||
| Social networking sites have been increasingly gaining popularity.
Well-known sites such as Facebook have been reporting growth rates as high as
3% per week. Many social networking sites have millions of registered users who
use these sites to share photographs, contact long-lost friends, establish new
business contacts and to keep in touch. In this paper, we investigate how easy
it would be for a potential attacker to launch automated crawling and identity
theft attacks against a number of popular social networking sites in order to
gain access to a large volume of personal user information. The first attack we
present is the automated identity theft of existing user profiles and sending
of friend requests to the contacts of the cloned victim. The hope, from the
attacker's point of view, is that the contacted users simply trust and accept
the friend request. By establishing a friendship relationship with the contacts
of a victim, the attacker is able to access the sensitive personal information
provided by them. In the second, more advanced attack we present, we show that
it is effective and feasible to launch an automated, cross-site profile cloning
attack. In this attack, we are able to automatically create a forged profile in
a network where the victim is not registered yet and contact the victim's
friends who are registered on both networks. Our experimental results with real
users show that the automated attacks we present are effective and feasible in
practice. Keywords: identity theft, social network security | |||
| Using static analysis for Ajax intrusion detection | | BIBAK | Full-Text | 561-570 | |
| Arjun Guha; Shriram Krishnamurthi; Trevor Jim | |||
| We present a static control-flow analysis for JavaScript programs running in
a web browser. Our analysis tackles numerous challenges posed by modern web
applications including asynchronous communication, frameworks, and dynamic code
generation. We use our analysis to extract a model of expected client behavior
as seen from the server, and build an intrusion-prevention proxy for the
server: the proxy intercepts client requests and disables those that do not
meet the expected behavior. We insert random asynchronous requests to foil
mimicry attacks. Finally, we evaluate our technique against several real
applications and show that it protects against an attack in a widely-used web
application. Keywords: Ajax, control-flow analysis, intrusion detection, javascript | |||
| A hybrid phish detection approach by identity discovery and keywords retrieval | | BIBAK | Full-Text | 571-580 | |
| Guang Xiang; Jason I. Hong | |||
| Phishing is a significant security threat to the Internet, which causes
tremendous economic loss every year. In this paper, we propose a novel hybrid
phish detection method based on information extraction (IE) and information
retrieval (IR) techniques. The identity-based component of our method detects
phishing webpages by directly discovering the inconsistency between their
identity and the identity they are imitating. The keywords-retrieval component
utilizes IR algorithms exploiting the power of search engines to identify
phish. Our method requires no training data, no prior knowledge of phishing
signatures and specific implementations, and thus is able to adapt quickly to
constantly appearing new phishing patterns. Comprehensive experiments over a
diverse spectrum of data sources with 11,449 pages show that both components
have a low false positive rate and the stacked approach achieves a true
positive rate of 90.06% with a false positive rate of 1.95%. Keywords: anti-phishing, information retrieval, named entity recognition | |||
| Rapid prototyping of semantic mash-ups through semantic web pipes | | BIBAK | Full-Text | 581-590 | |
| Danh Le-Phuoc; Axel Polleres; Manfred Hauswirth; Giovanni Tummarello; Christian Morbidoni | |||
| The use of RDF data published on the Web for applications is still a
cumbersome and resource-intensive task due to the limited software support and
the lack of standard programming paradigms to deal with everyday problems such
as combination of RDF data from different sources, object identifier
consolidation, ontology alignment and mediation, or plain querying and
filtering tasks. In this paper we present a framework, Semantic Web Pipes, that
supports fast implementation of Semantic data mash-ups while preserving
desirable properties such as abstraction, encapsulation, component-orientation,
code re-usability and maintainability which are common and well supported in
other application areas. Keywords: RDF, mash-up, pipes, semantic web | |||
| idMesh: graph-based disambiguation of linked data | | BIBAK | Full-Text | 591-600 | |
| Philippe Cudré-Mauroux; Parisa Haghani; Michael Jost; Karl Aberer; Hermann De Meer | |||
| We tackle the problem of disambiguating entities on the Web. We propose a
user-driven scheme where graphs of entities -- represented by globally
identifiable declarative artifacts -- self-organize in a dynamic and
probabilistic manner. Our solution has the following two desirable properties:
i) it lets end-users freely define associations between arbitrary entities and
ii) it probabilistically infers entity relationships based on uncertain links
using constraint-satisfaction mechanisms. We outline the interface between our
scheme and the current data Web, and show how higher-layer applications can
take advantage of our approach to enhance search and update of information
relating to online entities. We describe a decentralized infrastructure
supporting efficient and scalable entity disambiguation and demonstrate the
practicability of our approach in a deployment over several hundreds of
machines. Keywords: emergent semantics, entity disambiguation, linked data, peer data management | |||
| OpenRuleBench: an analysis of the performance of rule engines | | BIBAK | Full-Text | 601-610 | |
| Senlin Liang; Paul Fodor; Hui Wan; Michael Kifer | |||
The Semantic Web initiative has led to an upsurge of interest in rules
as a general and powerful way of processing, combining, and analyzing semantic
information. Since several of the technologies underlying rule-based systems
are already quite mature, it is important to understand how such systems might
perform on the Web scale. OpenRuleBench is a suite of benchmarks for analyzing
the performance and scalability of different rule engines. Currently the study
spans five different technologies and eleven systems, but OpenRuleBench is an
open community resource, and contributions from the community are welcome. In
this paper, we describe the tested systems and technologies, the methodology
used in testing, and analyze the results. Keywords: benchmark, openrulebench, rule systems, semantic web | |||
| Large scale integration of senses for the semantic web | | BIBAK | Full-Text | 611-620 | |
| Jorge Gracia; Mathieu d'Aquin; Eduardo Mena | |||
| Nowadays, the increasing amount of semantic data available on the Web leads
to a new stage in the potential of Semantic Web applications. However, it also
introduces new issues due to the heterogeneity of the available semantic
resources. One of the most remarkable is redundancy, that is, the excess of
different semantic descriptions, coming from different sources, to describe the
same intended meaning.
In this paper, we propose a technique to perform large-scale integration of senses (expressed as ontology terms) in order to cluster the most similar ones when indexing large amounts of online semantic information. It can dramatically reduce the redundancy problem on the current Semantic Web. To make this objective feasible, we have studied the adaptability and scalability of our previous work on sense integration when translated to the much larger scenario of the Semantic Web. Our evaluation shows that these techniques behave well in large-scale experiments, making the proposed approach feasible. Keywords: ontologies, scalable sense integration, semantic web | |||
| Triplify: light-weight linked data publication from relational databases | | BIBAK | Full-Text | 621-630 | |
| Sören Auer; Sebastian Dietzold; Jens Lehmann; Sebastian Hellmann; David Aumueller | |||
| In this paper we present Triplify -- a simplistic but effective approach to
publish Linked Data from relational databases. Triplify is based on mapping
HTTP-URI requests onto relational database queries. Triplify transforms the
resulting relations into RDF statements and publishes the data on the Web in
various RDF serializations, in particular as Linked Data. The rationale for
developing Triplify is that the largest part of information on the Web is
already stored in structured form, often as data contained in relational
databases, but usually published by Web applications only as HTML mixing
structure, layout and content. In order to reveal the pure structured
information behind the current Web, we have implemented Triplify as a
light-weight software component, which can be easily integrated into and
deployed by the numerous, widely installed Web applications. Our approach
includes a method for publishing update logs to enable incremental crawling of
linked data sources. Triplify is complemented by a library of configurations
for common relational schemata and a REST-enabled data source registry.
Triplify configurations containing mappings are provided for many popular Web
applications, including osCommerce, WordPress, Drupal, Gallery, and phpBB. We
will show that despite its light-weight architecture Triplify is usable to
publish very large datasets, such as 160GB of geo data from the OpenStreetMap
project. Keywords: data web, databases, geo data, linked data, rdf, semantic web, sql, web
application | |||
| SOFIE: a self-organizing framework for information extraction | | BIBAK | Full-Text | 631-640 | |
| Fabian M. Suchanek; Mauro Sozio; Gerhard Weikum | |||
| This paper presents SOFIE, a system for automated ontology extension. SOFIE
can parse natural language documents, extract ontological facts from them and
link the facts into an ontology. SOFIE uses logical reasoning on the existing
knowledge and on the new knowledge in order to disambiguate words to their most
probable meaning, to reason on the meaning of text patterns and to take into
account world knowledge axioms. This allows SOFIE to check the plausibility of
hypotheses and to avoid inconsistencies with the ontology. The framework of
SOFIE unites the paradigms of pattern matching, word sense disambiguation and
ontological reasoning in one unified model. Our experiments show that SOFIE
delivers high-quality output, even from unstructured Internet documents. Keywords: automated reasoning, information extraction, ontology | |||
| Evaluating similarity measures for emergent semantics of social tagging | | BIBAK | Full-Text | 641-650 | |
| Benjamin Markines; Ciro Cattuto; Filippo Menczer; Dominik Benz; Andreas Hotho; Gerd Stumme | |||
| Social bookmarking systems are becoming increasingly important data sources
for bootstrapping and maintaining Semantic Web applications. Their emergent
information structures have become known as folksonomies. A key question for
harvesting semantics from these systems is how to extend and adapt traditional
notions of similarity to folksonomies, and which measures are best suited for
applications such as community detection, navigation support, semantic search,
user profiling and ontology learning. Here we build an evaluation framework to
compare various general folksonomy-based similarity measures, which are derived
from several established information-theoretic, statistical, and practical
measures. Our framework deals generally and symmetrically with users, tags, and
resources. For evaluation purposes we focus on similarity between tags and
between resources and consider different methods to aggregate annotations
across users. After comparing the ability of several tag similarity measures to
predict user-created tag relations, we provide an external grounding by
user-validated semantic proxies based on WordNet and the Open Directory
Project. We also investigate the issue of scalability. We find that mutual
information with distributional micro-aggregation across users yields the
highest accuracy, but is not scalable; per-user projection with collaborative
aggregation provides the best scalable approach via incremental computations.
The results are consistent across resource and tag similarity. Keywords: ontology learning, semantic grounding, social similarity, web 2.0 | |||
| Measuring the similarity between implicit semantic relations from the web | | BIBAK | Full-Text | 651-660 | |
| Danushka T. Bollegala; Yutaka Matsuo; Mitsuru Ishizuka | |||
| Measuring the similarity between semantic relations that hold among entities
is an important and necessary step in various Web related tasks such as
relation extraction, information retrieval and analogy detection. For example,
consider the case in which a person knows a pair of entities (e.g. Google,
YouTube), between which a particular relation holds (e.g. acquisition). The
person is interested in retrieving other such pairs with similar relations
(e.g. Microsoft, Powerset). Existing keyword-based search engines cannot be
applied directly in this case because, in keyword-based search, the goal is to
retrieve documents that are relevant to the words used in a query -- not
necessarily to the relations implied by a pair of words. We propose a
relational similarity measure, using a Web search engine, to compute the
similarity between semantic relations implied by two pairs of words. Our method
has three components: representing the various semantic relations that exist
between a pair of words using automatically extracted lexical patterns,
clustering the extracted lexical patterns to identify the different patterns
that express a particular semantic relation, and measuring the similarity
between semantic relations using a metric learning approach. We evaluate the
proposed method in two tasks: classifying semantic relations between named
entities, and solving word-analogy questions. The proposed method outperforms
all baselines in a relation classification task with a statistically
significant average precision score of 0.74. Moreover, it reduces the time
taken by Latent Relational Analysis to process 374 word-analogy questions from
9 days to less than 6 hours, with an SAT score of 51%. Keywords: natural language processing, relational similarity, web mining | |||
| Extracting key terms from noisy and multitheme documents | | BIBAK | Full-Text | 661-670 | |
| Maria Grineva; Maxim Grinev; Dmitry Lizorkin | |||
We present a novel method for key term extraction from text documents. In
our method, a document is modeled as a graph of semantic relationships between
the terms of that document. We exploit the following remarkable feature of the
graph: the terms related to the main topics of the document tend to bunch up
into densely interconnected subgraphs or communities, while unimportant terms
fall into weakly interconnected communities or even become isolated vertices.
We apply graph community detection techniques to partition the graph into
thematically cohesive groups of terms. We introduce a criterion function to
select groups that contain key terms discarding groups with unimportant terms.
To weight terms and determine semantic relatedness between them we exploit
information extracted from Wikipedia.
This approach gives us two advantages. First, it allows effective processing of multi-theme documents. Second, it is good at filtering out noise in the document, such as navigational bars or headers in web pages. Evaluations of the method show that it outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective on noisy and multi-theme documents than existing methods. Keywords: community detection, graph analysis, keywords extraction, semantic
similarity, wikipedia | |||
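A simplified stand-in for the described pipeline is sketched below: build a term co-occurrence graph over sentences and partition it with an off-the-shelf modularity-based community detection routine. Here networkx's greedy modularity method is assumed as a stand-in; the paper instead weights the graph with Wikipedia-based relatedness and uses its own criterion for selecting key-term groups.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def key_term_groups(sentences, min_size=2):
    """Build a sentence-level term co-occurrence graph and return
    densely connected groups of terms as key-term candidates."""
    graph = nx.Graph()
    for sent in sentences:
        terms = set(sent.lower().split())
        for a, b in itertools.combinations(sorted(terms), 2):
            w = graph.get_edge_data(a, b, {"weight": 0})["weight"]
            graph.add_edge(a, b, weight=w + 1)   # accumulate co-occurrence counts
    communities = greedy_modularity_communities(graph, weight="weight")
    return [sorted(c) for c in communities if len(c) >= min_size]

doc = ["semantic web uses rdf", "rdf describes web resources",
       "the navigation bar has links", "links in the navigation bar"]
print(key_term_groups(doc))
```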
| Tagommenders: connecting users to items through tags | | BIBAK | Full-Text | 671-680 | |
| Shilad Sen; Jesse Vig; John Riedl | |||
| Tagging has emerged as a powerful mechanism that enables users to find,
organize, and understand online entities. Recommender systems similarly enable
users to efficiently navigate vast collections of items. Algorithms combining
tags with recommenders may deliver both the automation inherent in
recommenders, and the flexibility and conceptual comprehensibility inherent in
tagging systems. In this paper we explore tagommenders, recommender algorithms
that predict users' preferences for items based on their inferred preferences
for tags. We describe tag preference inference algorithms based on users'
interactions with tags and movies, and evaluate these algorithms based on tag
preference ratings collected from 995 MovieLens users. We design and evaluate
algorithms that predict users' ratings for movies based on their inferred tag
preferences. Our tag-based algorithms generate better recommendation rankings
than state-of-the-art algorithms, and they may lead to flexible recommender
systems that leverage the characteristics of items users find most important. Keywords: collaborative filtering, recommender systems, tagging | |||
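The core idea can be illustrated in a few lines: infer a user's preference for each tag from the ratings of items carrying that tag, then score an unseen item by averaging the inferred preferences of its tags. The movie data and the simple averaging rule below are illustrative, not the paper's inference algorithms.

```python
from collections import defaultdict

def infer_tag_preferences(ratings, item_tags):
    """Average the ratings a user gave to items carrying each tag."""
    sums, counts = defaultdict(float), defaultdict(int)
    for item, rating in ratings.items():
        for tag in item_tags.get(item, []):
            sums[tag] += rating
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def predict_rating(item, tag_prefs, item_tags, default=3.0):
    """Score an unseen item by the mean inferred preference of its tags."""
    tags = [t for t in item_tags.get(item, []) if t in tag_prefs]
    return sum(tag_prefs[t] for t in tags) / len(tags) if tags else default

item_tags = {"Fargo": ["dark comedy", "crime"], "Heat": ["crime", "thriller"],
             "The Town": ["crime", "thriller"]}
ratings = {"Fargo": 5.0, "Heat": 4.0}              # one user's movie ratings
prefs = infer_tag_preferences(ratings, item_tags)
print(predict_rating("The Town", prefs, item_tags))
```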
| Collaborative filtering for orkut communities: discovery of user latent behavior | | BIBAK | Full-Text | 681-690 | |
| Wen-Yen Chen; Jon-Chyuan Chu; Junyi Luan; Hongjie Bai; Yi Wang; Edward Y. Chang | |||
| Users of social networking services can connect with each other by forming
communities for online interaction. Yet as the number of communities hosted by
such websites grows over time, users have even greater need for effective
community recommendations in order to meet more users. In this paper, we
investigate two algorithms from very different domains and evaluate their
effectiveness for personalized community recommendation. First is association
rule mining (ARM), which discovers associations between sets of communities
that are shared across many users. Second is latent Dirichlet allocation (LDA),
which models user-community co-occurrences using latent aspects. In comparing
LDA with ARM, we are interested in discovering whether modeling low-rank latent
structure is more effective for recommendations than directly mining rules from
the observed data. We experiment on an Orkut data set consisting of 492,104
users and 118,002 communities. Our empirical comparisons using the top-k
recommendations metric show that LDA performs consistently better than ARM for
the community recommendation task when recommending a list of 4 or more
communities. However, for recommendation lists of up to 3 communities, ARM is
still a bit better. We analyze examples of the latent information learned by
LDA to explain this finding. To efficiently handle the large-scale data set, we
parallelize LDA on distributed computers and demonstrate our parallel
implementation's scalability with varying numbers of machines. Keywords: association rule mining, collaborative filtering, data mining, latent topic
models, recommender systems | |||
| Personalized recommendation on dynamic content using predictive bilinear models | | BIBAK | Full-Text | 691-700 | |
| Wei Chu; Seung-Taek Park | |||
| In Web-based services of dynamic content (such as news articles),
recommender systems face the difficulty of identifying high-quality new items
in a timely manner and providing recommendations for new users. We propose a
feature-based machine learning approach to personalized recommendation that is
capable of handling the cold-start issue effectively. We maintain profiles of
content of interest, in which temporal characteristics of the content, e.g.
popularity and freshness, are updated in real time. We also maintain
profiles of users including demographic information and a summary of user
activities within Yahoo! properties. Based on all features in user and content
profiles, we develop predictive bilinear regression models to provide accurate
personalized recommendations of new items for both existing and new users. This
approach results in an offline model with light computational overhead compared
with other recommender systems that require online re-training. The proposed
framework is general and flexible for other personalized tasks. The superior
performance of our approach is verified on a large-scale data set collected
from the Today-Module on Yahoo! Front Page, with comparison against six
competitive approaches. Keywords: bilinear models, dynamic features, personalization, ranking, recommender
systems, regression, user and content profile | |||
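A minimal sketch of a predictive bilinear model follows: the score for a (user, item) pair is s(u, x) = u^T W x over user features u and item features x, and W can be fit by ridge regression because the form is linear in the entries of W. The synthetic features and responses below are placeholders for the profile features described in the abstract.

```python
import numpy as np

def fit_bilinear(U, X, y, reg=1.0):
    """Fit W in s(u, x) = u^T W x by ridge regression.
    U: (n, du) user features, X: (n, dx) item features, y: (n,) responses."""
    # The bilinear form is linear in vec(W); design rows are outer products u_i x_i^T.
    Z = np.einsum("ni,nj->nij", U, X).reshape(len(y), -1)
    d = Z.shape[1]
    w_vec = np.linalg.solve(Z.T @ Z + reg * np.eye(d), Z.T @ y)
    return w_vec.reshape(U.shape[1], X.shape[1])

rng = np.random.default_rng(0)
U = rng.normal(size=(200, 3))           # e.g. synthetic user profile features
X = rng.normal(size=(200, 4))           # e.g. synthetic content features
W_true = rng.normal(size=(3, 4))
y = np.einsum("ni,ij,nj->n", U, W_true, X) + 0.1 * rng.normal(size=200)
W = fit_bilinear(U, X, y)
print(np.round(np.abs(W - W_true).max(), 2))      # recovered up to noise
```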
| Social search in "Small-World" experiments | | BIBAK | Full-Text | 701-710 | |
| Sharad Goel; Roby Muhamad; Duncan Watts | |||
| The "algorithmic small-world hypothesis" states that not only are pairs of
individuals in a large social network connected by short paths, but that
ordinary individuals can find these paths. Although theoretically plausible,
empirical evidence for the hypothesis is limited, as most chains in
"small-world" experiments fail to complete, thereby biasing estimates of "true"
chain lengths. Using data from two recent small-world experiments, comprising a
total of 162,328 message chains, and directed at one of 30 "targets" spread
across 19 countries, we model heterogeneity in chain attrition rates as a
function of individual attributes. We then introduce a rigorous way of
estimating true chain lengths that is provably unbiased, and can account for
empirically-observed variation in attrition rates. Our findings provide mixed
support for the algorithmic hypothesis. On the one hand, it appears that
roughly half of all chains can be completed in 6-7 steps -- thus supporting the
"six degrees of separation" assertion -- but on the other hand, estimates of
the mean are much longer, suggesting that for at least some of the population,
the world is not "small" in the algorithmic sense. We conclude that search
distances in social networks are fundamentally different from topological
distances, for which the mean and median of the shortest path lengths between
nodes tend to be similar. Keywords: attrition, small-world experiment, social search | |||
| Behavioral profiles for advanced email features | | BIBAK | Full-Text | 711-720 | |
| Thomas Karagiannis; Milan Vojnovic | |||
| We examine the behavioral patterns of email usage in a large-scale
enterprise over a three-month period. In particular, we focus on two main
questions: (Q1) what do replies depend on? and (Q2) what is the gain of
augmenting contacts through the friends of friends from the email social graph?
For Q1, we identify and evaluate the significance of several factors that
affect the reply probability and the email response time. We find that all
factors of our considered set are significant, provide their relative ordering,
and identify the recipient list size, and the intensity of email communication
between the correspondents as the dominant factors. We highlight various novel
threshold behaviors and provide support for existing hypotheses such as that of
the least-effort reply. For Q2, we find that the number of new contacts
extracted from the friends-of-friends relationships amounts to a large number,
but is still a limited portion of the total enterprise size. We believe
that our results provide significant insights towards informed design of
advanced email features, including those of social-networking type. Keywords: email profiles, reply probability, reply time | |||
| A measurement-driven analysis of information propagation in the flickr social network | | BIBAK | Full-Text | 721-730 | |
| Meeyoung Cha; Alan Mislove; Krishna P. Gummadi | |||
| Online social networking sites like MySpace, Facebook, and Flickr have
become a popular way to share and disseminate content. Their massive popularity
has led to viral marketing techniques that attempt to spread content, products,
and ideas on these sites. However, there is little data publicly available on
viral propagation in the real world and few studies have characterized how
information spreads over current online social networks.
In this paper, we collect and analyze large-scale traces of information dissemination in the Flickr social network. Our analysis, based on crawls of the favorite markings of 2.5 million users on 11 million photos, aims at answering three key questions: (a) how widely does information propagate in the social network? (b) how quickly does information propagate? and (c) what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network? Contrary to viral marketing "intuition," we find that (a) even popular photos do not spread widely throughout the network, (b) even popular photos spread slowly through the network, and (c) information exchanged between friends is likely to account for over 50% of all favorite-markings, but with a significant delay at each hop. Keywords: cascades, flickr, information dissemination, social networks, viral
marketing | |||
| Network analysis of collaboration structure in Wikipedia | | BIBAK | Full-Text | 731-740 | |
| Ulrik Brandes; Patrick Kenis; Jürgen Lerner; Denise van Raaij | |||
| In this paper we give models and algorithms to describe and analyze the
collaboration among authors of Wikipedia from a network analytical perspective.
The edit network encodes who interacts how with whom when editing an article;
it significantly extends previous network models that code author communities
in Wikipedia. Several characteristics summarizing some aspects of the
organization process and allowing the analyst to identify certain types of
authors can be obtained from the edit network. Moreover, we propose several
indicators characterizing the global network structure and methods to visualize
edit networks. It is shown that the structural network indicators are
correlated with quality labels of the associated Wikipedia articles. Keywords: edit network, governance, network visualization, social network analysis,
wikipedia | |||
| The slashdot zoo: mining a social network with negative edges | | BIBAK | Full-Text | 741-750 | |
| Jérôme Kunegis; Andreas Lommatzsch; Christian Bauckhage | |||
| We analyse the corpus of user relationships of the Slashdot technology news
site. The data was collected from the Slashdot Zoo feature where users of the
website can tag other users as friends and foes, providing positive and
negative endorsements. We adapt social network analysis techniques to the
problem of negative edge weights. In particular, we consider signed variants of
global network characteristics such as the clustering coefficient, node-level
characteristics such as centrality and popularity measures, and link-level
characteristics such as distances and similarity measures. We evaluate these
measures on the task of identifying unpopular users, as well as on the task of
predicting the sign of links and show that the network exhibits multiplicative
transitivity which allows algebraic methods based on matrix multiplication to
be used. We compare our methods to traditional methods which are only suitable
for positively weighted edges. Keywords: link prediction, negative edge, slashdot zoo, social network | |||
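The multiplicative transitivity mentioned above ("the friend of my foe is my foe", and so on) can be illustrated with the square of the signed adjacency matrix: length-two paths contribute the product of their edge signs. The toy network and weighting below are invented for illustration and are not the paper's measures.

```python
import numpy as np

# Signed adjacency matrix: +1 friend, -1 foe, 0 no tag (users 0..3).
A = np.array([[ 0,  1, -1,  0],
              [ 1,  0,  0, -1],
              [-1,  0,  0,  1],
              [ 0, -1,  1,  0]])

def predict_sign(A, i, j, beta=0.5):
    """Score an untagged pair from direct links plus length-2 paths;
    multiplicative transitivity makes A @ A the natural path-2 term."""
    score = A[i, j] + beta * (A @ A)[i, j]
    return 1 if score > 0 else -1 if score < 0 else 0

print(predict_sign(A, 0, 3))   # foe of a friend and friend of a foe -> predicted foe
```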
| Community gravity: measuring bidirectional effects by trust and rating on online social networks | | BIBAK | Full-Text | 751-760 | |
| Yutaka Matsuo; Hikaru Yamamoto | |||
| Several attempts have been made to analyze customer behavior on online
E-commerce sites. Some studies particularly emphasize the social networks of
customers. Users' reviews and ratings of a product exert effects on other
consumers' purchasing behavior. Whether a user refers to other users' ratings
depends on the trust accorded by a user to the reviewer. On the other hand, the
trust that is felt by a user for another user correlates with the similarity of
two users' ratings. This bidirectional interaction that involves trust and
rating is an important aspect of understanding consumer behavior in online
communities because it suggests clustering of similar users and the evolution
of strong communities. This paper presents a theoretical model along with
analyses of an actual online E-commerce site. We analyzed a large community
site in Japan: @cosme. The noteworthy characteristics of @cosme are that users
can bookmark their trusted users; in addition, they can post their own ratings
of products, which facilitates our analyses of the ratings' bidirectional
effects on trust and ratings. We describe an overview of the data in @cosme,
analyses of effects from trust to rating and vice versa, and our proposition of
a measure of community gravity, which measures how strongly a user might be
attracted to a community. Our study is based on the @cosme dataset in addition
to the Epinions dataset. It elucidates important insights and proposes a
potentially important measure for mining online social networks. Keywords: online community, rating, social networks, trust | |||
| Mapping the world's photos | | BIBAK | Full-Text | 761-770 | |
| David J. Crandall; Lars Backstrom; Daniel Huttenlocher; Jon Kleinberg | |||
| We investigate how to organize a large collection of geotagged photos,
working with a dataset of about 35 million images collected from Flickr. Our
approach combines content analysis based on text tags and image data with
structural analysis based on geospatial data. We use the spatial distribution
of where people take photos to define a relational structure between the photos
that are taken at popular places. We then study the interplay between this
structure and the content, using classification methods for predicting such
locations from visual, textual and temporal features of the photos. We find
that visual and temporal features improve the ability to estimate the location
of a photo, compared to using just textual features. We illustrate using these
techniques to organize a large photo collection, while also revealing various
interesting properties about popular cities and landmarks at a global scale. Keywords: geolocation, photo collections | |||
| Ranking and classifying attractiveness of photos in folksonomies | | BIBAK | Full-Text | 771-780 | |
| Jose San Pedro; Stefan Siersdorfer | |||
| Web 2.0 applications like Flickr, YouTube, or Del.icio.us are increasingly
popular online communities for creating, editing and sharing content. The
growing size of these folksonomies poses new challenges in terms of search and
data mining. In this paper we introduce a novel methodology for automatically
ranking and classifying photos according to their attractiveness for folksonomy
members. To this end, we exploit image features known for having significant
effects on the visual quality perceived by humans (e.g. sharpness and
colorfulness) as well as textual meta data, in what is a multi-modal approach.
Using feedback and annotations available in the Web 2.0 photo sharing system
Flickr, we assign relevance values to the photos and train classification and
regression models based on these relevance assignments. With the resulting
machine learning models we categorize and rank photos according to their
attractiveness. Applications include enhanced ranking functions for search and
recommender methods for attractive content. Large scale experiments on a
collection of Flickr photos demonstrate the viability of our approach. Keywords: attractiveness features, classification, folksonomy feedback, image
analysis, photo appeal, ranking, web 2.0 | |||
| Constructing folksonomies from user-specified relations on flickr | | BIBAK | Full-Text | 781-790 | |
| Anon Plangprasopchok; Kristina Lerman | |||
| Automatic folksonomy construction from tags has attracted much attention
recently. However, inferring hierarchical relations between concepts from tags
has a drawback in that it is difficult to distinguish between more popular and
more general concepts. Instead of tags we propose to use user-specified
relations for learning folksonomies. We explore two statistical frameworks for

aggregating many shallow individual hierarchies, expressed through the
collection/set relations on the social photosharing site Flickr, into a common
deeper folksonomy that reflects how a community organizes knowledge. Our
approach addresses a number of challenges that arise while aggregating
information from diverse users, namely noisy vocabulary, and variations in the
granularity level of the concepts expressed. Our second contribution is a
method for automatically evaluating a learned folksonomy by comparing it to a
reference taxonomy, e.g., the Web directory created by the Open Directory
Project. Our empirical results suggest that user-specified relations are a good
source of evidence for learning folksonomies. Keywords: collective knowledge, data mining, folksonomies, social information
processing, taxonomies | |||
| Mining interesting locations and travel sequences from GPS trajectories | | BIBAK | Full-Text | 791-800 | |
| Yu Zheng; Lizhu Zhang; Xing Xie; Wei-Ying Ma | |||
| The increasing availability of GPS-enabled devices is changing the way
people interact with the Web, and brings us a large amount of GPS trajectories
representing people's location histories. In this paper, based on multiple
users' GPS trajectories, we aim to mine interesting locations and classical
travel sequences in a given geospatial region. Here, interesting locations mean
the culturally important places, such as Tiananmen Square in Beijing, and
frequented public areas, such as shopping malls and restaurants. Such
information can help users understand surrounding locations, and would enable
travel recommendation. In this work, we first model multiple individuals'
location histories with a tree-based hierarchical graph (TBHG). Second, based
on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based inference
model, which regards an individual's access on a location as a directed link
from the user to that location. This model infers the interest of a location by
taking into account the following three factors. 1) The interest of a location
depends on not only the number of users visiting this location but also these
users' travel experiences. 2) Users' travel experiences and location interests
have a mutual reinforcement relationship. 3) The interest of a location and the
travel experience of a user are relative values and are region-related. Third,
we mine the classical travel sequences among locations considering the
interests of these locations and users' travel experiences. We evaluated our
system using a large GPS dataset collected by 107 users over a period of one
year in the real world. As a result, our HITS-based inference model
outperformed baseline approaches like rank-by-count and rank-by-frequency.
Meanwhile, when considering users' travel experiences and location
interests, we achieved better performance than baselines such as
rank-by-count and rank-by-interest. Keywords: GPS trajectories, location recommendation, spatial data mining, user travel
experience | |||
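The mutual-reinforcement idea described in this abstract is HITS-like, so a minimal power-iteration sketch over a hypothetical user-visits-location matrix conveys the mechanism; the paper's TBHG-based model is considerably richer.

```python
# Minimal HITS-style sketch of the mutual reinforcement described above:
# a user's travel experience (hub score) and a location's interest (authority
# score) reinforce each other over a user-visits-location matrix.
# The matrix below is hypothetical, not the paper's dataset.
import numpy as np

V = np.array([            # V[u, l] = number of visits by user u to location l
    [3, 1, 0],
    [0, 2, 1],
    [1, 0, 4],
], dtype=float)

experience = np.ones(V.shape[0])
for _ in range(50):                       # power iteration until roughly stable
    interest = V.T @ experience           # interesting locations are visited by experienced users
    interest /= np.linalg.norm(interest)
    experience = V @ interest             # experienced users visit interesting locations
    experience /= np.linalg.norm(experience)

print(np.round(interest, 3), np.round(experience, 3))
```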
| Computers and iphones and mobile phones, oh my!: a logs-based comparison of search users on different devices | | BIBAK | Full-Text | 801-810 | |
| Maryam Kamvar; Melanie Kellar; Rajan Patel; Ya Xu | |||
| We present a logs-based comparison of search patterns across three
platforms: computers, iPhones and conventional mobile phones. Our goal is to
understand how mobile search users differ from computer-based search users, and
we focus heavily on the distribution and variability of tasks that users
perform from each platform. The results suggest that search usage is much more
focused for the average mobile user than for the average computer-based user.
However, search behavior on high-end phones resembles computer-based search
behavior more so than mobile search behavior. A wide variety of implications
follow from these findings. First, there is no single search interface which is
suitable for all mobile phones. We suggest that for the higher-end phones, a
close integration with the standard computer-based interface (in terms of
personalization and available feature set) would be beneficial for the user,
since these phones seem to be treated as an extension of the users' computer.
For all other phones, there is a huge opportunity for personalizing the search
experience for the user's "mobile needs", as these users are likely to
repeatedly search for a single type of information need on their phone. Keywords: google, iphone, mobile search, search, user behavior | |||
| A game based approach to assign geographical relevance to web images | | BIBAK | Full-Text | 811-820 | |
| Yuki Arase; Xing Xie; Manni Duan; Takahiro Hara; Shojiro Nishio | |||
| Geographical context is very important for images. Millions of images on the
Web have been already assigned latitude and longitude information. Due to the
rapid proliferation of such images with geographical context, it is still
difficult to effectively search and browse them, since we do not have ways to
decide their relevance. In this paper, we focus on the geographical relevance
of images, which is defined as to what extent the main objects in an image
match landmarks at the location where the image was taken. Recently,
researchers have proposed to use game based approaches to label large scale
data such as Web images. However, previous works have not shown the quality of
collected game logs in detail and how the logs can improve existing
applications. To answer these questions, we design and implement a Web-based
and multi-player game to collect human knowledge while people are enjoying the
game. Then we thoroughly analyze the game logs obtained during a three week
study with 147 participants and propose methods to determine the image
geographical relevance. In addition, we conduct an experiment to compare our
methods with a commercial search engine. Experimental results show that our
methods dramatically improve image search relevance. Furthermore, we show that
we can derive geographically relevant objects and their salient portion in
images, which is valuable for a number of applications such as image location
recognition. Keywords: geographical relevance, human computation, image annotation, image search | |||
| Web 2.0: blind to an accessible new world | | BIBAK | Full-Text | 821-830 | |
| Joshua Hailpern; Loretta Guarino-Reid; Richard Boardman; Srinivas Annam | |||
| With the advent of Web 2.0 technologies, websites have evolved from static
pages to dynamic, interactive Web-based applications with the ability to
replicate common desktop functionality. However, for blind and visually
impaired individuals who rely upon screen readers, Web 2.0 applications force
them to adapt to an inaccessible use model. Many technologies, including
WAI-ARIA, AJAX, and improved screen reader support, are rapidly evolving to
improve this situation. However, simply combining them does not solve the
problems of screen reader users. The main contributions of this paper are two
models of interaction for screen reader users, for both traditional websites
and Web 2.0 applications. Further contributions are a discussion of
accessibility difficulties screen reader users encounter when interacting with
Web 2.0 applications, a user workflow design model for improving Web 2.0
accessibility, and a set of design requirements for developers to ease the
user's burden and increase accessibility. These models, accessibility
difficulties, and design implications are based directly on responses and
lessons learned from usability research focusing on Web 2.0 usage and screen
reader users. Without the conscious effort of Web engineers and designers, most
blind and visually impaired users will shy away from using new Web 2.0
technology in favor of desktop based applications. Keywords: blind, screen reader, user models, visually impaired, web 2.0 | |||
| Scrolling behaviour with single- and multi-column layout | | BIBAK | Full-Text | 831-840 | |
| Cameron Braganza; Kim Marriott; Peter Moulder; Michael Wybrow; Tim Dwyer | |||
| The standard layout model used by web browsers is to lay text out in a
vertical scroll using a single column. The horizontal-scroll layout model -- in
which text is laid out in columns whose height is set to that of the browser
window and the viewer scrolls horizontally -- seems well-suited to multi-column
layout on electronic devices. We describe a study that examines how people read
and, in particular, the strategies they use for scrolling with these two models
when reading large textual documents on a standard computer monitor. We compare
usability of the models and evaluate both user preferences and the effect of
the model on performance. Also interesting is the description of the browser
and its user interface which we used for the study. Keywords: multi-column layout, reading, scrolling, web-browser | |||
| What's up CAPTCHA?: a CAPTCHA based on image orientation | | BIBAK | Full-Text | 841-850 | |
| Rich Gossweiler; Maryam Kamvar; Shumeet Baluja | |||
| We present a new CAPTCHA which is based on identifying an image's upright
orientation. This task requires analysis of the often complex contents of an
image, a task which humans usually perform well and machines generally do not.
Given a large repository of images, such as those from a web search result, we
use a suite of automated orientation detectors to prune those images that can
be automatically set upright easily. We then apply a social feedback mechanism
to verify that the remaining images have a human-recognizable upright
orientation. The main advantages of our CAPTCHA technique over the traditional
text recognition techniques are that it is language-independent, does not
require text-entry (e.g. for a mobile device), and employs another domain for
CAPTCHA generation beyond character obfuscation. This CAPTCHA lends itself to
rapid implementation and has an almost limitless supply of images. We conducted
extensive experiments to measure the viability of this technique. Keywords: CAPTCHA, automated attacks, image processing, orientation detection, spam,
visual processing | |||
| Rapid development of spreadsheet-based web mashups | | BIBAK | Full-Text | 851-860 | |
| Woralak Kongdenfha; Boualem Benatallah; Julien Vayssière; Régis Saint-Paul; Fabio Casati | |||
| The rapid growth of social networking sites and web communities has
motivated web sites to expose their APIs to external developers who create
mashups by assembling existing functionalities. Current APIs, however, aim
toward developers with programming expertise; they are not directly usable by
a wider class of users who do not have a programming background, but would
nevertheless like to build their own mashups. To address this need, we propose
a spreadsheet-based Web mashups development framework, which enables users to
develop mashups in the popular spreadsheet environment. First, we provide a
mechanism that makes structured data first class values of spreadsheet cells.
Second, we propose a new component model that can be used to develop fairly
sophisticated mashups, involving joining data sources and keeping spreadsheet
data up to date. Third, to simplify mashup development, we provide a collection
of spreadsheet-based mashup patterns that captures common Web data access and
spreadsheet presentation functionalities. Users can reuse and customize these
patterns to build spreadsheet-based Web mashups instead of developing them from
scratch. Fourth, we enable users to manipulate structured data presented on
spreadsheet in a drag-and-drop fashion. Finally, we have developed and tested a
proof-of-concept prototype to demonstrate the utility of the proposed
framework. Keywords: component model, spreadsheet-based mashup patterns, spreadsheets, web data
mashups | |||
| Mashroom: end-user mashup programming using nested tables | | BIBAK | Full-Text | 861-870 | |
| Guiling Wang; Shaohua Yang; Yanbo Han | |||
| This paper presents an end-user-oriented programming environment called
Mashroom. Major contributions herein include an end-user programming model with
an expressive data structure as well as a set of formally-defined mashup
operators. The data structure takes advantage of nested tables, and maintains
intuitiveness while allowing users to express complex data objects. The
mashup operators are visualized with contextual menu and formula bar and can be
directly applied on the data. Experiments and case studies reveal that end
users have little difficulty in effectively and efficiently using Mashroom to
build mashup applications. Keywords: end-user programming, mashup, nested table, spreadsheet | |||
| Automated construction of web accessibility models from transaction click-streams | | BIBAK | Full-Text | 871-880 | |
| Jalal Mahmud; Yevgen Borodin; I. V. Ramakrishnan; C. R. Ramakrishnan | |||
| Screen readers, the dominant assistive technology used by visually impaired
people to access the Web, function by speaking out the content of the screen
serially. Using screen readers for conducting online transactions can cause
considerable information overload, because transactions, such as shopping and
paying bills, typically involve a number of steps spanning several web pages.
One can combat this overload by using a transaction model for web accessibility
that presents only fragments of web pages that are needed for doing
transactions. We can realize such a model by coupling a process automaton,
encoding states of a transaction, with concept classifiers that identify page
fragments "relevant" to a particular state of the transaction. In this paper we
present a fully automated process that synergistically combines several
techniques for transforming unlabeled click-stream data generated by
transactions into a transaction model. These techniques include web content
analysis to partition a web page into segments consisting of semantically
related content, contextual analysis of data surrounding clickable objects in a
page, and machine learning methods, such as clustering of page segments based
on contextual analysis, statistical classification, and automata learning. The
use of unlabeled click streams in building transaction models has important
benefits: (i) visually impaired users do not have to depend on sighted users
for creating manually labeled training data to construct the models; (ii) it is
possible to mine personalized models from unlabeled transaction click-streams
associated with sites that visually impaired users visit regularly; (iii) since
unlabeled data is relatively easy to obtain, it is feasible to scale up the
construction of domain-specific transaction models (e.g., separate models for
shopping, airline reservations, bill payments, etc.); (iv) adjusting the
performance of deployed models over time with new training data is also
doable. We provide preliminary experimental evidence of the practical
effectiveness of both domain-specific, as well as personalized accessibility
transaction models built using our approach. Finally, this approach is
applicable for building transaction models for mobile devices with limited-size
displays, as well as for creating wrappers for information extraction from web
sites. Keywords: context, machine learning, process models, web transaction | |||
| Combining global optimization with local selection for efficient QoS-aware service composition | | BIBAK | Full-Text | 881-890 | |
| Mohammad Alrifai; Thomas Risse | |||
| The run-time binding of web services has been recently put forward in order
to support rapid and dynamic web service compositions. With the growing number
of alternative web services that provide the same functionality but differ in
quality parameters, the service composition becomes a decision problem on which
component services should be selected such that user's end-to-end QoS
requirements (e.g. availability, response time) and preferences (e.g. price)
are satisfied. Although very efficient, the local selection strategy falls short in
handling global QoS requirements. Solutions based on global optimization, on
the other hand, can handle global constraints, but their poor performance
renders them inappropriate for applications with dynamic and real-time
requirements. In this paper we address this problem and propose a solution that
combines global optimization with local selection techniques to benefit from
the advantages of both worlds. The proposed solution consists of two steps:
first, we use mixed integer programming (MIP) to find the optimal decomposition
of global QoS constraints into local constraints. Second, we use distributed
local selection to find the best web services that satisfy these local
constraints. The results of experimental evaluation indicate that our approach
significantly outperforms existing solutions in terms of computation time while
achieving close-to-optimal results. Keywords: QoS, optimization, service composition, web services | |||
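A minimal sketch of the second step only (local selection under already-decomposed constraints); the MIP decomposition itself is omitted and all services, QoS values and bounds are hypothetical.

```python
# Sketch of the local-selection step only: once global QoS constraints have been
# decomposed into per-task local bounds (here given as hypothetical numbers),
# each task independently picks its cheapest candidate that satisfies them.
candidates = {
    "payment": [  # (service name, response_time_ms, availability, price)
        ("payA", 120, 0.99, 5.0),
        ("payB", 300, 0.999, 2.0),
    ],
    "shipping": [
        ("shipA", 200, 0.95, 1.0),
        ("shipB", 150, 0.98, 3.0),
    ],
}
local_bounds = {  # produced by the (omitted) MIP decomposition step
    "payment":  {"response_time_ms": 250, "availability": 0.98},
    "shipping": {"response_time_ms": 180, "availability": 0.97},
}

def select(task):
    feasible = [s for s in candidates[task]
                if s[1] <= local_bounds[task]["response_time_ms"]
                and s[2] >= local_bounds[task]["availability"]]
    return min(feasible, key=lambda s: s[3]) if feasible else None

composition = {task: select(task) for task in candidates}
print(composition)  # {'payment': ('payA', ...), 'shipping': ('shipB', ...)}
```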
| A trust management framework for service-oriented environments | | BIBAK | Full-Text | 891-900 | |
| William Conner; Arun Iyengar; Thomas Mikalsen; Isabelle Rouvellou; Klara Nahrstedt | |||
| Many reputation management systems have been developed under the assumption
that each entity in the system will use a variant of the same scoring function.
Much of the previous work in reputation management has focused on providing
robustness and improving performance for a given reputation scheme. In this
paper, we present a reputation-based trust management framework that supports
the synthesis of trust-related feedback from many different entities while also
providing each entity with the flexibility to apply different scoring functions
over the same feedback data for customized trust evaluations. We also propose a
novel scheme to cache trust values based on recent client activity. To evaluate
our approach, we implemented our trust management service and tested it on a
realistic application scenario in both LAN and WAN distributed environments.
Our results indicate that our trust management service can effectively support
multiple scoring functions with low overhead and high availability. Keywords: reputation, service-oriented architectures, trust management | |||
| Test case prioritization for regression testing of service-oriented business applications | | BIBAK | Full-Text | 901-910 | |
| Lijun Mei; Zhenyu Zhang; W. K. Chan; T. H. Tse | |||
| Regression testing assures the quality of modified service-oriented business
applications against unintended changes. However, a typical regression test
suite is large in size. Earlier execution of those test cases that may detect
failures is attractive. Many existing prioritization techniques order test
cases according to their respective coverage of program statements in a
previous version of the application. On the other hand, industrial
service-oriented business applications are typically written in orchestration
languages such as WS-BPEL and integrated with workflow steps and web services
via XPath and WSDL. Faults in these artifacts may cause the application to
extract wrong data from messages, leading to failures in service compositions.
Surprisingly, current regression testing research hardly considers these
artifacts. We propose a multilevel coverage model to capture the business
process, XPath, and WSDL from the perspective of regression testing. We develop
a family of test case prioritization techniques atop the model. Empirical
results show that our techniques can achieve significantly higher rates of
fault detection than existing techniques. Keywords: WSDL, XPath, service orientation, test case prioritization | |||
| Why is the web loosely coupled?: a multi-faceted metric for service design | | BIBAK | Full-Text | 911-920 | |
| Cesare Pautasso; Erik Wilde | |||
| Loose coupling is often quoted as a desirable property of systems
architectures. One of the main goals of building systems using Web technologies
is to achieve loose coupling. However, given the lack of a widely accepted
definition of this term, it becomes hard to use coupling as a criterion to
evaluate alternative Web technology choices, as all options may exhibit, and
claim to provide, some kind of "loose" coupling effects. This paper presents a
systematic study of the degree of coupling found in service-oriented systems
based on a multi-faceted approach. Thanks to the metric introduced in this
paper, coupling is no longer a one-dimensional concept with loose coupling
found somewhere in between tight coupling and no coupling. The paper shows how
the metric can be applied to real-world examples in order to support and
improve the design process of service-oriented systems. Keywords: HTTP, RPC, SOA, loose coupling, rest, tight coupling, web services, ws-* | |||
| Highly scalable web applications with zero-copy data transfer | | BIBAK | Full-Text | 921-930 | |
| Toyotaro Suzumura; Michiaki Tatsubori; Scott Trent; Akihiko Tozawa; Tamiya Onodera | |||
| The performance of server-side applications is becoming increasingly
important as more applications exploit the Web application model. Extensive
work has been done to improve the performance of individual software components
such as Web servers and programming language runtimes. This paper describes a
novel approach to boost Web application performance by improving inter-process
communication between a programming language runtime and Web server runtime.
The approach reduces redundant processing for memory copying and the context
switch overhead between user space and kernel space by exploiting the zero-copy
data transfer methodology, such as the sendfile system call. In order to
transparently utilize this optimization feature with existing Web applications,
we propose enhancements of the PHP runtime, FastCGI protocol, and Web server.
Our proposed approach achieves a 126% performance improvement with
micro-benchmarks and a 44% performance improvement for a standard Web
benchmark, SPECweb2005. Keywords: PHP, fastcgi, scripting, sendfile, web server, zero copy | |||
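The optimization rests on sendfile-style zero-copy transfer. The sketch below only illustrates that underlying system call (exposed in Python as os.sendfile); it is not the paper's PHP/FastCGI integration.

```python
# Minimal illustration of zero-copy transfer with the sendfile system call
# (Python exposes it as os.sendfile); this is not the paper's PHP/FastCGI
# enhancement, just the underlying mechanism it builds on.
import os
import socket

def serve_file_once(path, host="127.0.0.1", port=8080):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _ = srv.accept()
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # Bytes move file -> socket inside the kernel; no copy into user space.
            sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)
    conn.close()
    srv.close()
```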
| REST-based management of loosely coupled services | | BIBAK | Full-Text | 931-940 | |
| Heiko Ludwig; Jim Laredo; Kamal Bhattacharya; Liliana Pasquale; Bruno Wassermann | |||
| Applications increasingly make use of the distributed platform that the
World Wide Web provides -- be it as a Software-as-a-Service such as
salesforce.com, an application infrastructure such as facebook.com, or a
computing infrastructure such as a "cloud". A common characteristic of
applications of this kind is that they are deployed on infrastructure or make
use of components that reside in different management domains. Current service
management approaches and systems, however, often rely on a centrally managed
configuration management database (CMDB), which is the basis for centrally
orchestrated service management processes, in particular change management and
incident management. The distribution of management responsibility of WWW based
applications requires a decentralized approach to service management. This
paper proposes an approach to decentralized service management based on
distributed configuration management and service process co-ordination, making
use of RESTful access to configuration information and ATOM-based distribution of
updates as a novel foundation for service management processes. Keywords: discovery, loosely coupled systems, rest, service management | |||
| Co-browsing dynamic web pages | | BIBAK | Full-Text | 941-950 | |
| Dietwig Lowet; Daniel Goergen | |||
| Collaborative browsing, or co-browsing, is the co-navigation of the web with
other people at-a-distance, supported by software that takes care of
synchronizing the browsers. Current state-of-the-art solutions are able to do
co-browsing of "static web pages", and do not support the synchronization of
JavaScript interactions. However, currently many web pages use JavaScript and
Ajax techniques to create highly dynamic and interactive web applications. In
this paper, we describe two approaches for co-browsing that both support the
synchronization of the JavaScript and Ajax interactions of dynamic web pages.
One approach is based on synchronizing the output of the JavaScript engine by
sending over the changes made on the DOM tree. The other approach is based on
synchronizing the input of the JavaScript engine by synchronizing UI events and
incoming data. Since the latter solution offers a better user experience and is
more scalable, it is elaborated in more detail. An important aspect of both
approaches is that they operate at the DOM level. Therefore, the client-side
can be implemented in JavaScript and no browser extensions are required. To the
best of the authors' knowledge this is the first DOM-level co-browsing solution
that also enables co-browsing of the dynamic interaction parts of web pages.
The presented co-browsing solution has been implemented in a research
demonstrator which allows users to do co-browsing of web-applications on
browser-based networked televisions. Keywords: co-browsing, collaboration, collaborative computing, shared browsing, web4ce | |||
| HTML templates that fly: a template engine approach to automated offloading from server to client | | BIBAK | Full-Text | 951-960 | |
| Michiaki Tatsubori; Toyotaro Suzumura | |||
| Web applications often use HTML templates to separate the webpage
presentation from its underlying business logic and objects. This is now the de
facto standard programming model for Web application development. This paper
proposes a novel implementation for existing server-side template engines,
FlyingTemplate, for (a) reduced bandwidth consumption in Web application
servers, and (b) off-loading HTML generation tasks to Web clients. Instead of
producing a fully-generated HTML page, the proposed template engine produces a
skeletal script which includes only the dynamic values of the template
parameters and the bootstrap code that runs on a Web browser at the client
side. It retrieves a client-side template engine and the payload templates
separately. With the goals of efficiency, implementation transparency,
security, and standards compliance in mind, we developed FlyingTemplate with
two design principles: effective browser cache usage, and reasonable
compromises which restrict the template usage patterns and relax the security
policies slightly but in a controllable way. This approach allows typical
template-based Web applications to run effectively with FlyingTemplate. As an
experiment, we tested the SPECweb2005 banking application using FlyingTemplate
without any other modifications and saw throughput improvements from 1.6x to
2.0x in its best mode. In addition, FlyingTemplate can enforce compliance with
a simple security policy, thus addressing the security problems of
client-server partitioning in the Web environment. Keywords: client-server partitioning, template engines, web applications | |||
| Characterizing insecure javascript practices on the web | | BIBAK | Full-Text | 961-970 | |
| Chuan Yue; Haining Wang | |||
| JavaScript is an interpreted programming language most often used for
enhancing webpage interactivity and functionality. It has powerful capabilities
to interact with webpage documents and browser windows; however, it has also
opened the door for many browser-based security attacks. Insecure engineering
practices of using JavaScript may not directly lead to security breaches, but
they can create new attack vectors and greatly increase the risks of
browser-based attacks. In this paper, we present the first measurement study on
insecure practices of using JavaScript on the Web. Our focus is on the insecure
practices of JavaScript inclusion and dynamic generation, and we examine their
severity and nature on 6,805 unique websites. Our measurement results reveal
that insecure JavaScript practices are common at various websites: (1) at least
66.4% of the measured websites manifest the insecure practices of including
JavaScript files from external domains into the top-level documents of their
webpages; (2) over 44.4% of the measured websites use the dangerous eval()
function to dynamically generate and execute JavaScript code on their webpages;
and (3) in JavaScript dynamic generation, using the document.write() method and
the innerHTML property is much more popular than using the relatively secure
technique of creating script elements via DOM methods. Our analysis indicates
that safe alternatives to these insecure practices exist in common cases and
ought to be adopted by website developers and administrators for reducing
potential security risks. Keywords: AST tree matching, execution-based measurement, javascript, same origin
policy, security, web engineering | |||
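A crude, regex-based sketch of how the two practices highlighted here (cross-domain script inclusion and eval-based generation) might be counted on a single page; the authors' execution-based measurement framework is far more thorough.

```python
# Rough sketch of counting the two insecure practices discussed above on one
# page: cross-domain <script src=...> inclusion and eval()-based generation.
# This is a simple heuristic, not the authors' measurement framework.
import re
from urllib.parse import urlparse

def insecure_practices(html, page_url):
    page_domain = urlparse(page_url).netloc
    srcs = re.findall(r'<script[^>]+src=["\']([^"\']+)["\']', html, re.I)
    external = [s for s in srcs
                if urlparse(s).netloc and urlparse(s).netloc != page_domain]
    inline = re.findall(r'<script[^>]*>(.*?)</script>', html, re.I | re.S)
    evals = sum(js.count("eval(") for js in inline)
    doc_writes = sum(js.count("document.write(") for js in inline)
    return {"external_includes": len(external), "eval_calls": evals,
            "document_write_calls": doc_writes}

print(insecure_practices(
    '<script src="http://ads.example.net/a.js"></script>'
    '<script>eval("1+1"); document.write("<p>hi</p>");</script>',
    "http://www.example.com/index.html"))
```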
| Extracting article text from the web with maximum subsequence segmentation | | BIBAK | Full-Text | 971-980 | |
| Jeff Pasternack; Dan Roth | |||
| Much of the information on the Web is found in articles from online news
outlets, magazines, encyclopedias, review collections, and other sources.
However, extracting this content from the original HTML document is complicated
by the large amount of less informative and typically unrelated material such
as navigation menus, forms, user comments, and ads. Existing approaches tend to
be either brittle and demand significant expert knowledge and time (manual or
tool-assisted generation of rules or code), necessitate labeled examples for
every different page structure to be processed (wrapper induction), require
relatively uniform layout (template detection), or, as with Visual Page
Segmentation (VIPS), are computationally expensive. We introduce maximum
subsequence segmentation, a method of global optimization over token-level
local classifiers, and apply it to the domain of news websites. Training
examples are easy to obtain, both learning and prediction are linear time, and
results are excellent (our semi-supervised algorithm yields an overall F1-score
of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect
block-selection heuristic. We also evaluate against the recent CleanEval shared
task with surprisingly good cross-task performance cleaning general web pages,
exceeding the top "text-only" score (based on Levenshtein distance), 87.8%
versus 84.1%. Keywords: maximum subsequence, page segmentation, text extraction | |||
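The core subroutine implied by the title is a maximum-sum contiguous subsequence over token-level classifier scores; a minimal Kadane-style sketch with hypothetical scores is shown below.

```python
# Minimal sketch of the "maximum subsequence" idea: given per-token scores from
# a local classifier (positive ~ article text, negative ~ boilerplate), find the
# contiguous token span with the largest total score (Kadane's algorithm).
# The scores below are hypothetical.
def max_subsequence(scores):
    best_sum, best_span = float("-inf"), (0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i + 1)
    return best_span, best_sum

token_scores = [-2, -1, 3, 4, -1, 5, -3, -4, 1, -2]
print(max_subsequence(token_scores))   # ((2, 6), 11): tokens 2..5 form the article body
```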
| Extracting data records from the web using tag path clustering | | BIBAK | Full-Text | 981-990 | |
| Gengxin Miao; Junichi Tatemura; Wang-Pin Hsiung; Arsany Sawires; Louise E. Moser | |||
| Fully automatic methods that extract lists of objects from the Web have been
studied extensively. Record extraction, the first step of this object
extraction process, identifies a set of Web page segments, each of which
represents an individual object (e.g., a product). State-of-the-art methods
suffice for simple search, but they often fail to handle more complicated or
noisy Web page structures due to a key limitation -- their greedy manner of
identifying a list of records through pairwise comparison (i.e., similarity
match) of consecutive segments. This paper introduces a new method for record
extraction that captures a list of objects in a more robust way based on a
holistic analysis of a Web page. The method focuses on how a distinct tag path
appears repeatedly in the DOM tree of the Web document. Instead of comparing a
pair of individual segments, it compares a pair of tag path occurrence patterns
(called visual signals) to estimate how likely these two tag paths represent
the same list of objects. The paper introduces a similarity measure that
captures how closely the visual signals appear and interleave. Clustering of
tag paths is then performed based on this similarity measure, and sets of tag
paths that form the structure of data records are extracted. Experiments show
that this method achieves higher accuracy than previous methods. Keywords: clustering, data record extraction, information extraction | |||
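A small sketch of the "tag path occurrence pattern" idea: collect document-order positions per tag path and compare how their occurrences overlap. The similarity function here is a crude stand-in, not the paper's measure.

```python
# Sketch of the "visual signal" idea: record, for each distinct tag path, the
# document-order positions at which it occurs. Paths whose occurrence patterns
# interleave closely are candidates for the same list of records.
from html.parser import HTMLParser
from collections import defaultdict

class TagPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.pos, self.signals = [], 0, defaultdict(list)
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.pos += 1
        self.signals["/".join(self.stack)].append(self.pos)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def overlap(a, b):
    """Crude similarity: fraction of occurrences falling inside the shared position range."""
    lo, hi = max(a[0], b[0]), min(a[-1], b[-1])
    inside = sum(lo <= p <= hi for p in a + b)
    return inside / (len(a) + len(b))

html = "<ul><li><a>x</a></li><li><a>y</a></li><li><a>z</a></li></ul>"
c = TagPathCollector()
c.feed(html)
paths = {p: v for p, v in c.signals.items() if len(v) > 1}
print(paths)
print(overlap(paths["ul/li"], paths["ul/li/a"]))
```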
| Sitemaps: above and beyond the crawl of duty | | BIBAK | Full-Text | 991-1000 | |
| Uri Schonfeld; Narayanan Shivakumar | |||
| Comprehensive coverage of the public web is crucial to web search engines.
Search engines use crawlers to retrieve pages and then discover new ones by
extracting the pages' outgoing links. However, the set of pages reachable from
the publicly linked web is estimated to be significantly smaller than the
invisible web, the set of documents that have no incoming links and can only be
retrieved through web applications and web forms. The Sitemaps protocol is a
fast-growing web protocol supported jointly by major search engines to help
content creators and search engines unlock this hidden data by making it
available to search engines. In this paper, we perform a detailed study of how
"classic" discovery crawling compares with Sitemaps, in key measures such as
coverage and freshness over key representative websites as well as over
billions of URLs seen at Google. We observe that Sitemaps and discovery
crawling complement each other very well, and offer different tradeoffs. Keywords: crawling, metrics, quality, search engines, sitemaps | |||
| Performing grouping and aggregate functions in XML queries | | BIBAK | Full-Text | 1001-1010 | |
| Huayu Wu; Tok Wang Ling; Liang Xu; Zhifeng Bao | |||
| Since more and more business data are represented in XML format, there is a
compelling need of supporting analytical operations in XML queries.
Particularly, the latest version of XQuery proposed by W3C, XQuery 1.1,
introduces a new construct to explicitly express grouping operation in FLWOR
expression. Existing works in XML query processing mainly focus on physically
matching the query structure over the XML document. Given an explicit grouping
operation in a query, how to efficiently compute grouping and aggregate
functions over an XML document has not been well studied. In this paper, we extend
our previous XML query processing algorithm, VERT, to efficiently perform
grouping and aggregate functions in queries. The main technique of our approach
is introducing relational tables to index values. Query pattern matching and
aggregation computing are both conducted with table indices. We also propose
two semantic optimizations to further improve the query performance. Finally we
present experimental results to validate the efficiency of our approach, over
other existing approaches. Keywords: aggregate function, grouping, query processing, xml | |||
| XQuery in the browser | | BIBAK | Full-Text | 1011-1020 | |
| Ghislain Fourny; Markus Pilman; Daniela Florescu; Donald Kossmann; Tim Kraska; Darin McBeath | |||
| Since the invention of the Web, the browser has become more and more
powerful. By now, it is a programming and execution environment in itself. The
predominant language to program applications in the browser today is
JavaScript. With browsers becoming more powerful, JavaScript has been extended
and new layers have been added (e.g., DOM-Support and XPath). Today, JavaScript
is very successful and applications and GUI features implemented in the browser
have become increasingly complex. The purpose of this paper is to improve the
programmability of Web browsers by enabling the execution of XQuery programs in
the browser. Although XQuery could ideally replace JavaScript entirely, it is
possible to run it in addition to JavaScript for more flexibility. Furthermore,
it allows instant code migration from the server to the client and vice-versa.
This enables a significant simplification of the technology stack. The
intuition is that programming the browser involves mostly XML (i.e., DOM)
navigation and manipulation, and the XQuery family of W3C standards were
designed exactly for that purpose. The paper proposes extensions to XQuery for
Web browsers and gives a number of examples that demonstrate the usefulness of
XQuery for the development of AJAX-style applications. Furthermore, the paper
presents the design of an XQuery plug-in for Microsoft's Internet Explorer. The
paper also gives examples of applications which were developed with the help of
this plug-in. Keywords: CSS, HTML, XHTML, XML, XQuery, browser, client-side programming, dom,
events, javascript, mash-up, script, scripting, stylesheets | |||
| Answering approximate queries over autonomous web databases | | BIBAK | Full-Text | 1021-1030 | |
| Xiangfu Meng; Z. M. Ma; Li Yan | |||
| To deal with the problem of empty or too few answers returned from a Web
database in response to a user query, this paper proposes a novel approach to
provide relevant and ranked query results. Based on the user original query, we
speculate how much the user cares about each specified attribute and assign a
corresponding weight to it. This original query is then rewritten as an
approximate query by relaxing the query criteria range. The relaxation order of
all specified attributes and the relaxed degree on each specified attribute are
varied with the attribute weights. For the approximate query results, we
generate users' contextual preferences from database workload and use them to
create a priori orders of tuples in an off-line preprocessing step. Only a few
representative orders are saved, each corresponding to a set of contexts. Then,
these orders and associated contexts are used at query time to expeditiously
provide ranked answers. Results of a preliminary user study demonstrate that
our query relaxation and results ranking methods can capture the user's
preferences effectively. The efficiency and effectiveness of our approach is
also demonstrated by experimental results. Keywords: query relaxation, query results ranking, top-k., web database | |||
| Analyzing seller practices in a Brazilian marketplace | | BIBAK | Full-Text | 1031-1040 | |
| Adriano Pereira; Diego Duarte; Wagner, Jr. Meira; Virgilio Almeida; Paulo Góes | |||
| E-commerce is growing at an exponential rate. In the last decade, there has
been an explosion of online commercial activity enabled by World Wide Web
(WWW). These days, many consumers are less attracted to online auctions,
preferring to buy merchandise quickly using fixed-price negotiations. Sales at
Amazon.com, the leader in online sales of fixed-price goods, rose 37% in the
first quarter of 2008. At eBay, where auctions make up 58% of the site's sales,
revenue rose 14%. In Brazil, probably due to cultural influence, online auctions
have not been popular. This work presents a characterization and analysis of
fixed-price online negotiations. Using actual data from a Brazilian
marketplace, we analyze seller practices, considering seller profiles and
strategies. We show that different sellers adopt strategies according to their
interests, abilities and experience. Moreover, we confirm that choosing a
selling strategy is not simple, since it is important to consider the seller's
characteristics to evaluate the applicability of a strategy. The work also
provides a comparative analysis of some selling practices in Brazil with
popular worldwide marketplaces. Keywords: e-commerce, e-markets, marketplaces, selling practices | |||
| A geographical analysis of knowledge production in computer science | | BIBAK | Full-Text | 1041-1050 | |
| Guilherme Vale Menezes; Nivio Ziviani; Alberto H. F. Laender; Virgílio Almeida | |||
| We analyze knowledge production in Computer Science by means of coauthorship
networks. For this, we consider 30 graduate programs of different regions of
the world, being 8 programs in Brazil, 16 in North America (3 in Canada and 13
in the United States), and 6 in Europe (2 in France, 1 in Switzerland and 3 in
the United Kingdom). We use a dataset that consists of 176,537 authors and
352,766 publication entries distributed among 2,176 publication venues. The
results obtained for different metrics of collaboration social networks
indicate that the process of knowledge creation has changed differently for each
region. Research is increasingly done in teams across different fields of
Computer Science. The size of the giant component indicates the existence of
isolated collaboration groups in the European network, in contrast to the
degree of connectivity found in the Brazilian and North-American counterparts.
We also analyzed the temporal evolution of the social networks representing the
three regions. The number of authors per paper experienced an increase in a
time span of 12 years. We observe that the number of collaborations between
authors grows faster than the number of authors, benefiting from the existing
network structure. The temporal evolution shows differences between
well-established fields, such as Databases and Computer Architecture, and
emerging fields, like Bioinformatics and Geoinformatics. The patterns of
collaboration analyzed in this paper contribute to an overall understanding of
Computer Science research in different geographical regions that could not be
achieved without the use of complex networks and a large publication database. Keywords: coauthorship networks, collaboration social networks, computer science | |||
| Competitive analysis from click-through log | | BIBAK | Full-Text | 1051-1052 | |
| Gang Wang; Jian Hu; Yunzhang Zhu; Hua Li; Zheng Chen | |||
| Existing keyword suggestion tools from various search engine companies can
automatically suggest keywords related to the advertisers' products or
services, factoring in simple statistics of the keywords, such as search volume,
cost per click (CPC), etc. However, the nature of the generalized Second Price
Auction suggests that a better understanding of the competitors' keyword selection
and bidding strategies helps to win the auction, rather than relying only
on general search statistics. In this paper, we propose a novel keyword
suggestion strategy, called Competitive Analysis, to explore the keyword based
competition relationships among advertisers and eventually help advertisers to
build campaigns with better performance. The experimental results demonstrate
that the proposed Competitive Analysis can both help advertisers to promote
their product selling and generate more revenue to the search engine companies. Keywords: competitive analysis, keyword advertising, keyword suggestion | |||
| Predicting click through rate for job listings | | BIBAK | Full-Text | 1053-1054 | |
| Manish Gupta | |||
| Click Through Rate (CTR) is an important metric for ad systems, job portals, and
recommendation systems. CTR impacts the publisher's revenue and the advertiser's bid
amounts in "pay for performance" business models. We learn regression models
using features of the job, the optional click history of the job, and features of "related"
jobs. We show that our models predict CTR much better than predicting the average CTR
for all job listings, even in the absence of click history for the job listing. Keywords: CPC, CTR, GBDT, click through rate, gradient boosted decision trees, jobs,
linear regression, prediction, treenet | |||
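Given the keywords, a minimal sketch of the modelling setup might look as follows, fitting a gradient boosted regressor on hypothetical job features and comparing against the predict-the-average baseline; the feature set and data are invented for illustration and assume scikit-learn is available.

```python
# Minimal sketch of the setup described above: regress CTR on job features with
# gradient boosted trees and compare to the "average CTR" baseline.
# Features and data are hypothetical, not the author's feature set.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Columns: [title_length, salary_listed (0/1), days_since_posted, related_jobs_avg_ctr]
X = rng.random((500, 4))
true_ctr = 0.02 + 0.05 * X[:, 3] + 0.01 * X[:, 1] - 0.01 * X[:, 2]
y = np.clip(true_ctr + rng.normal(0, 0.005, 500), 0, 1)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X[:400], y[:400])
baseline = y[:400].mean()                       # "predict average CTR for all listings"
pred = model.predict(X[400:])
print("model MAE   :", np.abs(pred - y[400:]).mean())
print("baseline MAE:", np.abs(baseline - y[400:]).mean())
```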
| Query clustering using click-through graph | | BIBAK | Full-Text | 1055-1056 | |
| Jeonghee Yi; Farzin Maghoul | |||
| In this paper we describe a problem of discovering query clusters from a
click-through graph of web search logs. The graph consists of a set of web
search queries, a set of pages selected for the queries, and a set of directed
edges that connects a query node and a page node clicked by a user for the
query. The proposed method extracts all maximal bipartite cliques (bicliques)
from a click-through graph and computes an equivalence set of queries (i.e., a
query cluster) from the maximal bicliques. A cluster of queries is formed from
the queries in a biclique. We present a scalable algorithm that enumerates all
maximal bicliques from the click-through graph. We have conducted experiments
on Yahoo web search queries and the result is promising. Keywords: biclique, click-through graph, equivalence set, maximal bipartite clique,
query clustering, query intent | |||
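As a simplified illustration of the clustering target: queries whose clicked-page sets coincide induce a complete bipartite subgraph (biclique) and fall into one equivalence set. The paper's enumeration of all maximal bicliques is substantially more involved than this grouping; the click data below is hypothetical.

```python
# Simplified illustration: group queries that share exactly the same set of
# clicked pages; each such group, together with its pages, forms a biclique.
from collections import defaultdict

clicks = [  # (query, clicked page) edges from a hypothetical click-through graph
    ("cheap flights", "kayak.com"), ("cheap flights", "expedia.com"),
    ("low cost airfare", "kayak.com"), ("low cost airfare", "expedia.com"),
    ("python tutorial", "docs.python.org"),
]

pages_per_query = defaultdict(set)
for q, p in clicks:
    pages_per_query[q].add(p)

clusters = defaultdict(list)
for q, pages in pages_per_query.items():
    clusters[frozenset(pages)].append(q)

for pages, queries in clusters.items():
    print(sorted(queries), "->", sorted(pages))
# ['cheap flights', 'low cost airfare'] -> ['expedia.com', 'kayak.com']
# ['python tutorial'] -> ['docs.python.org']
```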
| An effective semantic search technique using ontology | | BIBAK | Full-Text | 1057-1058 | |
| Jihyun Lee; Jun-Ki Min; Chin-Wan Chung | |||
| In this paper, we present a semantic search technique considering the type
of desired Web resources and the semantic relationships between the resources
and the query keywords in the ontology. In order to effectively retrieve the
most relevant top-k resources, we propose a novel ranking model. To do this, we
devise a measure to determine the weight of the semantic relationship. In
addition, we consider the number of meaningful semantic relationships between a
resource and keywords, the coverage of keywords, and the distinguishability of
keywords. Through experiments using real datasets, we observe that our ranking
model provides more accurate semantic search results compared to existing
ranking models. Keywords: ontology, ranking, semantic relationship, semantic search, semantic web | |||
| Relationalizing RDF stores for tools reusability | | BIBAK | Full-Text | 1059-1060 | |
| Sunitha Ramanujam; Anubha Gupta; Latifur Khan; Steven Seida; Bhavani Thuraisingham | |||
| The emergence of Semantic Web technologies and standards such as Resource
Description Framework (RDF) has introduced novel data storage models such as
the RDF Graph Model. In this paper, we present a research effort called R2D,
which attempts to bridge the gap between RDF and RDBMS concepts by presenting a
relational view of RDF data stores. Thus, R2D is essentially a relational
wrapper around RDF stores that aims to make the variety of stable relational
tools that are currently in the market available to RDF stores without data
duplication and synchronization issues. Keywords: RDBMs, data interoperability, query processing, resource description
framework (rdf), visualization tools | |||
| C-SPARQL: SPARQL for continuous querying | | BIBAK | Full-Text | 1061-1062 | |
| Davide Francesco Barbieri; Daniele Braga; Stefano Ceri; Emanuele Della Valle; Michael Grossniklaus | |||
| C-SPARQL is an extension of SPARQL to support continuous queries, registered
and continuously executed over RDF data streams, considering windows of such
streams. Supporting streams in RDF format guarantees interoperability and opens
up important applications, in which reasoners can deal with knowledge that
evolves over time. We present C-SPARQL by means of examples in Urban Computing. Keywords: RDF, SPARQL, data streams | |||
| Interactive search in XML data | | BIBAK | Full-Text | 1063-1064 | |
| Guoliang Li; Jianhua Feng; Lizhu Zhou | |||
| In a traditional keyword-search system over XML data, a user composes a
keyword query, submits it to the system, and retrieves relevant subtrees. In
the case where the user has limited knowledge about the data, often the user
feels "left in the dark" when issuing queries, and has to use a try-and-see
approach for finding information. In this paper, we study a new
information-access paradigm for XML data, called "Inks," in which the system
searches on the underlying data "on the fly" as the user types in query
keywords. Inks extends existing XML keyword search methods by interactively
answering queries. We propose effective indices, early-termination techniques,
and efficient search algorithms to achieve a high interactive speed. We have
implemented our algorithm, and the experimental results show that our method
achieves high search efficiency and result quality. Keywords: interactive search, keyword search, xml | |||
| Is there anything worth finding on the semantic web? | | BIBAK | Full-Text | 1065-1066 | |
| Harry Halpin | |||
| There has recently been an upsurge of interest in the possibilities of
combining structured data and ad-hoc information retrieval from traditional
hypertext. In this experiment, we run queries extracted from a query log of a
major search engine against the Semantic Web to discover if the Semantic Web
has anything of interest to the average user. We show that there is indeed much
information on the Semantic Web that could be relevant for many queries for
people, places and even abstract concepts, although they are overwhelmingly
clustered around a Semantic Web-enabled export of Wikipedia known as DBPedia. Keywords: information retrieval, linked data, search, semantic web | |||
| Instance-based probabilistic reasoning in the semantic web | | BIBAK | Full-Text | 1067-1068 | |
| Pedro Oliveira; Paulo Gomes | |||
| Most of the approaches for dealing with uncertainty in the Semantic Web rely
on the principle that this uncertainty is already asserted. In this paper, we
propose a new approach to learn and reason about uncertainty in the Semantic
Web. Using instance data, we learn the uncertainty of an OWL ontology, and use
that information to perform probabilistic reasoning on it. For this purpose, we
use Markov logic, a new representation formalism that combines logic with
probabilistic graphical models. Keywords: markov logic, probabilistic reasoning, semantic web | |||
| A flight meta-search engine with metamorph | | BIBAK | Full-Text | 1069-1070 | |
| Bernhard Kruepl; Wolfgang Holzinger; Yansen Darmaputra; Robert Baumgartner | |||
| We demonstrate a flight meta-search engine that is based on the Metamorph
framework. Metamorph provides mechanisms to model web forms together with the
interactions which are needed to fulfil a request, and can generate interaction
sequences that pose queries using these web forms and collect the results. In
this paper, we discuss an interesting new feature that makes use of the forms
themselves as an information source. We show how data can be extracted from web
forms (rather than the data behind web forms) to generate a graph of flight
connections between cities.
The flight connection graph allows us to vastly reduce the number of queries that the engine sends to airline websites in the most interesting search scenarios; those that involve the controversial practice of creative ticketing, in which agencies attempt to find lower price fares by using more than one airline for a journey. We describe a system which attains data from a number of websites to identify promising routes and prune the search tree. Heuristics that make use of geographical information and an estimation of cost based on historical data are employed. The results are then made available to improve the quality of future search requests. Keywords: hidden web, web data extraction, web form extraction, web form mapping | |||
| Thumbs-up: a game for playing to rank search results | | BIBAK | Full-Text | 1071-1072 | |
| Ali Dasdan; Chris Drome; Santanu Kolay | |||
| Human computation is an effective way to channel human effort spent playing
games to solving computational problems that are easy for humans but difficult
for computers to automate. We propose Thumbs-Up, a new game for human
computation with the purpose of playing to rank search results. Our experience
from users shows that Thumbs-Up is not only fun to play, but produces more
relevant rankings than both a major search engine and optimal rank aggregation
using the Kemeny rule. Keywords: games with a purpose, human computation, online games, rank aggregation,
relevance, search engine | |||
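The baseline mentioned here, optimal rank aggregation under the Kemeny rule, can be brute-forced for a handful of items; the sketch below does exactly that over hypothetical player rankings (the rule is NP-hard in general, so this does not scale).

```python
# Brute-force sketch of the Kemeny-rule baseline mentioned above: pick the
# ordering that minimises total pairwise disagreements (Kendall tau distance)
# with the players' rankings. The rankings below are hypothetical.
from itertools import combinations, permutations

rankings = [  # each list orders results from best to worst, as voted by one player
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]

def kendall_tau(order, ranking):
    pos = {x: i for i, x in enumerate(ranking)}
    # count pairs ordered one way in `order` and the other way in `ranking`
    return sum(1 for x, y in combinations(order, 2) if pos[x] > pos[y])

def kemeny(rankings):
    items = rankings[0]
    return min(permutations(items),
               key=lambda order: sum(kendall_tau(order, r) for r in rankings))

print(kemeny(rankings))  # ('a', 'b', 'c', 'd') for these votes
```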
| Search shortcuts: driving users towards their goals | | BIBAK | Full-Text | 1073-1074 | |
| Ranieri Baraglia; Fidel Cacheda; Victor Carneiro; Vreixo Formoso; Raffaele Perego; Fabrizio Silvestri | |||
| Giving suggestions to users of Web-based services is a common practice aimed
at enhancing their navigation experience. Major Web Search Engines usually
provide "Suggestions" under the form of queries that are, to some extent,
related to the current query typed by the user, and the knowledge learned from
the past usage of the system. In this work we introduce "Search Shortcuts" as
"Successful" queries that allowed users, in the past, to satisfy their information
needs. Differently from conventional suggestion techniques, our search
shortcuts allow effectiveness to be evaluated by exploiting a simple
train-and-test approach. We have applied several Collaborative Filtering
algorithms to this problem, evaluating them on a real query log data. We
generate the shortcuts from all user sessions belonging to the testing set, and
measure the quality of the shortcuts suggested by considering the similarity
between them and the navigational user behavior. Keywords: evaluation, model, search shortcut | |||
| A probabilistic model based approach for blended search | | BIBAK | Full-Text | 1075-1076 | |
| Ning Liu; Jun Yan; Zheng Chen | |||
| In this paper, we propose to model the blended search problem by assuming
conditional dependencies among queries, vertical search engines (VSEs), and search results. The
probability distributions of this model are learned from a search engine query
log through a unigram language model. Our experimental exploration shows that,
(1) a large number of queries in generic Web search have vertical search
intentions; and (2) our proposed algorithm can effectively blend vertical
search results into generic Web search, which can improve the Mean Average
Precision (MAP) by as much as 16% compared to traditional Web search without
blending. Keywords: blended search, language model, query log, vertical search | |||
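One ingredient named here, the unigram language model, can be sketched as follows: estimate smoothed term probabilities per vertical from (hypothetical) query-log text and score a new query's vertical intent; the paper's full model with conditional dependencies is richer than this.

```python
# Minimal sketch of the unigram-language-model ingredient: estimate P(w | vertical)
# from per-vertical query-log text (hypothetical here) and score a query's
# vertical intent by the smoothed product of its term probabilities.
import math
from collections import Counter

vertical_logs = {
    "images": "sunset photos beach photos wallpaper cat photos",
    "news":   "election results earthquake news stock market news today",
}

models = {}
for v, text in vertical_logs.items():
    counts = Counter(text.split())
    models[v] = (counts, sum(counts.values()), len(counts))

def log_prob(query, vertical):
    counts, total, vocab = models[vertical]
    # add-one smoothing so unseen terms do not zero out the score
    return sum(math.log((counts[w] + 1) / (total + vocab + 1)) for w in query.split())

query = "beach sunset photos"
scores = {v: log_prob(query, v) for v in models}
print(max(scores, key=scores.get), scores)   # 'images' scores higher for this query
```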
| Bucefalo: a tool for intelligent search and filtering for web-based personal health records | | BIBAK | Full-Text | 1077-1078 | |
| Francisco P. Romero; Jesus Serrano-Guerrero; Jose A. Olivas | |||
| In this poster, a tool named BUCEFALO is presented. This tool is specially
designed to improve the information retrieval tasks in web-based Personal
Health Records (PHR). This tool implements semantic and multilingual query
expansion techniques and information filtering algorithms in order to help
users find the most valuable information about a specific clinical case. The
filtering model is based on fuzzy prototype-based filtering, data quality
measures, user profiles and healthcare ontologies. The first experimental
results illustrate the feasibility of this tool. Keywords: information filtering, web-based personal health record | |||
| Dataplorer: a scalable search engine for the data web | | BIBAK | Full-Text | 1079-1080 | |
| Haofen Wang; Qiaoling Liu; Gui-Rong Xue; Yong Yu; Lei Zhang; Yue Pan | |||
| More and more structured information in the form of semantic data is
available nowadays. It offers a wide range of new possibilities, especially for
semantic search and Web data integration. However, its effective exploitation
still raises a number of challenges, e.g. usability, scalability and
uncertainty. In this paper, we present Dataplorer, a solution designed to
address these challenges. We address usability through the use of hybrid
queries and faceted search, while preserving scalability thanks to an
extension of the inverted index that supports this type of query. Moreover, Dataplorer
deals with uncertainty by means of a powerful ranking scheme to find relevant
results. Our experimental results show that our proposed approach is promising
and it makes us believe that it is possible to extend the current IR
infrastructure to query and search the Web of data. Keywords: faceted search, hybrid query, inverted index, ranking | |||
| Threshold selection for web-page classification with highly skewed class distribution | | BIBAK | Full-Text | 1081-1082 | |
| Xiaofeng He; Lei Duan; Yiping Zhou; Byron Dom | |||
| We propose a novel cost-efficient approach to threshold selection for binary
web-page classification problems with imbalanced class distributions. In many
binary-classification tasks the distribution of classes is highly skewed. In
such problems, using uniform random sampling in constructing sample sets for
threshold setting requires large sample sizes in order to include a
statistically sufficient number of examples of the minority class. On the other
hand, manually labeling examples is expensive and budgetary considerations
require that the size of sample sets be limited. These conflicting requirements
make threshold selection a challenging problem. Our method of sample-set
construction is a novel approach based on stratified sampling, in which
manually labeled examples are expanded to reflect the true class distribution
of the web-page population. Our experimental results show that using false
positive rate as the criterion for threshold setting results in lower-variance
threshold estimates than using other widely used accuracy measures such as F1
and precision. Keywords: binary classifier, skewed class distribution, stratified sampling, threshold
selection, web-page classification | |||
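
The abstract above sets a classification threshold by targeting a false positive rate on a stratified, re-weighted sample. A minimal sketch of that idea in Python (illustrative only, not the authors' code; the data layout and the per-example weights are assumptions):

```python
def select_threshold(scores, labels, weights, target_fpr):
    """Pick the lowest threshold whose weighted false-positive rate stays
    at or below target_fpr (hypothetical sketch, not the paper's code).

    scores  : classifier scores for a stratified sample
    labels  : 1 = positive (minority) class, 0 = negative class
    weights : per-example weights that re-expand the sample to the
              estimated true class distribution of the page population
    """
    neg_mass = sum(w for l, w in zip(labels, weights) if l == 0)
    best = None
    for t in sorted(set(scores), reverse=True):
        fp = sum(w for s, l, w in zip(scores, labels, weights)
                 if l == 0 and s >= t)
        if neg_mass > 0 and fp / neg_mass <= target_fpr:
            best = t          # lower thresholds admit more, so keep updating
        else:
            break             # FPR only grows as the threshold decreases
    return best
```
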
| Web-scale classification with naive bayes | | BIBAK | Full-Text | 1083-1084 | |
| Congle Zhang; Gui-Rong Xue; Yong Yu; Hongyuan Zha | |||
| The traditional Naive Bayes classifier performs miserably on web-scale
taxonomies. In this paper, we investigate the reasons behind such poor
performance. We discover that the low performance is not completely caused by
the intrinsic limitations of Naive Bayes, but mainly comes from two largely
ignored problems: the contradiction pair problem and the discriminative evidence
cancelation problem. We propose modifications that can alleviate the two
problems while preserving the advantages of Naive Bayes. The experimental
results show our modified Naive Bayes can significantly improve the performance
on real web-scale taxonomies. Keywords: naive bayes, web-scale taxonomy | |||
| News article extraction with template-independent wrapper | | BIBAK | Full-Text | 1085-1086 | |
| Junfeng Wang; Xiaofei He; Can Wang; Jian Pei; Jiajun Bu; Chun Chen; Ziyu Guan; Gang Lu | |||
| We consider the problem of template-independent news extraction. The
state-of-the-art news extraction method is based on template-level wrapper
induction, which has two serious limitations. 1) It cannot correctly extract
pages belonging to an unseen template until the wrapper for that template has
been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of
websites, because any change of a template may lead to the invalidation of the
corresponding wrapper. In this paper we formalize news extraction as a machine
learning problem and learn a template-independent wrapper using a very small
number of labeled news pages from a single site. Novel features dedicated to
news titles and bodies are developed respectively. Correlations between the
news title and the news body are exploited. Our template-independent wrapper
can extract news pages from different sites regardless of templates. In
experiments, a wrapper is learned from 40 pages from a single news site. It
achieved 98.1% accuracy over 3,973 news pages from 12 news sites. Keywords: data extraction | |||
| Graffiti: node labeling in heterogeneous networks | | BIBAK | Full-Text | 1087-1088 | |
| Ralitsa Angelova; Gjergji Kasneci; Fabian M. Suchanek; Gerhard Weikum | |||
| We introduce a multi-label classification model and algorithm for labeling
heterogeneous networks, where nodes belong to different types and different
types have different sets of classification labels. We present a graph-based
approach which models the mutual influence between nodes in the network as a
random walk. When viewing class labels as "colors", the random surfer is
"spraying" different node types with different color palettes; hence the name
Graffiti. We demonstrate the performance gains of our method by comparing it to
three state-of-the-art techniques for graph-based classification. Keywords: graph classification, web 2.0 ir | |||
| Graph based crawler seed selection | | BIBAK | Full-Text | 1089-1090 | |
| Shuyi Zheng; Pavel Dmitriev; C. Lee Giles | |||
| This paper identifies and explores the problem of seed selection in a
web-scale crawler. We argue that seed selection is not a trivial problem but a very
important one. Selecting proper seeds can increase the number of pages a
crawler will discover, and can result in a collection with more "good" and fewer
"bad" pages. Based on an analysis of the graph structure of the web, we
propose several seed selection algorithms. The effectiveness of these algorithms is
demonstrated by our experimental results on real web data. Keywords: crawler, graph analysis, pagerank, seed selection | |||
| Building term suggestion relational graphs from collective intelligence | | BIBAK | Full-Text | 1091-1092 | |
| Jyh-Ren Shieh; Yung-Huan Hsieh; Yang-Ting Yeh; Tse-Chung Su; Ching-Yung Lin; Ja-Ling Wu | |||
| This paper proposes an effective approach to providing relevant search terms
for conceptual Web search. A 'Semantic Term Suggestion' function is included
so that users can find the query term most appropriate to what they really
need. Conventional approaches to term suggestion involve extracting
frequently occurring key terms from retrieved documents; they must deal with
term extraction difficulties and interference from irrelevant documents. In
this paper, we propose a semantic term suggestion function called Collective
Intelligence based Term Suggestion (CITS). CITS provides a novel social-network-based
framework for relevant term suggestion built on a semantic graph of the
search term, without being limited to the specific query term. A visualization of
the semantic graph is presented to users to help them browse search results via
related terms in the graph. The search results are re-ranked each time
according to their relevance to the related terms in the entire query session.
Compared to two popular commercial search engines, a user study of 18 users on
50 search terms showed better user satisfaction and indicated the potential
usefulness of the proposed method in real-world search applications. Keywords: keyword expansion, re-ranking, social network | |||
| Towards intent-driven bidterm suggestion | | BIBAK | Full-Text | 1093-1094 | |
| William Chang; Patrick Pantel; Ana-Maria Popescu; Evgeniy Gabrilovich | |||
| In online advertising, pervasive in commercial search engines, advertisers
typically bid on only a few terms, and the scarcity of data makes ad matching
difficult. Suggesting additional bidterms can significantly improve ad
clickability and conversion rates. In this paper, we present a large-scale
bidterm suggestion system that models an advertiser's intent and finds new
bidterms consistent with that intent. Preliminary experiments show that our
system significantly increases the coverage of a state-of-the-art production
system used at Yahoo while maintaining comparable precision. Keywords: sponsored search | |||
| Advertising keyword generation using active learning | | BIBAK | Full-Text | 1095-1096 | |
| Hao Wu; Guang Qiu; Xiaofei He; Yuan Shi; Mingcheng Qu; Jing Shen; Jiajun Bu; Chun Chen | |||
| This paper proposes an efficient relevance feedback based interactive model
for keyword generation in sponsored search advertising. We formulate the
ranking of relevant terms as a supervised learning problem and suggest new
terms for the seed by leveraging user relevance feedback information. Active
learning is employed to select the most informative samples from a set of
candidate terms for user labeling. Experiments show our approach improves the
relevance of generated terms significantly with little user effort required. Keywords: active learning, keyword generation, sponsored search | |||
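
As a rough illustration of the uncertainty-sampling step described in the abstract above (a hedged sketch only; the classifier choice, features and function names are assumptions, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labeled, y_labeled, X_pool, n_queries=5):
    """One round of uncertainty sampling: train on the seed terms labeled
    so far, then return the indices of the candidate terms whose predicted
    relevance is closest to 0.5, i.e. the most informative ones to show
    the user for relevance feedback."""
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)      # smaller = more uncertain
    return np.argsort(uncertainty)[:n_queries]
```
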
| Gamesense | | BIBAK | Full-Text | 1097-1098 | |
| Lusong Li; Tao Mei; Chris Liu; Xian-Sheng Hua | |||
| This paper presents a novel game-like advertising system called GameSense,
which is driven by the compelling contents of online images. Given a Web page
which typically contains images, GameSense is able to select suitable images to
create online in-image games for advertising. The contextually relevant ads
(i.e., product logos) are embedded at appropriate positions within the online
games. The ads are selected based on not only textual relevance but also visual
content similarity. The game is able to provide viewers with a rich experience and
thus promote the embedded ads, yielding more effective advertising. Keywords: image advertising, online game | |||
| Rare item detection in e-commerce site | | BIBAK | Full-Text | 1099-1100 | |
| Dan Shen; Xiaoyuan Wu; Alvaro Bolivar | |||
| As the largest online marketplace in the world, eBay has a huge inventory
containing plenty of great rare items with potentially large numbers of
enthusiastic buyers. These items are obscured in the long tail of eBay item listings
and are hard to find through existing search or browsing methods. The eBay query
log shows that there is great demand from users for rare items. To
keep up with this demand, the paper proposes a method to automatically detect
rare items in eBay online listings. A large set of features relevant to the task
is investigated to filter items and further measure item rareness. Experiments
on the most rarity-demand-intensive domains show that the method
can effectively detect rare items (>90% precision). Keywords: long tail theory, rare item detection, rareness measure | |||
| A declarative framework for semantic link discovery over relational data | | BIBAK | Full-Text | 1101-1102 | |
| Oktie Hassanzadeh; Lipyeow Lim; Anastasios Kementsietsidis; Min Wang | |||
| In this paper, we present a framework for online discovery of semantic links
from relational data. Our framework is based on a declarative specification of
the linkage requirements by the user, which allows matching data items in many
real-world scenarios. These requirements are translated into queries that can run
over the relational data source, potentially using semantic knowledge to
enhance the accuracy of link discovery. Our framework lets data publishers
easily find and publish high-quality links to other data sources, and therefore
could significantly enhance the value of the data in the next generation of
the Web. Keywords: link discovery, linked data, record linkage, semantic web | |||
| Modeling semantics and structure of discussion threads | | BIBAK | Full-Text | 1103-1104 | |
| Chen Lin; Jiang-Ming Yang; Rui Cai; Xin-Jing Wang; Wei Wang; Lei Zhang | |||
| The abundant knowledge in web communities has motivated the research
interests in discussion threads. The dynamic nature of discussion threads poses
interesting and challenging problems for computer scientists. Although
techniques such as semantic models or structural models have been shown to be
useful in a number of areas, they are inefficient in understanding discussion
threads due to the temporal dependence among posts in a discussion thread. Such
dependence causes semantics and structure to be coupled with each other in
discussion threads. In this paper, we propose a sparse coding-based model named
SMSS to Simultaneously Model Semantic and Structure of discussion threads. Keywords: reply reconstruction, sparse coding, threaded discussion | |||
| Combining anchor text categorization and graph analysis for paid link detection | | BIBAK | Full-Text | 1105-1106 | |
| Kirill Nikolaev; Ekaterina Zudina; Andrey Gorshkov | |||
| In order to artificially boost the rank of commercial pages in search engine
results, search engine optimizers pay for links to these pages on other
websites. Identifying paid links is important for a web search engine to
produce highly relevant results. In this paper we introduce a novel method of
identifying such links. We start with training a classifier of anchor text
topics and analyzing web pages for diversity of their outgoing commercial
links. Then we use this information and analyze the link graph of the Russian Web
to find pages that sell links and sites that buy links, and to identify the paid
links. Testing on manually labeled samples showed the high effectiveness of the
algorithm. Keywords: categorization, language model, link analysis, machine learning, search
engines, web mining | |||
| Rethinking email message and people search | | BIBAK | Full-Text | 1107-1108 | |
| Sebastian Michel; Ingmar Weber | |||
| We show how a number of novel email search features can be implemented
without any kind of natural language processing (NLP) or advanced data mining.
Our approach inspects the email headers of all messages a user has ever sent or
received and it creates simple per-contact summaries, including simple
information about the message exchange history, the domain of the sender or
even the sender's gender. With these summaries advanced questions/tasks such as
"Who do I still need to reply to?" or "Find 'fun' messages sent by friends."
become possible. As a proof of concept, we implemented a Mozilla-Thunderbird
extension, adding powerful people search to the popular email client. Keywords: email, email search, inbox 2.0, people search | |||
| Purely URL-based topic classification | | BIBAK | Full-Text | 1109-1110 | |
| Eda Baykan; Monika Henzinger; Ludmila Marian; Ingmar Weber | |||
| Given only the URL of a web page, can we identify its topic? This is the
question that we examine in this paper. Usually, web pages are classified using
their content, but a URL-only classifier is preferable, (i) when speed is
crucial, (ii) to enable content filtering before an (objectionable) web page
is downloaded, (iii) when a page's content is hidden in images, (iv) to
annotate hyperlinks in a personalized web browser, without fetching the target
page, and (v) when a focused crawler wants to infer the topic of a target page
before devoting bandwidth to download it. We apply a machine learning approach
to the topic identification task and evaluate its performance in extensive
experiments on categorized web pages from the Open Directory Project (ODP).
When training separate binary classifiers for each topic, we achieve typical
F-measure values between 80 and 85, and a typical precision of around 85. We
also ran experiments on a small data set of university web pages. For the task
of classifying these pages into faculty, student, course and project pages, our
methods improve over previous approaches by 13.8 points of F-measure. Keywords: ODP, URL, topic classification | |||
| WPBench: a benchmark for evaluating the client-side performance of web 2.0 applications | | BIBAK | Full-Text | 1111-1112 | |
| Kaimin Zhang; Lu Wang; Xiaolin Guo; Aimin Pan; Bin Benjamin Zhu | |||
| In this paper, a benchmark called WPBench is reported to evaluate the
responsiveness of Web browsers for modern Web 2.0 applications. In WPBench,
variations of servers and networks are removed, so the benchmark result closely
matches what Web users would actually perceive. To achieve this, WPBench records
users' interactions with typical Web 2.0 applications, and then replays Web
navigations when benchmarking browsers. The replay mechanism can emulate the
actual user interactions and the characteristics of the servers and the
networks in a consistent way independent of browsers so that any browser
compliant to the standards can be benchmarked fairly. In addition to describing
the design and generation of WPBench, we also report the WPBench comparison
results on the responsiveness performance for three popular Web browsers:
Internet Explorer, Firefox and Chrome. Keywords: benchmark, browser, javascript, replay, web | |||
| MASTH proxy: an extensible platform for web overload control | | BIBAK | Full-Text | 1113-1114 | |
| Vipul Mathur; Sanket Dhopeshwarkar; Varsha Apte | |||
| Many overload control mechanisms for Web based applications aim to prevent
overload by setting limits on factors such as admitted load, number of server
threads, and buffer size. For this they need online measurements of metrics such as
response time, throughput, and resource utilization. This requires
instrumentation of the server by modifying server code, which may not be
feasible or desirable. An alternate approach is to use a proxy between the
clients and servers.
We have developed a proxy-based overload control platform called MASTH Proxy -- Multi-class Admission-controlled Self-Tuning HTTP Proxy. It records detailed measurements, supports multiple request classes, manages queues of HTTP requests, provides tunable parameters and enables easy implementation of dynamic overload control. This gives designers of overload control schemes a platform where they can concentrate on developing the core control logic, without the need to modify upstream server code. Keywords: admission control, multi-class, overload, proxy, web server | |||
| A messaging API for inter-widgets communication | | BIBAK | Full-Text | 1115-1116 | |
| Stéphane Sire; Micaël Paquier; Alain Vagner; Jérôme Bogaerts | |||
| Widget containers are used everywhere on the Web, ranging from
customizable start pages to Web desktops. In this poster, we describe the
extension of a widget container with an inter-widgets communication layer, as
well as the subsequent application programming interfaces (APIs) added to the
Widget object to support this feature. We present the benefits of a drag and
drop facility within widgets and conclude with a call for standardization of
inter-widgets communication on the Web. Keywords: communication, drag and drop, portal, start page, widget | |||
| Why are moved web pages difficult to find?: the WISH approach | | BIBAK | Full-Text | 1117-1118 | |
| Atsuyuki Morishima; Akiyoshi Nakamizo; Toshinari Iida; Shigeo Sugimoto; Hiroyuki Kitagawa | |||
| This paper addresses the problem of finding new locations of moved Web
pages. We discuss why the content-based approach has a limitation in solving
the problem and why it is important to exploit the knowledge on where to search
for the pages. Keywords: broken links, integrity management | |||
| Detecting soft errors by redirection classification | | BIBAK | Full-Text | 1119-1120 | |
| Taehyung Lee; Jinil Kim; Jin Wook Kim; Sung-Ryul Kim; Kunsoo Park | |||
| A soft error redirection is a URL redirection to a page that returns the
HTTP status code 200 (OK) but actually has no content relevant to the client
request. Since such redirections degrade the performance of web search engines
in many ways, it is highly desirable to remove as many of them as possible. We
propose a novel approach to detecting soft error redirections by analyzing
redirection logs collected during crawling operations. Experimental results on
a huge crawl data set show that our measure can classify soft error redirections
effectively. Keywords: search engine, soft 404, spam, url redirection | |||
| Automatic web service composition with abstraction and refinement | | BIBAK | Full-Text | 1121-1122 | |
| Hyunyoung Kil; Wonhong Nam; Dongwon Lee | |||
| The behavioral description based Web Service Composition (WSC) problem aims
at the automatic construction of a coordinator web service that controls a set
of web services to reach a goal state. However, solving the WSC problem exactly
with a realistic model is doubly-exponential in the number of variables in web
service descriptions. In this paper, we propose a novel efficient
approximation-based algorithm using automatic abstraction and refinement to
dramatically reduce the number of variables needed to solve the problem. Keywords: abstraction, refinement, service composition | |||
| Where to adapt dynamic service compositions | | BIBAK | Full-Text | 1123-1124 | |
| Bo Jiang; W. K. Chan; Zhenyu Zhang; T. H. Tse | |||
| Peer services depend on one another to accomplish their tasks, and their
structures may evolve. A service composition may be designed to replace its
member services whenever the quality of the composite service fails to meet
certain quality-of-service (QoS) requirements. Finding the services and service
invocation endpoints that have the greatest impact on quality is important for
guiding subsequent service adaptations. This paper proposes a technique that
samples the QoS of composite services and continually analyzes them to identify
artifacts for service adaptation. The preliminary results show that our
technique has the potential to effectively find such artifacts in services. Keywords: service adaptation, service composition | |||
| SGPS: a semantic scheme for web service similarity | | BIBAK | Full-Text | 1125-1126 | |
| Sourish Dasgupta; Satish Bhat; Yugyung Lee | |||
| Today's Web is becoming a platform on which services are dynamically interconnected
to produce a desired outcome. It is important to formalize the semantics of the
contextual elements of web services. In this paper, we propose a novel
technique called Semantic Genome Propagation Scheme (SGPS) for measuring
similarity between semantic concepts. We show how SGPS is used to compute a
multi-dimensional similarity between two services. We evaluate the SGPS
similarity measurement in terms of the similarity performance and scalability. Keywords: context, semantics, similarity, web services | |||
| Automated synthesis of composite services with correctness guarantee | | BIBAK | Full-Text | 1127-1128 | |
| Ting Deng; Jinpeng Huai; Xianxian Li; Zongxia Du; Huipeng Guo | |||
| In this paper, we propose a novel approach for composing existing web
services to satisfy correctness constraints on the design, including
freedom from deadlock and unspecified reception, as well as temporal constraints
expressed as Computation Tree Logic formulas. An automated synthesis algorithm based on
a learning algorithm is introduced, which guarantees that the composite service
coordinates the component services in the most general way while ensuring
correctness. We have implemented a prototype system and evaluated the effectiveness
and efficiency of our synthesis approach through an experimental study. Keywords: composition synthesis, correctness constraints, learning algorithm | |||
| User-centric content freshness metrics for search engines | | BIBAK | Full-Text | 1129-1130 | |
| Ali Dasdan; Xinh Huynh | |||
| In order to return relevant search results, a search engine must keep its
local repository synchronized to the Web, but it is usually impossible to
attain perfect freshness. Hence, it is vital for a production search engine
to continually monitor and improve repository freshness. Most previous
freshness metrics, formulated in the context of developing better
synchronization policies, focused on the web crawler while ignoring other parts
of a search engine. However, the freshness of documents in a web crawler does not
necessarily translate directly into the freshness of search results as seen by
users. We propose metrics for measuring freshness from a user's perspective,
which take into account the latency between when documents are crawled and when
they are viewed by users, as well as the variation in user click and view
frequency among different documents. We also describe a practical
implementation of these metrics that was used in a production search engine. Keywords: crawling, document age, freshness, latency, metrics, monitoring, search
engine | |||
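
The abstract above does not spell out the metric itself; a plausible user-centric variant, sketched below, weights each document's age at view time by how often it is actually viewed (the data layout and the formula are assumptions, not the paper's definition):

```python
def click_weighted_age(results):
    """Hypothetical user-centric freshness metric: the average age of the
    documents users actually see, weighted by how often they see them.

    results: iterable of (crawl_time, view_time, view_count) tuples,
             with times in seconds since the epoch.
    """
    total_views = sum(views for _, _, views in results)
    if total_views == 0:
        return 0.0
    weighted_age = sum((view_time - crawl_time) * views
                       for crawl_time, view_time, views in results)
    return weighted_age / total_views
```
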
| Reliability analysis using weighted combinational models for web-based software | | BIBAK | Full-Text | 1131-1132 | |
| Chao-Jung Hsu; Chin-Yu Huang | |||
| In the past, some research suggested that engineers can use combined
software reliability growth models (SRGMs) to obtain more accurate reliability
predictions during testing. In this paper, three weighted combinational models,
namely equal, linear, and nonlinear weighting, are proposed for reliability
estimation of web-based software. We further investigate the estimation
accuracy of using a genetic algorithm to determine the weight assignment for the
proposed models. Preliminary results show that the linearly and nonlinearly
weighted combinational models have better prediction capability than a single
SRGM or the equally weighted combinational model for web-based software. Keywords: genetic algorithm, software reliability growth model, weighted combination | |||
| sMash: semantic-based mashup navigation for data API network | | BIBAK | Full-Text | 1133-1134 | |
| Bin Lu; Zhaohui Wu; Yuan Ni; Guotong Xie; Chunying Zhou; Huajun Chen | |||
| With the proliferation of data APIs, it is not uncommon for users who have
no clear idea about data APIs to encounter difficulties in building mashups that
satisfy their requirements. In this paper, we present a semantic-based mashup
navigation system, sMash, that makes mashup building easy by constructing and
visualizing a real-life data API network. We build a sample network by
gathering more than 300 popular APIs and find that the relationships between
them are so complex that our system will play an important role in guiding
users and giving them inspiration to build interesting mashups easily. The system
is accessible at: http://www.dart.zju.edu.cn/mashup. Keywords: data api network, mashup navigation, semantic, social | |||
| Semantic wiki aided business process specification | | BIBAK | Full-Text | 1135-1136 | |
| Toufeeq Hussain; Rajesh Balakrishnan; Amar Viswanathan | |||
| This paper formulates a collaborative system for modeling business
applications. The system uses a Semantic Wiki to enable collaboration between
the various stakeholders involved in the design of the system and translates
the captured intelligence into business models which are used for designing a
business system. Keywords: RDF, business modeling, business process modeling language, semantic web,
software | |||
| Raise semantics at the user level for dynamic and interactive SOA-based portals | | BIBAK | Full-Text | 1137-1138 | |
| Jean-Sebastien Brunner; Patrick Gatellier | |||
| In this paper, we describe the fully dynamic semantic portal we implemented,
integrating Semantic Web technologies and Service Oriented Architecture (SOA).
The goals of the portal are twofold: first, it helps administrators easily
propose new features in the portal, using semantics to ease the orchestration
process; second, it automatically generates a customized user interface for
these scenarios. This user interface takes different devices into account and
assists end-users in using the portal, taking advantage of context awareness.
All the added value of this portal is based on core semantics defined by an
ontology. We present here the main features of this portal and how it was
implemented using state-of-the-art technologies and frameworks. Keywords: SOA, context, semantic portal, semantic web services | |||
| Towards lightweight and efficient DDOS attacks detection for web server | | BIBAK | Full-Text | 1139-1140 | |
| Yang Li; Tian-Bo Lu; Li Guo; Zhi-Hong Tian; Qin-Wu Nie | |||
| In this poster, based on our previous work in building a lightweight DDoS
(Distributed Denial-of-Services) attacks detection mechanism for web server
using TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) and
genetic algorithm based instance selection methods, we further propose a more
efficient and effective instance selection method, named E-FCM (Extend Fuzzy
C-Means). By using this method, we can obtain a much lower training time for
TCM-KNN while ensuring high detection performance. Therefore, the optimized
mechanism is more suitable for lightweight DDoS attack detection in real
network environments. Keywords: E-FCM algorithm, TCM-KNN algorithm, instance selection, web server anomaly
detection | |||
| A general framework for adaptive and online detection of web attacks | | BIBAK | Full-Text | 1141-1142 | |
| Wei Wang; Florent Masseglia; Thomas Guyet; Rene Quiniou; Marie-Odile Cordier | |||
| Detection of web attacks is an important issue in current defense-in-depth
security framework. In this paper, we propose a novel general framework for
adaptive and online detection of web attacks. The general framework can be
based on any online clustering methods. A detection model based on the
framework is able to learn online and deal with "concept drift" in web audit
data streams. Str-DBSCAN, our extension of DBSCAN to streaming data, as well as
StrAP are both used to validate the framework. The detection model based on the
framework automatically labels the web audit data and adapts to changes in normal
behavior while identifying attacks through dynamic clustering of the streaming
data. A very large set of real HTTP log data collected at our institute is
used to validate the framework and the model. The preliminary testing results
demonstrate its effectiveness. Keywords: anomaly detection, clustering, intrusion detection | |||
| PAKE-based mutual HTTP authentication for preventing phishing attacks | | BIBAK | Full-Text | 1143-1144 | |
| Yutaka Oiwa; Hiromitsu Takagi; Hajime Watanabe; Hirofumi Suzuki | |||
| We developed a new Web authentication protocol with password-based mutual
authentication which prevents various kinds of phishing attacks. This protocol
protects users' passwords against any phisher even if a dictionary attack is
employed, and prevents phishers from giving users a false sense of successful
authentication. The protocol is designed with interoperability in mind for
many recent Web applications, which require many features that current HTTP
authentication does not provide. The protocol has been proposed as an Internet
Draft submitted to the IETF, and implemented on both the server side (as an
Apache extension) and the client side (as a Mozilla-based browser and an
IE-based one). Keywords: http, mutual authentication, web systems | |||
| Inferring private information using social network data | | BIBAK | Full-Text | 1145-1146 | |
| Jack Lindamood; Raymond Heatherly; Murat Kantarcioglu; Bhavani Thuraisingham | |||
| On-line social networks, such as Facebook, are increasingly utilized by many
users. These networks allow people to publish details about themselves and
connect to their friends. Some of the information revealed inside these
networks is private and it is possible that corporations could use learning
algorithms on the released data to predict undisclosed private information. In
this paper, we explore how to launch inference attacks using released social
networking data to predict undisclosed private information about individuals.
We then explore the effectiveness of possible sanitization techniques that can
be used to combat such inference attacks under different scenarios. Keywords: inference, privacy, social networks | |||
| Privacy preserving frequency capping in internet banner advertising | | BIBAK | Full-Text | 1147-1148 | |
| Ayman Farahat | |||
| We describe an optimize-and-dispatch approach for delivering
pay-per-impression advertisements in online advertising. The platform provider
for an advertising network commits to showing advertisers' banner ads while
capping the number of advertising messages shown to a unique user as the user
transitions through the network. The traditional approach to enforcing
frequency caps has been to use cross-site cookies to track users. However,
cross-site cookies and other tracking mechanisms can infringe on user
privacy. In this paper, we propose a novel linear programming approach that
decides when to show an ad to the user based solely on the page currently
viewed by the user. We show that the frequency caps are fulfilled in
expectation. We show the efficacy of that approach using simulation results. Keywords: privacy, user model | |||
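
A generic linear-programming formulation in the spirit of the abstract above (a hedged sketch, not necessarily the paper's model): let x_{p,a} be the probability of showing ad a on a view of page p, N_p the total views of page p, n_p the expected views of page p per user, and F_a the frequency cap of ad a.

```latex
% Assumed generic formulation, not taken from the paper.
\begin{align*}
\max_{x}\quad & \sum_{p}\sum_{a} N_p\, x_{p,a}
  && \text{(expected impressions delivered)}\\
\text{s.t.}\quad
  & \sum_{a} x_{p,a} \le 1
  && \text{for every page } p \text{ (one ad per page view)}\\
  & \sum_{p} n_p\, x_{p,a} \le F_a
  && \text{for every ad } a \text{ (cap holds in expectation)}\\
  & 0 \le x_{p,a} \le 1 .
\end{align*}
```

Because the decision depends only on the page being viewed, no cross-site tracking is needed, and the cap constraint holds in expectation, which matches the claim in the abstract.
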
| Crosslanguage blog mining and trend visualisation | | BIBAK | Full-Text | 1149-1150 | |
| Andreas Juffinger; Elisabeth Lex | |||
| People use weblogs to express thoughts, present ideas and share knowledge;
therefore, weblogs are extraordinarily valuable resources for, among other things,
trend analysis. Trends are derived from the chronological sequence of blog post
count per topic. The comparison with a reference corpus allows qualitative
statements over identified trends. We propose a crosslanguage blog mining and
trend visualisation system to analyse blogs across languages and topics. The
trend visualisation facilitates the identification of trends and the comparison
with the reference news article corpus. To validate our system, we computed the
correlation between trends in blogs and news articles for a subset of blogs and
topics. The evaluation corroborated our hypothesis of a high correlation
coefficient for these subsets, supporting the correctness of our system across
different languages and topics. Keywords: blog mining, crosslanguage, trend visualisation | |||
| Crawling English-Japanese person-name transliterations from the web | | BIBAK | Full-Text | 1151-1152 | |
| Satoshi Sato | |||
| Automatic compilation of lexicons is a dream of lexicon compilers as well as
lexicon users. This paper proposes a system that crawls English-Japanese
person-name transliterations from the Web, which works as a back-end collector for
automatic compilation of a bilingual person-name lexicon. Our crawler collected
561K transliterations in five months. From them, an English-Japanese
person-name lexicon with 406K entries has been compiled by an automatic post
processing. This lexicon is much larger than other similar resources including
English-Japanese lexicon of HeiNER obtained from Wikipedia. Keywords: automatic lexicon compilation, mining transliteration pairs, person name | |||
| Near real time information mining in multilingual news | | BIBAK | Full-Text | 1153-1154 | |
| Martin Atkinson; Erik Van der Goot | |||
| This paper presents a near real-time multilingual news monitoring and
analysis system that forms the backbone of our research work. The system
integrates technologies to address the problems related to information
extraction and analysis of open source intelligence on the World Wide Web. By
chaining together different techniques in text mining, automated machine
learning and statistical analysis, we can automatically determine who, where
and, to a certain extent, what is being reported in news articles. Keywords: automated media monitoring, information mining and analysis,
multi-linguality, open source text | |||
| Mining multilingual topics from wikipedia | | BIBAK | Full-Text | 1155-1156 | |
| Xiaochuan Ni; Jian-Tao Sun; Jian Hu; Zheng Chen | |||
| In this paper, we try to leverage a large-scale and multilingual knowledge
base, Wikipedia, to help effectively analyze and organize Web information
written in different languages. Based on the observation that one Wikipedia
concept may be described by articles in different languages, we adapt an existing
topic modeling algorithm for mining multilingual topics from this knowledge
base. The extracted 'universal' topics have multiple types of representations,
with each type corresponding to one language. Accordingly, new documents of
different languages can be represented in a space using a group of universal
topics, which makes various multilingual Web applications feasible. Keywords: multilingual, topic modeling, universal-topics, wikipedia | |||
| Towards language-independent web genre detection | | BIBAK | Full-Text | 1157-1158 | |
| Philipp Scholl; Renato Domínguez García; Doreen Böhnstedt; Christoph Rensing; Ralf Steinmetz | |||
| The term web genre denotes the type of a given web resource, in contrast to
the topic of its content. In this research, we focus on recognizing the web
genres blog, wiki and forum. We present a set of features that exploit the
hierarchical structure of the web page's HTML mark-up and thus, in contrast to
related approaches, do not depend on a linguistic analysis of the page's
content. Our results show that it is possible to achieve very good accuracy
for fully language-independent detection of structured web genres. Keywords: machine learning, structural features, web genres | |||
| The web of nations | | BIBAK | Full-Text | 1159-1160 | |
| Sukwon Chung; Dungjit Shiowattana; Pavel Dmitriev; Su Chan | |||
| In this paper, we report on a large-scale study of structural differences
among the national webs. The study is based on a web-scale crawl conducted in
the summer of 2008. More specifically, we study two graphs derived from this
crawl: the nation graph, with nodes corresponding to nations and edges to
links among nations, and the host graph, with nodes corresponding to hosts and
edges to hyperlinks among pages on the hosts. Contrary to some of the
previous work [2], our results show that webs of different nations are often
very different from each other, both in terms of their internal structure, and
in terms of their connectivity with other nations. Keywords: host graph, nation graph, web graph, web structure | |||
| Cascading style sheets: a novel approach towards productive styling with today's standards | | BIBAK | Full-Text | 1161-1162 | |
| Matthias Keller; Martin Nussbaumer | |||
| In this paper we present an approach of generating Cascading Style Sheet
documents automatically if the desired effect on the content elements is
specified. While a Web user agent resolves the CSS rules and computes their
effect, our approach handles the reverse direction. We argue that this can remarkably
improve CSS productivity, since the process of CSS authoring always involves
this direction implicitly. Our approach offers a new and innovative way to
reuse chunks of markup together with their presentation. It furthermore bears
potential for the optimization and reorganization of CSS documents. We describe
the criteria for CSS code quality that guided our work, including a quantitative
indicator for the abstractness of a CSS presentation specification. An
evaluation and recomputation of the CSS for 25,000 HTML documents shows that,
with respect to these criteria, the automatically generated code comes close to
manually authored code. Keywords: CSS, HTML, abstractness factor, presentation authoring | |||
| Automatically filling form-based web interfaces with free text inputs | | BIBAK | Full-Text | 1163-1164 | |
| Guilherme A. Toda; Eli Cortez; Filipe Mesquita; Altigran S. da Silva; Edleno Moura; Marden Neubert | |||
| On the web of today the most prevalent solution for users to interact with
data-intensive applications is the use of form-based interfaces composed by
several data input fields, such as text boxes, radio buttons, pull-down lists,
check boxes, etc. Although these interfaces are popular and effective, in many
cases, free text interfaces are preferred over form-based ones. In this paper
we discuss the proposal and the implementation of a novel IR-based method for
using data-rich free text to interact with form-based interfaces. Our solution
takes free text as input, implicitly extracts data values from it, and fills the
appropriate fields with them. For this task, we rely on values of previous
submissions for each field, which are freely obtained from the usage of
form-based interfaces. Keywords: data extraction, form filling, web applications | |||
| A densitometric analysis of web template content | | BIBAK | Full-Text | 1165-1166 | |
| Christian Kohlschütter | |||
| What makes template content in the Web so special that we need to remove it?
In this paper I present a large-scale aggregate analysis of textual Web
content, corroborating statistical laws from the field of Quantitative
Linguistics. I analyze the idiosyncrasy of template content compared to regular
"full text" content and derive a simple yet suitable quantitative model. Keywords: content analysis, noise removal, template detection, template removal, web
page segmentation | |||
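
A densitometric feature of the kind discussed above can be sketched as tokens per wrapped line of a text block (an illustrative approximation only; the wrap width and tokenization are assumptions, not the paper's exact model):

```python
import textwrap

def text_density(block, wrap_width=80):
    """Number of word tokens per wrapped line of a text block.
    Template/boilerplate blocks tend to show a lower density than
    regular full-text content blocks."""
    tokens = block.split()
    lines = textwrap.wrap(block, width=wrap_width) or [""]
    return len(tokens) / len(lines)
```
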
| A flexible dialogue system for enhancing web usability | | BIBAK | Full-Text | 1167-1168 | |
| Marta Gatius; Meritxell González | |||
| In this paper, we study how the performance and usability of web dialogue
systems could be enhanced by using an appropriate representation of the
different types of knowledge involved in communication: general dialogue
mechanisms, specific domain-restricted linguistic and conceptual knowledge and
information on how well the communication process is doing. We describe the
experiments carried out to analyze how to improve this knowledge representation
in the web dialogue system we developed. Keywords: adaptive dialogue systems, evaluation, mixed-initiative dialogues, web
dialogue systems | |||
| Estimating web site readability using content extraction | | BIBAK | Full-Text | 1169-1170 | |
| Thomas Gottron; Ludger Martin | |||
| Nowadays, information is primarily searched for on the WWW. From a user
perspective, readability is an important criterion for measuring the
accessibility, and thereby the quality, of information. We show that modern
content extraction algorithms help to estimate the readability of a web
document quite accurately. Keywords: content extraction, readability, usability | |||
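
One way to make this concrete (assuming the main content has already been extracted; the crude syllable counter below is a stand-in, not the paper's method) is to apply a standard readability formula such as Flesch Reading Ease to the extracted text:

```python
import re

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease formula applied to extracted main
    content; higher scores mean easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    # very rough syllable estimate: runs of vowels per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```
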
| Web content accessibility guidelines: from 1.0 to 2.0 | | BIBAK | Full-Text | 1171-1172 | |
| Miquel Termens; Mireia Ribera; Mercè Porras; Marc Boldú; Andreu Sulé; Pilar Paris | |||
| This poster explains the changes introduced in the Web Content Accessibility
Guidelines (WCAG) 2.0 from WCAG 1.0 and proposes a checklist for adapting
existing websites. Finally, it describes the most common criticisms of the WCAG
and places them in the context of its origin and initial aims. Keywords: WCAG 1.0, WCAG 2.0, accessibility regulations | |||
| Mining cultural differences from a large number of geotagged photos | | BIBAK | Full-Text | 1173-1174 | |
| Keiji Yanai; Bingyu Qiu | |||
| We propose a novel method to detect cultural differences over the world
automatically by using a large amount of geotagged images on the photo
sharing Web sites such as Flickr. We employ the state-of-the-art object
recognition technique developed in the research community of computer vision to
mine representative photos of the given concept for representative local
regions from a large-scale unorganized collection of consumer-generated
geotagged photos. The results help us understand how objects, scenes or events
corresponding to the same given concept are visually different depending on
local regions over the world. Keywords: flickr, geotag, object recognition, representative images | |||
| An experimental study of large-scale mobile social network | | BIBAK | Full-Text | 1175-1176 | |
| Zheng-Bin Dong; Guo-Jie Song; Kun-Qing Xie; Jing-Yao Wang | |||
| A mobile social network is a typical social network in which individuals
with similar interests or commonalities converse and connect with one another
using mobile phones. Our work in this paper focuses on an experimental study of
this kind of social network supported by large-scale real mobile call data.
The main contributions are three-fold: firstly, a large-scale real mobile phone
call log for one city has been extracted from a mobile phone carrier in China
to construct a mobile social network; secondly, common features of traditional
social networks, such as power-law degree distribution and small diameter, have
been examined experimentally, confirming that the mobile social network is a
typical scale-free network and exhibits the small-world phenomenon; lastly,
unlike traditional analytical methods, important properties of the actors, such
as gender and age, have been introduced into our experiments, yielding some
interesting findings about human behavior, for example, that middle-aged people
are more active than young and old people, and that, unusually, females are
more active than males in old age. Keywords: betweenness centrality, clustering coefficient, degree distribution,
diameter, shortest path | |||
| A P2P based distributed services network for next generation mobile internet communications | | BIBAK | Full-Text | 1177-1178 | |
| Yang Li; Yi-Chuan Wu; Jian-Ying Zhang; Jin Peng; Hong-Luan Liao; Yun-Fei Zhang | |||
| In this poster, we present a novel P2P (Peer to Peer) based distributed
services network (DSN), which is a next generation operable and manageable
distributed core network architecture and functional structure, proposed by
China Mobile for telecommunication services and wireless Internet. Our
preliminary implementations of P2P VoIP (Voice over Internet Protocol) system
over DSN platform demonstrate its effectiveness and promising future. Keywords: DSN, P2P, distributed computing, mobile communication | |||
| Securely implementing open geospatial consortium web service interface standards in oracle spatial | | BIBAK | Full-Text | 1179-1180 | |
| Ning An; Raja Chatterjee; Mike Horhammer; Siva Ravada | |||
| In this paper, we briefly describe the implementation of various Open
Geospatial Consortium Web Service Interface Standards in Oracle Spatial 11g. We
highlight how we utilize Oracle's implementation of OASIS Web Services Security
(WSS) to provide a robust security framework for these OGC Web Services. We
also discuss our future direction in supporting OGC Web Service Interface
Standards. Keywords: geospatial, ogc web service interface standards, oracle spatial, security | |||
| Visualization of Geo-annotated pictures in mobile phones | | BIBAK | Full-Text | 1181-1182 | |
| Roberto Manca; Francesco Massidda; Davide Carboni | |||
| In this work, a novel mobile browser for geo-referenced pictures is
introduced and described. We use the term browser to denote a system aimed at
browsing pictures selected from a large set like Internet photo sharing
services. Three criteria are used to filter the subset of pictures to browse: the
user's current position, the user's current heading, and the user's preferences.
In this work we focus only on the first two criteria, leaving the integration of
user preferences for future developments. Keywords: GPS, arduino, compass, geo-browsing, maps, mobile photo browsing | |||
| Deducing trip related information from flickr | | BIBAK | Full-Text | 1183-1184 | |
| Adrian Popescu; Gregory Grefenstette | |||
| Uploading tourist photos is a popular activity on photo sharing platforms.
These photographs and their associated metadata (tags, geo-tags, and temporal
information) should be useful for mining information about the sites visited.
However, user-supplied metadata are often noisy and efficient filtering methods
are needed before extracting useful knowledge. We focus here on exploiting
temporal information, associated with tourist sites that appear in Flickr. From
automatically filtered sets of geo-tagged photos, we deduce answers to
questions like "how long does it take to visit a tourist attraction?" or "what
can I visit in one day in this city?" Our method is evaluated and validated by
comparing the automatically obtained visit duration times to manual
estimations. Keywords: flickr, geographical gazetteer, georeferencing, image mining, text mining,
tourist sites, visit times | |||
| Link based small sample learning for web spam detection | | BIBAK | Full-Text | 1185-1186 | |
| Guang-Gang Geng; Qiudan Li; Xinchang Zhang | |||
| Robust statistical learning based web spam detection systems often require
large amounts of labeled training data. However, labeled samples are more
difficult, expensive and time-consuming to obtain than unlabeled ones. This
paper proposes link-based semi-supervised learning algorithms to boost the
performance of a classifier, integrating traditional self-training
with topological dependency based link learning. Experiments with a few
labeled samples on the standard WEBSPAM-UK2006 benchmark show that the algorithms
are effective. Keywords: content spam, link spam, machine learning, web spam | |||
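
A hedged sketch of the general idea above, combining self-training with score smoothing over the host link graph (this is not the paper's exact algorithm; the mixing weight, classifier choice and confidence threshold are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def link_aware_self_training(X, y, adjacency, n_rounds=5, confidence=0.9):
    """Alternate classic self-training with a smoothing step that nudges
    each host's spam score toward the scores of the hosts it links to.

    X         : feature matrix, one row per host
    y         : labels, 1 = spam, 0 = normal, -1 = unlabeled
    adjacency : row-normalized link matrix between hosts
    """
    labels = y.copy()
    for _ in range(n_rounds):
        known = labels != -1
        clf = LogisticRegression(max_iter=1000).fit(X[known], labels[known])
        scores = clf.predict_proba(X)[:, 1]
        # topological dependency: smooth scores over the link graph
        scores = 0.5 * scores + 0.5 * adjacency.dot(scores)
        # self-training step: adopt confident predictions as new labels
        for i in np.where(~known)[0]:
            if scores[i] >= confidence:
                labels[i] = 1
            elif scores[i] <= 1 - confidence:
                labels[i] = 0
    return labels
```
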
| Detecting image spam using local invariant features and pyramid match kernel | | BIBAK | Full-Text | 1187-1188 | |
| Haiqiang Zuo; Weiming Hu; Ou Wu; Yunfei Chen; Guan Luo | |||
| Image spam is a new obfuscation method that spammers invented to more
effectively bypass conventional text-based spam filters. In this paper, we
extract local invariant features of images and run a one-class SVM classifier
which uses the pyramid match kernel as the kernel function to detect image
spam. Experimental results demonstrate that our algorithm is effective for
fighting image spam. Keywords: image spam, local invariant features, pyramid match kernel | |||
| Web image retrieval reranking with multi-view clustering | | BIBAK | Full-Text | 1189-1190 | |
| Mingmin Chi; Peiwu Zhang; Yingbin Zhao; Rui Feng; Xiangyang Xue | |||
| General image retrieval is often carried out by a text-based search engine,
such as Google Image Search. In this case, natural language queries are used as
input to the search engine. Usually, the user queries are quite ambiguous and
the returned results are not well organized, as the ranking is often determined
by the popularity of an image. In order to address these problems, we propose to
use both the textual and visual content of retrieved images to re-rank web
retrieval results. In particular, a machine learning technique, a multi-view clustering
algorithm is proposed to reorganize the original results provided by the
text-based search engine. Preliminary results validate the effectiveness of the
proposed framework. Keywords: multi-view clustering, reranking, web image retrieval | |||
| Characterizing web-based video sharing workloads | | BIBK | Full-Text | 1191-1192 | |
| Siddharth Mitra; Mayank Agrawal; Amit Yadav; Niklas Carlsson; Derek Eager; Anirban Mahanti | |||
Keywords: ugc, video sharing, workload characterization | |||
| Deriving music theme annotations from user tags | | BIBAK | Full-Text | 1193-1194 | |
| Kerstin Bischoff; Claudiu S. Firan; Raluca Paiu | |||
| Music theme annotations would be really beneficial for supporting retrieval,
but are often neglected by users while annotating. Thus, in order to support
users in tagging and to fill the gaps in the tag space, in this paper we
develop algorithms for recommending theme annotations. Our methods exploit
already existing user tags, the lyrics of music tracks, as well as combinations
of both. We compare the results for our recommended theme annotations against
genre and style recommendations -- a much easier and already studied task. We
evaluate the quality of our recommended tags against an expert ground truth
data set. Our results are promising and provide interesting insights into
possible extensions for music tagging systems to support music search. Keywords: collaborative tagging, high-level music descriptors, metadata enrichment,
theme tag recommendations | |||
| Tag-oriented document summarization | | BIBAK | Full-Text | 1195-1196 | |
| Junyan Zhu; Can Wang; Xiaofei He; Jiajun Bu; Chun Chen; Shujie Shang; Mingcheng Qu; Gang Lu | |||
| Social annotations on a Web document are a highly generalized description of
the topics contained in that page. Their tagging frequency indicates the varying
degrees of user attention. This makes annotations a good resource for
summarizing the multiple topics in a Web page. In this paper, we present a
tag-oriented Web document summarization approach that uses both the document
content and the tags annotated on that document. To improve summarization
performance, a new tag ranking algorithm named EigenTag is proposed in this
paper to reduce noise in tags. Meanwhile, an association mining technique is
employed to expand the tag set to tackle the sparsity problem. Experimental results show our
tag-oriented summarization has a significant improvement over those not using
tags. Keywords: document summarization, ranking, tag | |||
| Search result re-ranking based on gap between search queries and social tags | | BIBAK | Full-Text | 1197-1198 | |
| Jun Yan; Ning Liu; Elaine Qing Chang; Lei Ji; Zheng Chen | |||
| Both search engine click-through log and social annotation have been
utilized as user feedback for search result re-ranking. However, to the best of our
knowledge, no previous study has explored the correlation between these two
factors for the task of search result ranking. In this paper, we show that the
gap between search queries and social tags of the same Web page can well
reflect its user preference score. Motivated by this observation, we propose a
novel algorithm, called Query-Tag-Gap (QTG), to re-rank search results for
better user satisfaction. Intuitively, on one hand, the search users'
intentions are generally described by their queries before they read the search
results. On the other hand, the Web annotators semantically tag Web pages after
they read the content of the pages. The difference between users' recognition
of the same page before and after they read it is a good reflection of user
satisfaction. In this extended abstract, we formally define the query set and
tag set of the same page as users' pre- and post-knowledge respectively. We
empirically show the strong correlation between user satisfaction and user's
knowledge gap before and after reading the page. Based on this gap, experiments
have shown outstanding performance of our proposed QTG algorithm in search
result re-ranking. Keywords: query log, search result ranking, social tagging | |||
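
The abstract above does not specify how the query/tag gap is computed; one simple stand-in is one minus the Jaccard overlap of the two term sets (all names below are hypothetical, not the paper's definition):

```python
def query_tag_gap(queries, tags):
    """Gap between the pre-click vocabulary (queries leading to a page)
    and the post-read vocabulary (social tags on it), here measured as
    one minus the Jaccard overlap of the two term sets."""
    q_terms = {t.lower() for q in queries for t in q.split()}
    tag_terms = {t.lower() for t in tags}
    if not q_terms or not tag_terms:
        return 1.0
    overlap = len(q_terms & tag_terms)
    union = len(q_terms | tag_terms)
    return 1.0 - overlap / union
```
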
| Signaling emotion in tagclouds | | BIBAK | Full-Text | 1199-1200 | |
| Takeharu Eda; Toshio Uchiyama; Tadasu Uchiyama; Masatoshi Yoshikawa | |||
| In order to create more attractive tagclouds that get people interested in
tagged content, we propose a simple but novel tagcloud in which font size is
determined by a tag's entropy value, not by its popularity within the content. Our
method raises users' emotional interest in the content by emphasizing more
emotional tags. Our initial experiments show that emotional tagclouds attract
more attention than normal tagclouds at first look; thus they can enhance the
role of the tagcloud as a social signaller. Keywords: classification, folksonomy, tagcloud, tagging | |||
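
A minimal sketch of entropy-driven font sizing, assuming each tag's usage counts (per tagged item or per user) are available; the exact entropy definition and the size range are assumptions, since the abstract only says entropy replaces raw popularity:

```python
import math

def tag_font_sizes(tag_usage, min_pt=10, max_pt=36):
    """Map each tag to a font size scaled by the Shannon entropy of its
    usage distribution, instead of its raw frequency.

    tag_usage: dict tag -> list of usage counts (one per item or user)
    """
    def entropy(counts):
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]
        return -sum(p * math.log(p, 2) for p in probs)

    ent = {tag: entropy(counts) for tag, counts in tag_usage.items() if counts}
    lo, hi = min(ent.values()), max(ent.values())
    span = (hi - lo) or 1.0   # avoid division by zero when all entropies match
    return {tag: min_pt + (max_pt - min_pt) * (e - lo) / span
            for tag, e in ent.items()}
```
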
| Two birds with one stone: a graph-based framework for disambiguating and tagging people names in web search | | BIBAK | Full-Text | 1201-1202 | |
| Lili Jiang; Jianyong Wang; Ning An; Shengyuan Wang; Jian Zhan; Lian Li | |||
| The ever growing volume of Web data makes it increasingly challenging to
accurately find relevant information about a specific person on the Web. To
address the challenge caused by name ambiguity in Web people search, this paper
explores a novel graph-based framework to both disambiguate and tag people
entities in Web search results. Experimental results demonstrate the
effectiveness of the proposed framework in tag discovery and name
disambiguation. Keywords: clustering, name disambiguation, tagging | |||
| The value of socially tagged urls for a search engine | | BIBAK | Full-Text | 1203-1204 | |
| Santanu Kolay; Ali Dasdan | |||
| Social bookmarking has emerged as a growing source of human generated
content on the web. In essence, bookmarking involves URLs and tags on them. In
this paper, we perform a large-scale study of the usefulness of bookmarked URLs
from the top social bookmarking site Delicious. Instead of focusing on the
dimension of tags, which has been covered in the previous work, we explore
social bookmarking from the dimension of URLs. More specifically, we
investigate the Delicious URLs and their content to quantify their value to a
search engine. For their value in leading to good content, we show that the
Delicious URLs have higher quality content and more external outlinks. For
their value in satisfying users, we show that the Delicious URLs have more
clicked URLs as well as get more clicks. We suggest that based on their value,
the Delicious URLs should be used as another source of seed URLs for crawlers. Keywords: content quality, delicious, social bookmarking | |||
| The recurrence dynamics of social tagging | | BIBAK | Full-Text | 1205-1206 | |
| Dell Zhang; Robert Mao; Wei Li | |||
| How often do tags recur? How hard is it to predict tag recurrence? Which
tags are likely to recur? We try to answer these questions by analysing the
RSDC08 dataset in both individual and collective settings. Our findings
provide useful insights for the development of tag suggestion techniques and
related applications. Keywords: folksonomy, social tagging, web 2.0 | |||
| Playful tagging: folksonomy generation using online games | | BIBAK | Full-Text | 1207-1208 | |
| Markus Krause; Hidir Aras | |||
| Collaborative tagging is a powerful method for creating folksonomies that
can be used to grasp or filter user preferences and to enhance web search.
Recent research has shown that, depending on the number of users and the
quality of user-provided tags, powerful community-driven semantics or
"ontologies" can emerge, as is evident from analyses of user data from social
web applications such as del.icio.us or Flickr. Unfortunately, most web pages
carry no tags and thus no vocabulary describing the information they provide.
A common problem in web page annotation is motivating users to participate
continuously, i.e. to keep tagging. In this paper we describe our approach of
a binary verification game that embeds collaborative tagging into online games
in order to produce domain-specific folksonomies. Keywords: folksonomies, games with a purpose, human-based computation, social tagging | |||
| Identifying vertical search intention of query through social tagging propagation | | BIBAK | Full-Text | 1209-1210 | |
| Ning Liu; Jun Yan; Weiguo Fan; Qiang Yang; Zheng Chen | |||
| A pressing task in unifying vertical search engines (VSEs) with general Web
search is to identify a user's vertical search intention from the user's
query. In this paper, we propose a novel method that propagates social
annotations, i.e. user-supplied tag data, to both queries and VSEs in order to
bridge them semantically. Our proposed algorithm consists of three key steps:
query annotation, vertical annotation, and query intention identification. The
algorithm, referred to as TagQV, shows that social tags can be propagated to
represent Web objects such as queries and VSEs, in addition to Web pages.
Experiments on real Web search queries demonstrate the effectiveness of TagQV
in query intention identification. Keywords: metadata, social annotation, vertical search engine (vse) | |||
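The query-annotation step is only named in the abstract; a minimal Python sketch of one way tags could be propagated from socially annotated pages to queries through click data follows (the data-structure shapes are assumptions, not the paper's):

```python
from collections import Counter, defaultdict

def propagate_tags_to_queries(query_clicks, page_tags):
    """query_clicks: {query: {page: click_count}};
    page_tags: {page: {tag: tag_count}}.
    Pushes each clicked page's tags onto the query, weighted by clicks --
    a simplified stand-in for the paper's propagation step."""
    query_tags = defaultdict(Counter)
    for query, pages in query_clicks.items():
        for page, clicks in pages.items():
            for tag, cnt in page_tags.get(page, {}).items():
                query_tags[query][tag] += clicks * cnt
    return query_tags
```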
| Social search and discovery using a unified approach | | BIBAK | Full-Text | 1211-1212 | |
| Einat Amitay; David Carmel; Nadav Har'El; Shila Ofek-Koifman; Aya Soffer; Sivan Yogev; Nadav Golbandi | |||
| We explore new ways of improving a search engine using data from Web 2.0
applications such as blogs and social bookmarks. This data contains entities
such as documents, people and tags, and relationships between them. We propose
a simple yet effective method, based on faceted search, that treats all
entities in a unified manner: returning all of them (documents, people and
tags) on every search, and allowing all of them to be used as search terms. We
describe an implementation of such a social search engine on the intranet of a
large enterprise, and present large-scale experiments which verify the validity
of our approach. Keywords: enterprise search, faceted search, social search | |||
| Extracting community structure through relational hypergraphs | | BIBAK | Full-Text | 1213-1214 | |
| Yu-Ru Lin; Jimeng Sun; Paul Castro; Ravi Konuru; Hari Sundaram; Aisling Kelliher | |||
| Social media websites promote diverse user interaction on media objects as
well as user actions with respect to other users. The goal of this work is to
discover community structure in rich-media social networks, and to observe how
it evolves over time, through analysis of multi-relational data. The problem
is important in the enterprise domain, where extracting emergent community
structure from enterprise social media can help form new collaborative teams,
aid expertise discovery, and guide long-term enterprise reorganization. Our
approach consists of three main parts: (1) a relational hypergraph model for
capturing various social contexts and interactions; (2) a novel hypergraph
factorization method for community extraction from multi-relational social
data; (3) an online method that handles temporal evolution through incremental
hypergraph factorization. Extensive experiments on real-world enterprise data
suggest that our technique is scalable and can extract meaningful communities.
To evaluate the quality of our mining results, we use our method to predict
users' future interests. Our prediction outperforms baseline methods
(frequency counts, pLSA) by 36-250% on average, indicating the utility of
leveraging multi-relational social context with our method. Keywords: community evolution, dynamic social network analysis, non-negative tensor factorization, relational hypergraph | |||
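The paper's incremental hypergraph factorization cannot be reproduced from the abstract alone; as a rough, simplified stand-in, the sketch below builds a (user x object x action) count tensor and applies non-negative matrix factorization to its mode-1 unfolding to obtain per-user community loadings (all names, shapes, and parameters are assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

def user_community_loadings(events, n_users, n_objects, n_actions, k=5):
    """events: iterable of (user, object, action) index triples.
    Builds a 3-way interaction tensor, unfolds it along the user mode, and
    factorizes the unfolding with NMF -- a crude approximation of the
    paper's non-negative hypergraph/tensor factorization."""
    tensor = np.zeros((n_users, n_objects, n_actions))
    for u, o, a in events:
        tensor[u, o, a] += 1.0
    unfolded = tensor.reshape(n_users, -1)          # mode-1 unfolding
    model = NMF(n_components=k, init="nndsvda", max_iter=500)
    return model.fit_transform(unfolded)            # rows: user loadings
```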
| Ranking user-created contents by search user's inclination in online communities | | BIBAK | Full-Text | 1215-1216 | |
| Hyoseop Shin; Jeehoon Lee | |||
| Searching posts effectively has become an important issue in large-scale
online communities. In particular, search users with different inclinations
have different kinds of posts in mind when they search. To address this
problem, we propose a scheme for ranking posts based on search users'
inclinations. A user ranking score is employed to capture posts that are
relevant to a specific user inclination; specifically, we present a scheme
that ranks posts in terms of user expertise and popularity. Experimental
results show that different user inclinations can produce quite different
search results and that the proposed scheme achieves about 70% accuracy. Keywords: expertise, online community, popularity, post ranking, search user inclination, user ranking | |||
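The abstract does not give the exact combination of expertise and popularity; a purely illustrative Python sketch, assuming a linear mix controlled by the searcher's inclination (field names hypothetical), could look like this:

```python
def rank_posts(posts, inclination):
    """posts: list of dicts with hypothetical 'author_expertise' and
    'popularity' scores, both assumed pre-normalized to [0, 1].
    inclination: 0.0 = purely popularity-seeking, 1.0 = purely
    expertise-seeking.  Not the paper's actual scheme, just a sketch."""
    def score(post):
        return (inclination * post["author_expertise"]
                + (1.0 - inclination) * post["popularity"])
    return sorted(posts, key=score, reverse=True)
```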
| Retaining personal expression for social search | | BIBAK | Full-Text | 1217-1218 | |
| Praphul Chandra; Ajay Gupta | |||
| The Web is being used extensively for personal expression, which includes
ratings, reviews, recommendations, and blogs. This user-created content, e.g.
a book review on Amazon.com, becomes the property of the website, and the user
often does not have easy access to it. In some cases, a user's feedback may be
averaged with feedback from other users, e.g. ratings of a video. We argue
that the creator of such content needs to be able to retain (a link to) her
created content. We introduce the concept of a MEB, a user-controlled store of
such retained links. A MEB allows a user to access and share all the reviews
she has given on different websites. With this capability, users can allow
their friends to search through their feedback. Searching through one's social
network harnesses the power of social networks, where known relationships
provide the context and trust necessary to interpret feedback. Keywords: 2.0, search, social, social networks, user content, web2.0 | |||
| Discovering the staring people from social networks | | BIBAK | Full-Text | 1219-1220 | |
| Dewei Chen; Jie Tang; Juanzi Li; Lizhu Zhou | |||
| In this paper, we study the novel problem of discovering staring people from
social networks, which is concerned with finding people who are not only
authoritative but also sociable in the social network. We formalize this
problem as an optimization problem. Taking the co-author network as a case
study, we define three objective functions and propose two methods to combine
them. A genetic-algorithm-based method is further presented to solve the
problem. Experimental results show that the proposed solution can effectively
find the staring people in social networks. Keywords: social network, staring people discovery | |||
| Analysis of community structure in Wikipedia | | BIBAK | Full-Text | 1221-1222 | |
| Dmitry Lizorkin; Olena Medelyan; Maria Grineva | |||
| We present the results of a community detection analysis of the Wikipedia
graph. Distinct communities in Wikipedia contain semantically closely related
articles. The central topic of a community can be identified using PageRank,
and the extracted communities can be organized hierarchically, similarly to
the manually created Wikipedia category structure. Keywords: community detection, graph analysis, wikipedia | |||
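A minimal sketch of this pipeline in Python, assuming an undirected article link graph and networkx's modularity-based community detection (the abstract does not say which detection algorithm the authors actually used):

```python
import networkx as nx
from networkx.algorithms import community

def central_articles_per_community(G, top_n=3):
    """G: undirected networkx Graph of Wikipedia article links.
    Detects communities by greedy modularity optimization, then ranks each
    community's articles by PageRank on its own subgraph to surface the
    community's central topic."""
    results = []
    for nodes in community.greedy_modularity_communities(G):
        sub = G.subgraph(nodes)
        pagerank = nx.pagerank(sub)
        results.append(sorted(pagerank, key=pagerank.get, reverse=True)[:top_n])
    return results
```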
| Content hole search in community-type content | | BIBAK | Full-Text | 1223-1224 | |
| Akiyo Nadamoto; Eiji Aramaki; Takeshi Abekawa; Yohei Murakami | |||
| In community-type content such as blogs and SNSs, we call a user's
unawareness of information a "content hole" and the search for this
information a "content hole search." A content hole search differs from
similarity search and comes in a variety of types. In this paper, we propose
different types of content holes and define each type. We also propose an
analysis of dialogue related to community-type content and introduce content
hole search using Wikipedia as an example. Keywords: SNS, blog, community, content hole search | |||
| Searching for events in the blogosphere | | BIBAK | Full-Text | 1225-1226 | |
| Manolis Platakis; Dimitrios Kotsakos; Dimitrios Gunopulos | |||
| Over the last few years, blogs (web logs) have gained massive popularity and
have become one of the most influential web social media of our times. Every
blog post in the Blogosphere has a well-defined timestamp, which is not taken
into account by search engines. By exploiting this feature of the Blogosphere,
we attempt to discover bursty terms and correlations between them during a
time interval. We apply Kleinberg's burst-detection automaton to the extracted
titles of blog posts to discover bursty terms, we introduce a novel
representation of a term's burstiness evolution called a State Series, and we
employ a Euclidean-based distance metric to discover potential correlations
between terms without taking their context into account. We evaluate the
results by trying to match them with real-life events. Finally, we propose
some ideas for further evaluation techniques and future research in the field. Keywords: blogs, burst analysis, hot topics, information retrieval, keyword correlation, social media, text mining | |||
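Kleinberg's automaton is not reproduced here; the sketch below approximates a term's State Series with a simple threshold on its time series and compares two terms with Euclidean distance, mirroring the correlation step described in the abstract (the thresholding itself is a stand-in assumption):

```python
import numpy as np

def state_series(counts, burst_factor=2.0):
    """Label each time step 1 (bursty) or 0 by comparing the term's count to
    a multiple of its overall mean -- a thresholding stand-in for the
    two-state Kleinberg automaton used in the paper."""
    counts = np.asarray(counts, dtype=float)
    return (counts > burst_factor * counts.mean()).astype(int)

def series_distance(series_a, series_b):
    """Euclidean distance between two equal-length state series; smaller
    values suggest the terms burst at similar times."""
    return float(np.linalg.norm(np.asarray(series_a) - np.asarray(series_b)))
```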
| Ranking community answers via analogical reasoning | | BIBAK | Full-Text | 1227-1228 | |
| Xudong Tu; Xin-Jing Wang; Dan Feng; Lei Zhang | |||
| Due to the lexical gap between questions and answers, automatically
detecting right answers is very challenging for community question-answering
sites. In this paper, we propose an analogical reasoning-based method. It
treats questions and answers as relational data and ranks an answer by
measuring how analogous its link to the query is to the links embedded in
previously acquired relevant knowledge; the answer that links to the new
question in the most analogous way is assumed to be the best answer. We base
our experiments on 29.8 million Yahoo! Answers question-answer threads and
show the effectiveness of the approach. Keywords: analogical reasoning, community question answering | |||
| Probabilistic question recommendation for question answering communities | | BIBAK | Full-Text | 1229-1230 | |
| Mingcheng Qu; Guang Qiu; Xiaofei He; Cheng Zhang; Hao Wu; Jiajun Bu; Chun Chen | |||
| User-interactive question answering (QA) communities such as Yahoo! Answers
are growing in popularity. However, because these QA sites receive thousands
of new questions daily, it is difficult for users to find the questions that
interest them, which in turn can delay the answering of new questions. This
gives rise to question recommendation techniques that help users locate
interesting questions. In this paper, we adopt the Probabilistic Latent
Semantic Analysis (PLSA) model for question recommendation and propose a novel
metric to evaluate the performance of our approach. The experimental results
show that our recommendation approach is effective. Keywords: PLSA, question answering, question recommendation | |||
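PLSA itself is standard; a compact EM implementation over a user-by-question count matrix, which can then score unseen questions for a user as P(q|u) = sum_z P(q|z)P(z|u), might look like the following (the topic count, iteration count, and matrix shape are assumptions, not the paper's settings):

```python
import numpy as np

def plsa(counts, n_topics=10, n_iter=50, seed=0):
    """counts: (n_users, n_questions) nonnegative interaction matrix.
    Returns P(z|u) of shape (n_users, n_topics) and P(q|z) of shape
    (n_topics, n_questions).  Dense EM; suitable for toy data only."""
    rng = np.random.default_rng(seed)
    n_u, n_q = counts.shape
    p_z_u = rng.random((n_u, n_topics)); p_z_u /= p_z_u.sum(1, keepdims=True)
    p_q_z = rng.random((n_topics, n_q)); p_q_z /= p_q_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|u,q), shape (n_users, n_topics, n_q)
        resp = p_z_u[:, :, None] * p_q_z[None, :, :]
        resp /= resp.sum(1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * resp          # n(u,q) * P(z|u,q)
        # M-step: re-estimate P(q|z) and P(z|u)
        p_q_z = weighted.sum(0)
        p_q_z /= p_q_z.sum(1, keepdims=True) + 1e-12
        p_z_u = weighted.sum(2)
        p_z_u /= p_z_u.sum(1, keepdims=True) + 1e-12
    return p_z_u, p_q_z

# Recommendation scores for all (user, question) pairs: P(q|u) = P(z|u) @ P(q|z)
```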
| Buzz-based recommender system | | BIBAK | Full-Text | 1231-1232 | |
| Nish Parikh; Neel Sundaresan | |||
| In this paper, we describe a buzz-based recommender system built on a large
source of queries in an eCommerce application. The system detects bursts in
query trends and links these bursts to external entities such as news and
inventory information to find the queries currently in demand, which we refer
to as buzz queries. The system follows the paradigm of limited-quantity
merchandising: on a per-day basis it shows recommendations around a single
buzz query, with the intent of increasing user curiosity and improving
activity and stickiness on the site. A semantic neighborhood of the chosen
buzz query is selected, and appropriate recommendations are made on products
that relate to this neighborhood. Keywords: buzz, novelty, recommenders, serendipity | |||
| Discovering user profiles | | BIBAK | Full-Text | 1233-1234 | |
| Riddhiman Ghosh; Mohamed Dekhil | |||
| In this paper we describe techniques for the discovery and construction of
user profiles. Leveraging the emergent data web, our system addresses the
problem of sparse user profile information currently faced by both asserted
and inferred profile systems. A profile mediator, which dynamically builds the
most suitable user profile for a particular service or interaction in real
time, is employed in our prototype implementation. Keywords: information management, personalization, semantic web, user profiles | |||
| Bootstrapped extraction of class attributes | | BIBAK | Full-Text | 1235-1236 | |
| Joseph Reisinger; Marius Pasca | |||
| As an alternative to previous studies on extracting class attributes from
unstructured text, which consider either Web documents or query logs as the
source of textual data, we present a bootstrapped method that extracts class
attributes simultaneously from both sources, using a small set of seed
attributes. The method improves extraction precision and also improves
attribute relevance across 40 test classes. Keywords: attribute extraction | |||