| Personalized web exploration with task models | | BIBAK | Full-Text | 1-10 | |
| Jae-wook Ahn; Peter Brusilovsky; Daqing He; Jonathan Grady; Qi Li | |||
| Personalized Web search has emerged as one of the hottest topics for both
the Web industry and academic researchers. However, the majority of studies on
personalized search have focused on a rather simple type of search, leaving an
important research topic -- personalization in exploratory search --
under-studied. In this paper, we present a study of personalization in
task-based information exploration using a system called TaskSieve. TaskSieve
is a Web search system that utilizes a relevance feedback based profile, called
a "task model", for personalization. Its innovations include flexible and user
controlled integration of queries and task models, task-infused text snippet
generation, and on-screen visualization of task models. Through an empirical
study in which human subjects conducted task-based exploratory searches, we
demonstrate that TaskSieve pushes significantly more relevant documents to the
top of search result lists as compared to a traditional search system.
TaskSieve helps users select significantly more accurate information for their
tasks, allows them to do so with higher productivity, and is viewed more
favorably by subjects on several usability-related characteristics. Keywords: adaptive search, empirical study, personalization, task model, task-based
information exploration, user profile | |||
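A note on the mechanism: the "flexible and user controlled integration of queries and task models" suggests a per-document score-fusion step. A minimal sketch, assuming a simple linear combination (the function name, signature, and single mixing parameter are our illustration, not TaskSieve's published implementation):

```python
def fused_score(query_score: float, task_score: float, alpha: float) -> float:
    """Blend a document's relevance to the ad-hoc query with its relevance to
    the task model (relevance-feedback profile). alpha = 1.0 ranks purely by
    the query; alpha = 0.0 purely by the task model; the user sets the mix."""
    assert 0.0 <= alpha <= 1.0
    return alpha * query_score + (1.0 - alpha) * task_score
```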
| Validating the use and role of visual elements of web pages in navigation with an eye-tracking study | | BIBAK | Full-Text | 11-20 | |
| Yeliz Yesilada; Caroline Jay; Robert Stevens; Simon Harper | |||
| This paper presents an eye-tracking study that examines how people use the
visual elements of Web pages to complete certain tasks. Whilst these elements
are available to sighted users to play their role in these tasks, the same is
not true for visually disabled users. This lack of access to some visual
elements of a page means that visually disabled users are hindered in
accomplishing these tasks. Our previous work has introduced a framework that
identifies these elements and then reengineers Web pages such that these
elements can play their intended roles in an audio, as well as visual
presentation. To further improve our understanding of how these elements are
used and to validate our framework, we track the eye movements of sighted users
performing a number of different tasks. The resulting gaze data show that there
is a strong relationship between the aspects of a page that receive visual
attention and the objects identified by our framework. The study also shows
some limitations, as well as yielding information to address these
shortcomings. Perhaps the most important result is the support provided for a
particular kind of object called a Way Edge -- the visual construct used to
group content into sections. There is a significant effect of Way Edges on the
distribution of attention across tasks. This is a result that not only provides
strong evidence for the utility of re-engineering, but also has consequences
for our understanding of how people allocate attention to different parts of a
page. We speculate that the phenomenon of 'Banner Blindness' owes as much to
Way Edges as it does to colour and font size. Keywords: eye-tracking, navigation, travel, travel objects, visual disability | |||
| Improving relevance judgment of web search results with image excerpts | | BIBAK | Full-Text | 21-30 | |
| Zhiwei Li; Shuming Shi; Lei Zhang | |||
| Current web search engines return result pages containing mostly text
summaries even though the matched web pages may contain informative pictures. A
text excerpt (i.e. snippet) is generated by selecting keywords around the
matched query terms for each returned page to provide context for user's
relevance judgment. However, in many scenarios, we found that the pictures in
web pages, if selected properly, could be added into search result pages and
provide richer contextual description because a picture is worth a thousand
words. We call this new kind of summary an image excerpt. Through a
well-designed user study, we demonstrate that image excerpts help users make
much quicker relevance judgments on search results for a wide range of query
types. To implement this
idea, we propose a practicable approach to automatically generate image
excerpts in the result pages by considering the dominance of each picture in
each web page and the relevance of the picture to the query. We also outline an
efficient way to incorporate image excerpts in web search engines. Web search
engines can adopt our approach by slightly modifying their index and inserting
a few low-cost operations in their workflow. Our experiments on a large web
dataset indicate the performance of the proposed approach is very promising. Keywords: dominant image, image excerpts, usability, user interface, web search | |||
| Keysurf: a character controlled browser for people with physical disabilities | | BIBAK | Full-Text | 31-40 | |
| Leo Spalteholz; Kin Fun Li; Nigel Livingston; Foad Hamidi | |||
| For many users with a physical or motor disability, using a computer mouse
or other pointing device to navigate the web is cumbersome or impossible due to
problems with pointing accuracy. At the same time, web accessibility using a
keyboard in major browsers is rudimentary, requiring many key presses to select
links or other elements. We introduce KeySurf, a character controlled web
navigation system which addresses this situation by presenting an interface
that allows a user to activate any web page element with only two or three
keystrokes. Through an implementation of a user-centric incremental search
algorithm, elements are matched according to user expectation as characters are
entered into the interface. We show how our interface can be integrated with a
speech recognition input, as well as with specialized on-screen keyboards for
people with disabilities. Using the user's browsing history, we improve the
efficiency of the selection process and find potentially interesting page links
for the user within the current web page. We present the results from a pilot
study evaluating the performance of various components of our system. Keywords: keyboard access, web accessibility, web navigation | |||
| Query-sets: using implicit feedback and query patterns to organize web documents | | BIBAK | Full-Text | 41-50 | |
| Barbara Poblete; Ricardo Baeza-Yates | |||
| In this paper we present a new document representation model based on
implicit user feedback obtained from search engine queries. The main objective
of this model is to achieve better results in non-supervised tasks, such as
clustering and labeling, through the incorporation of usage data obtained from
search engine queries. This type of model allows us to discover the motivations
of users when visiting a certain document. The terms used in queries can
provide a better choice of features, from the user's point of view, for
summarizing the Web pages that were clicked from these queries. In this work we
extend and formalize the existing but not widely known idea of a "query view"
for document representation as the "query model". Furthermore, we create a novel
model based on "frequent query patterns" called the "query-set model". Our
evaluation shows that both "query-based" models outperform the vector-space
model when used for clustering and labeling documents in a website. In our
experiments, the query-set model reduces by more than 90% the number of
features needed to represent a set of documents and improves by over 90% the
quality of the results. We believe that this can be explained because our model
chooses better features and provides more accurate labels according to the
user's expectations. Keywords: feature selection, labeling, search engine queries, usage mining, web page
organization | |||
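As a rough illustration of the "query-set model" idea, here is a hedged Python sketch that represents each clicked document by the frequent sets of query terms that led users to it. The mining here is naive subset counting with a support threshold; the paper's actual frequent-pattern mining and feature selection are more sophisticated, and all names below are ours:

```python
from collections import Counter, defaultdict
from itertools import combinations

def query_set_features(click_log, min_support=2, max_size=3):
    """click_log: iterable of (query_string, clicked_doc_id) pairs.
    Returns doc_id -> Counter mapping each frequent query term set to its
    support; these sets serve as the document's features."""
    doc_sets = defaultdict(Counter)
    for query, doc in click_log:
        terms = tuple(sorted(set(query.lower().split())))
        for size in range(1, min(max_size, len(terms)) + 1):
            for subset in combinations(terms, size):
                doc_sets[doc][subset] += 1
    # Keep only term sets that meet the support threshold.
    return {doc: Counter({s: n for s, n in cnt.items() if n >= min_support})
            for doc, cnt in doc_sets.items()}
```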
| Mining the search trails of surfing crowds: identifying relevant websites from user activity | | BIBAK | Full-Text | 51-60 | |
| Mikhail Bilenko; Ryen W. White | |||
| The paper proposes identifying relevant information sources from the history
of combined searching and browsing behavior of many Web users. While it has
been previously shown that user interactions with search engines can be
employed to improve document ranking, browsing behavior that occurs beyond
search result pages has been largely overlooked in prior work. The paper
demonstrates that users' post-search browsing activity strongly reflects
implicit endorsement of visited pages, which allows estimating topical
relevance of Web resources by mining large-scale datasets of search trails. We
present heuristic and probabilistic algorithms that rely on such datasets for
suggesting authoritative websites for search queries. Experimental evaluation
shows that exploiting complete post-search browsing trails outperforms
alternatives in isolation (e.g., clickthrough logs), and yields accuracy
improvements when employed as a feature in learning to rank for Web search. Keywords: implicit feedback, learning from user behavior, mining search and browsing
logs, web search | |||
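To make the heuristic side concrete: score websites for a query by how many complete post-search trails visit them, not just which results were clicked. A toy sketch with assumed data structures; the paper's algorithms, including the probabilistic ones, are considerably more refined:

```python
from collections import Counter
from urllib.parse import urlparse

def site(url: str) -> str:
    return urlparse(url).netloc

def rank_sites_for_query(trails):
    """trails: list of search trails observed for one query, each a list of
    visited URLs (the search-result click plus subsequent browsing).
    Counts each website at most once per trail, treating a visit as an
    implicit endorsement."""
    counts = Counter()
    for trail in trails:
        for s in {site(u) for u in trail}:
            counts[s] += 1
    return counts.most_common()
```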
| Using the wisdom of the crowds for keyword generation | | BIBAK | Full-Text | 61-70 | |
| Ariel Fuxman; Panayiotis Tsaparas; Kannan Achan; Rakesh Agrawal | |||
| In the sponsored search model, search engines are paid by businesses that
are interested in displaying ads for their site alongside the search results.
Businesses bid for keywords, and their ad is displayed when the keyword is
queried to the search engine. An important problem in this process is 'keyword
generation': given a business that is interested in launching a campaign,
suggest keywords that are related to that campaign. We address this problem by
making use of the query logs of the search engine. We identify queries related
to a campaign by exploiting the associations between queries and URLs as they
are captured by the user's clicks. These queries form good keyword suggestions
since they capture the "wisdom of the crowd" as to what is related to a site.
We formulate the problem as a semi-supervised learning problem, and propose
algorithms within the Markov Random Field model. We perform experiments with
real query logs, and we demonstrate that our algorithms scale to large query
logs and produce meaningful results. Keywords: absorbing random walks, keyword generation, markov random fields, query
click logs, sponsored search | |||
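A minimal sketch of the query-URL association the abstract describes: starting from a campaign's landing URLs, pull in queries whose clicks land on those URLs. This single co-click hop is a crude stand-in for the paper's Markov Random Field and absorbing-random-walk machinery; the data structures and names are assumptions:

```python
from collections import defaultdict

def suggest_keywords(click_graph, campaign_urls):
    """click_graph: dict mapping query -> {url: click_count}.
    Returns queries ranked by total clicks onto the campaign's URLs."""
    url_to_queries = defaultdict(set)
    for q, urls in click_graph.items():
        for u in urls:
            url_to_queries[u].add(q)
    scores = defaultdict(float)
    for u in campaign_urls:
        for q in url_to_queries.get(u, ()):
            scores[q] += click_graph[q][u]
    return sorted(scores, key=scores.get, reverse=True)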
| Floatcascade learning for fast imbalanced web mining | | BIBAK | Full-Text | 71-80 | |
| Xiaoxun Zhang; Xueying Wang; Honglei Guo; Zhili Guo; Xian Wu; Zhong Su | |||
| This paper is concerned with the problem of Imbalanced Classification (IC)
in web mining, which often arises on the web due to the "Matthew Effect". As
web IC applications usually need to provide online services for users and deal
with large volumes of data, classification speed emerges as an important issue
to be addressed. In face detection, Asymmetric Cascade is used to speed up
imbalanced classification by building a cascade structure of simple
classifiers, but it often causes a loss of classification accuracy due to the
iterative feature addition in its learning procedure. In this paper, we adopt
the idea of cascade classifier in imbalanced web mining for fast classification
and propose a novel asymmetric cascade learning method called FloatCascade to
improve the accuracy. To this end, FloatCascade selects fewer yet more effective
features at each stage of the cascade classifier. In addition, a decision-tree
scheme is adopted to enhance feature diversity and discrimination capability
for FloatCascade learning. We evaluate FloatCascade through two typical IC
applications in web mining: web page categorization and citation matching.
Experimental results demonstrate the effectiveness and efficiency of
FloatCascade compared to state-of-the-art IC methods like Asymmetric
Cascade, Asymmetric AdaBoost and Weighted SVM. Keywords: cascade learning, citation matching, fast imbalanced classification, float
searching, web page categorization | |||
| Recommending questions using the mdl-based tree cut model | | BIBAK | Full-Text | 81-90 | |
| Yunbo Cao; Huizhong Duan; Chin-Yew Lin; Yong Yu; Hsiao-Wuen Hon | |||
| The paper is concerned with the problem of question recommendation.
Specifically, given a question as query, we are to retrieve and rank other
questions according to their likelihood of being good recommendations of the
queried question. A good recommendation provides alternative aspects around
users' interest. We tackle the problem of question recommendation in two steps:
first represent questions as graphs of topic terms, and then rank
recommendations on the basis of the graphs. We formalize both steps as
tree-cutting problems and then employ the MDL (Minimum Description Length)
principle to select the best cuts. Experiments have been conducted with real
questions posted on Yahoo! Answers. The questions are about two domains,
'travel' and 'computers & internet'. Experimental results indicate that the use
of the MDL-based tree cut model significantly outperforms the baseline
methods based on word-based or phrase-based VSM. The results also show that the
use of the MDL-based tree cut model is essential to our approach. Keywords: minimum description length, query suggestion, question recommendation, tree
cut model | |||
| Learning to classify short and sparse text & web with hidden topics from large-scale data collections | | BIBAK | Full-Text | 91-100 | |
| Xuan-Hieu Phan; Le-Minh Nguyen; Susumu Horiguchi | |||
| This paper presents a general framework for building classifiers that deal
with short and sparse text & Web segments by making the most of hidden topics
discovered from large-scale data collections. The main motivation of this work
is that many classification tasks working with short segments of text & Web,
such as search snippets, forum & chat messages, blog & news feeds, product
reviews, and book & movie summaries, fail to achieve high accuracy due to the
data sparseness. We therefore propose to draw on external knowledge to make the
data more related to each other and to expand the coverage of classifiers so
that they handle future data better. The underlying idea of the framework
is that for each classification task, we collect a large-scale external data
collection called "universal dataset", and then build a classifier on both a
(small) set of labeled training data and a rich set of hidden topics discovered
from that data collection. The framework is general enough to be applied to
different data domains and genres ranging from Web search results to medical
text. We did a careful evaluation on several hundred megabytes of Wikipedia
(30M words) and MEDLINE (18M words) with two tasks: "Web search domain
disambiguation" and "disease categorization for medical text", and achieved
significant quality enhancement. Keywords: sparse text, topic analysis, web data analysis/classification | |||
| Topic modeling with network regularization | | BIBAK | Full-Text | 101-110 | |
| Qiaozhu Mei; Deng Cai; Duo Zhang; ChengXiang Zhai | |||
| In this paper, we formally define the problem of topic modeling with network
structure (TMN). We propose a novel solution to this problem, which regularizes
a statistical topic model with a harmonic regularizer based on a graph
structure in the data. The proposed method bridges topic modeling and social
network analysis, which leverages the power of both statistical topic models
and discrete regularization. The output of this model well summarizes topics in
text, maps a topic on the network, and discovers topical communities. With
concrete selection of a topic model and a graph-based regularizer, our model
can be applied to text mining problems such as author-topic analysis, community
discovery, and spatial text mining. Empirical experiments on two different
genres of data show that our approach is effective, improving on both
text-oriented and network-oriented methods. The proposed model
is general; it can be applied to any text collections with a mixture of topics
and an associated network structure. Keywords: graph-based regularization, social networks, statistical topic models | |||
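For readers who want the shape of the "harmonic regularizer", here is a hedged reconstruction in our own notation: with a topic distribution p(theta_j|u) attached to each node u of the network G = (V, E), the regularized objective plausibly combines the topic model's log-likelihood L(C) with a graph-smoothness penalty, traded off by lambda:

```latex
O(C, G) \;=\; -(1-\lambda)\, L(C)
\;+\; \frac{\lambda}{2} \sum_{(u,v)\in E} w(u,v)
\sum_{j=1}^{k} \bigl( p(\theta_j \mid u) - p(\theta_j \mid v) \bigr)^2
```

The penalty is minimized when neighboring nodes share similar topic distributions, which is what lets the model map topics smoothly onto the network and discover topical communities.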
| Modeling online reviews with multi-grain topic models | | BIBAK | Full-Text | 111-120 | |
| Ivan Titov; Ryan McDonald | |||
| In this paper we present a novel framework for extracting the ratable
aspects of objects from online user reviews. Extracting such aspects is an
important challenge in automatically mining product opinions from the web and
in generating opinion-based summaries of user reviews [18, 19, 7, 12, 27, 36,
21]. Our models are based on extensions to standard topic modeling methods such
as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models
are more appropriate for our task since standard models tend to produce topics
that correspond to global properties of objects (e.g., the brand of a product
type) rather than the aspects of an object that tend to be rated by a user. The
models we present not only extract ratable aspects, but also cluster them into
coherent topics, e.g., 'waitress' and 'bartender' are part of the same topic
'staff' for restaurants. This differentiates our models from much of the previous work
which extracts aspects through term frequency analysis with minimal clustering.
We evaluate the multi-grain models both qualitatively and quantitatively to
show that they improve significantly upon standard topic models. Keywords: clustering, opinion mining, topic models | |||
| Opinion integration through semi-supervised topic modeling | | BIBAK | Full-Text | 121-130 | |
| Yue Lu; Chengxiang Zhai | |||
| Web 2.0 technology has enabled more and more people to freely express their
opinions on the Web, making the Web an extremely valuable source for mining
user opinions about all kinds of topics. In this paper we study how to
automatically integrate opinions expressed in a well-written expert review with
many opinions scattered across various sources such as blog spaces and forums.
We formally define this new integration problem and propose to use
semi-supervised topic models to solve the problem in a principled way.
Experiments on integrating opinions about two quite different topics (a product
and a political figure) show that the proposed method is effective for both
topics and can generate useful aligned integrated opinion summaries. The
proposed method is quite general. It can be used to integrate a well-written
review with opinions in an arbitrary text collection about any topic to
potentially support many interesting applications in multiple domains. Keywords: expert review, opinion integration, probabilistic topic modeling,
semi-supervised | |||
| Efficient similarity joins for near duplicate detection | | BIBAK | Full-Text | 131-140 | |
| Chuan Xiao; Wei Wang; Xuemin Lin; Jeffrey Xu Yu | |||
| With the increasing amount of data and the need to integrate data from
multiple data sources, a challenging issue is to find near duplicate records
efficiently. In this paper, we focus on efficient algorithms to find pairs of
records such that their similarities are above a given threshold. Several
existing algorithms rely on the prefix filtering principle to avoid computing
similarity values for all possible pairs of records. We propose new filtering
techniques by exploiting the ordering information; they are integrated into the
existing methods and drastically reduce the candidate sizes and hence improve
the efficiency. Experimental results show that our proposed algorithms can
achieve up to 2.6x -- 5x speed-up over previous algorithms on several real
datasets and provide alternative solutions to the near duplicate Web page
detection problem. Keywords: near duplicate detection, similarity join | |||
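As a rough illustration of the prefix-filtering principle these algorithms build on, a minimal Python sketch for a self-join under a Jaccard threshold (function names are ours; the paper's contribution is additional filters, e.g. exploiting token positions, that prune this candidate set much further):

```python
import math
from collections import defaultdict

def prefix_filter_join(records, t):
    """records: list of token sets; t: Jaccard threshold in (0, 1].
    Generates candidate pairs via prefix filtering, then verifies exactly."""
    # Canonical global ordering: rarest tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for r in records:
        for tok in r:
            freq[tok] += 1
    ordered = [sorted(r, key=lambda tok: (freq[tok], tok)) for r in records]

    index = defaultdict(list)      # token -> ids of records whose prefix has it
    candidates = set()
    for i, toks in enumerate(ordered):
        # Two records sharing no prefix token cannot reach Jaccard >= t.
        prefix_len = len(toks) - math.ceil(t * len(toks)) + 1
        for tok in toks[:prefix_len]:
            candidates.update((j, i) for j in index[tok])
            index[tok].append(i)

    results = []
    for j, i in candidates:        # exact verification
        a, b = records[j], records[i]
        jac = len(a & b) / len(a | b)
        if jac >= t:
            results.append((j, i, jac))
    return results
```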
| Learning multiple graphs for document recommendations | | BIBAK | Full-Text | 141-150 | |
| Ding Zhou; Shenghuo Zhu; Kai Yu; Xiaodan Song; Belle L. Tseng; Hongyuan Zha; C. Lee Giles | |||
| The Web offers rich relational data with different semantics. In this paper,
we address the problem of document recommendation in a digital library, where
the documents in question are networked by citations and are associated with
other entities by various relations. Due to the sparsity of a single graph and
noise in graph construction, we propose a new method for combining multiple
graphs to measure document similarities, where different factorization
strategies are used based on the nature of different graphs. In particular, the
new method seeks a single low-dimensional embedding of documents that captures
their relative similarities in a latent space. Based on the obtained embedding,
a new recommendation framework is developed using semi-supervised learning on
graphs. In addition, we address the scalability issue and propose an
incremental algorithm. The new incremental method significantly improves the
efficiency by calculating the embedding for new incoming documents only. The
new batch and incremental methods are evaluated on two real world datasets
prepared from CiteSeer. Experiments demonstrate significant quality improvement
for our batch method and significant efficiency improvement with tolerable
quality loss for our incremental method. Keywords: collaborative filtering, recommender systems, semi-supervised learning,
social network analysis, spectral clustering | |||
| Enhanced hierarchical classification via isotonic smoothing | | BIBAK | Full-Text | 151-160 | |
| Kunal Punera; Joydeep Ghosh | |||
| Hierarchical topic taxonomies have proliferated on the World Wide Web [5,
18], and exploiting the output space decompositions they induce in automated
classification systems is an active area of research. In many domains,
classifiers learned on a hierarchy of classes have been shown to outperform
those learned on a flat set of classes. In this paper we argue that the
hierarchical arrangement of classes leads to intuitive relationships between
the corresponding classifiers' output scores, and that enforcing these
relationships as a post-processing step after classification can improve its
accuracy. We formulate the task of smoothing classifier outputs as a
regularized isotonic tree regression problem, and present a dynamic programming
based method that solves it optimally. This new problem generalizes the classic
isotonic tree regression problem, and both the new formulation and the algorithm
might be of independent interest. In our empirical analysis of two real-world
text classification scenarios, we show that our approach to smoothing
classifier outputs results in improved classification accuracy. Keywords: dynamic programming, hierarchical classification, regularized isotonic
regression, taxonomy | |||
| Externalities in online advertising | | BIBAK | Full-Text | 161-168 | |
| Arpita Ghosh; Mohammad Mahdian | |||
| Most models for online advertising assume that an advertiser's value from
winning an ad auction, which depends on the clickthrough rate or conversion
rate of the advertisement, is independent of other advertisements served
alongside it in the same session. This ignores an important 'externality
effect': as the advertising audience has a limited attention span, a
high-quality ad on a page can detract attention from other ads on the same
page. That is, the utility to a winner in such an auction also depends on the
set of other winners.
In this paper, we introduce the problem of modeling externalities in online advertising, and study the winner determination problem in these models. Our models are based on choice models on the audience side. We show that in the most general case, the winner determination problem is hard even to approximate. However, we give an approximation algorithm for this problem with an approximation factor that is logarithmic in the ratio of the maximum to the minimum bid. Furthermore, we show that there are some interesting special cases, such as the case where the audience preferences are single peaked, where the problem can be solved exactly in polynomial time. For all these algorithms, we prove that the winner determination algorithm can be combined with VCG-style payments to yield truthful mechanisms. Keywords: advertising, approximation algorithms, auctions, externalities | |||
| A combinatorial allocation mechanism with penalties for banner advertising | | BIBAK | Full-Text | 169-178 | |
| Uriel Feige; Nicole Immorlica; Vahab Mirrokni; Hamid Nazerzadeh | |||
| Most current banner advertising is sold through negotiation, thereby
incurring large transaction costs and possibly suboptimal allocations. We
propose a new automated system for selling banner advertising. In this system,
each advertiser specifies a collection of host webpages which are relevant to
his product, a desired total quantity of impressions on these pages, and a
maximum per-impression price. The system selects a subset of advertisers as
'winners' and maps each winner to a set of impressions on pages within his
desired collection. The distinguishing feature of our system as opposed to
current combinatorial allocation mechanisms is that, mimicking the current
negotiation system, we guarantee that winners receive at least as many
advertising opportunities as they requested or else receive ample compensation
in the form of a monetary payment by the host. Such guarantees are essential in
markets like banner advertising where a major goal of the advertising campaign
is developing brand recognition.
As we show, the problem of selecting a feasible subset of advertisers with maximum total value is inapproximable. We thus present two greedy heuristics and discuss theoretical techniques to measure their performances. Our first algorithm iteratively selects advertisers and corresponding sets of impressions which contribute maximum marginal per-impression profit to the current solution. We prove a bi-criteria approximation for this algorithm, showing that it generates approximately as much value as the optimum algorithm on a slightly harder problem. However, this algorithm might perform poorly on instances in which the value of the optimum solution is quite large, a clearly undesirable failure mode. Hence, we present an adaptive greedy algorithm which again iteratively selects advertisers with maximum marginal per-impression profit, but additionally reassigns impressions at each iteration. For this algorithm, we prove a structural approximation result, a newly defined framework for evaluating heuristics [10]. We thereby prove that this algorithm has a better performance guarantee than the simple greedy algorithm. Keywords: combinatorial auctions, internet advertising, structural approximation,
supply guarantee | |||
| Dynamic cost-per-action mechanisms and applications to online advertising | | BIBAK | Full-Text | 179-188 | |
| Hamid Nazerzadeh; Amin Saberi; Rakesh Vohra | |||
| We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme
in online advertising. In this scheme, instead of paying per click, the
advertisers pay only when a user takes a specific action (e.g. fills out a
form) or completes a transaction on their websites.
We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describe a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient. In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are independent and identically-distributed random variables as well as the case where they evolve like independent reflected Brownian motions. Keywords: cost-per-action, internet advertising, mechanism design | |||
| Optimal marketing strategies over social networks | | BIBAK | Full-Text | 189-198 | |
| Jason Hartline; Vahab Mirrokni; Mukund Sundararajan | |||
| We discuss the use of social networks in implementing viral marketing
strategies. While influence maximization has been studied in this context (see
Chapter 24 of [10]), we study revenue maximization, arguably a more natural
objective. In our model, a buyer's decision to buy an item is influenced by the
set of other buyers that own the item and the price at which the item is
offered.
We focus on the algorithmic question of finding revenue-maximizing marketing strategies. When the buyers are completely symmetric, we can find the optimal marketing strategy in polynomial time. In the general case, motivated by hardness results, we investigate approximation algorithms for this problem. We identify a family of strategies called influence-and-exploit strategies that are based on the following idea: Initially influence the population by giving the item for free to a carefully chosen set of buyers. Then extract revenue from the remaining buyers using a 'greedy' pricing strategy. We first argue why such strategies are reasonable and then show how to use recently developed set-function maximization techniques to find the right set of buyers to influence. Keywords: marketing, monetizing social networks, pricing, submodular maximization | |||
| Trust-based recommendation systems: an axiomatic approach | | BIBAK | Full-Text | 199-208 | |
| Reid Andersen; Christian Borgs; Jennifer Chayes; Uriel Feige; Abraham Flaxman; Adam Kalai; Vahab Mirrokni; Moshe Tennenholtz | |||
| High-quality, personalized recommendations are a key feature in many online
systems. Since these systems often have explicit knowledge of social network
structures, the recommendations may incorporate this information. This paper
focuses on networks that represent trust and recommendation systems that
incorporate these trust relationships. The goal of a trust-based recommendation
system is to generate personalized recommendations by aggregating the opinions
of other users in the trust network.
In analogy to prior work on voting and ranking systems, we use the axiomatic approach from the theory of social choice. We develop a set of five natural axioms that a trust-based recommendation system might be expected to satisfy. Then, we show that no system can simultaneously satisfy all the axioms. However, for any subset of four of the five axioms we exhibit a recommendation system that satisfies those axioms. Next we consider various ways of weakening the axioms, one of which leads to a unique recommendation system based on random walks. We consider other recommendation systems, including systems based on personalized PageRank, majority of majorities, and minimum cuts, and search for alternative axiomatizations that uniquely characterize these systems. Finally, we determine which of these systems are incentive compatible, meaning that groups of agents interested in manipulating recommendations cannot induce others to share their opinion by lying about their votes or modifying their trust links. This is an important property for systems deployed in a monetized environment. Keywords: axiomatic approach, recommendation systems, reputation systems, trust
networks | |||
| Secure or insure?: a game-theoretic analysis of information security games | | BIBAK | Full-Text | 209-218 | |
| Jens Grossklags; Nicolas Christin; John Chuang | |||
| Despite general awareness of the importance of keeping one's system secure,
and widespread availability of consumer security technologies, actual
investment in security remains highly variable across the Internet population,
allowing attacks such as distributed denial-of-service (DDoS) and spam
distribution to continue unabated. By modeling security investment
decision-making in established (e.g., weakest-link, best-shot) and novel games
(e.g., weakest-target), and allowing expenditures in self-protection versus
self-insurance technologies, we can examine how incentives may shift between
investment in a public good (protection) and a private good (insurance),
subject to factors such as network size, type of attack, loss probability, loss
magnitude, and cost of technology. We can also characterize Nash equilibria and
social optima for different classes of attacks and defenses. In the
weakest-target game, an interesting result is that, for almost all parameter
settings, more effort is exerted at Nash equilibrium than at the social
optimum. We may attribute this to the "strategic uncertainty" of players
seeking to self-protect at just slightly above the lowest protection level. Keywords: economics of the internet, game theory, incentive-centered design and
engineering, protection, public goods, security, self-insurance | |||
| Analyzing search engine advertising: firm behavior and cross-selling in electronic markets | | BIBAK | Full-Text | 219-226 | |
| Anindya Ghose; Sha Yang | |||
| The phenomenon of sponsored search advertising is gaining ground as the
largest source of revenues for search engines. Firms across different
industries are beginning to adopt this as the primary form of online
advertising. This process works on an auction mechanism in which advertisers
bid for different keywords, and final rank for a given keyword is allocated by
the search engine. But how different are firms' actual bids from their optimal
bids? Moreover, what are other ways in which firms can potentially benefit from
sponsored search advertising? Based on the model and estimates from prior work
[10], we conduct a number of policy simulations in order to investigate to what
extent an advertiser can benefit from bidding optimally for its keywords.
Further, we build a Hierarchical Bayesian modeling framework to explore the
potential for cross-selling or spillover effects from a given keyword
advertisement across multiple product categories, and estimate the model using
Markov Chain Monte Carlo (MCMC) methods. Our analysis suggests that advertisers
are not bidding optimally with respect to maximizing profits. We conduct a
detailed analysis with product level variables to explore the extent of
cross-selling opportunities across different categories from a given keyword
advertisement. We find that there exists significant potential for
cross-selling through search keyword advertisements in that consumers often end
up buying products from other categories in addition to the product they were
searching for. Latency (the time it takes for a consumer to place a purchase
order after clicking on the advertisement) and the presence of a brand name in
the keyword are associated with consumer spending on product categories that
are different from the one they were originally searching for on the Internet. Keywords: electronic commerce, hierarchical bayesian modeling, online advertising,
paid search advertising, search engines, web 2.0 | |||
| Online learning from click data for sponsored search | | BIBAK | Full-Text | 227-236 | |
| Massimiliano Ciaramita; Vanessa Murdock; Vassilis Plachouras | |||
| Sponsored search is one of the enabling technologies for today's Web search
engines. It corresponds to matching and showing ads related to the user query
on the search engine results page. Users are likely to click on topically
related ads and the advertisers pay only when a user clicks on their ad. Hence,
it is important to be able to predict if an ad is likely to be clicked, and
maximize the number of clicks. We investigate the sponsored search problem from
a machine learning perspective with respect to three main sub-problems: how to
use click data for training and evaluation, which learning framework is more
suitable for the task, and which features are useful for existing models. We
perform a large scale evaluation based on data from a commercial Web search
engine. Results show that it is possible to learn and evaluate directly and
exclusively on click data encoding pairwise preferences following simple and
conservative assumptions. We find that online multilayer perceptron learning,
based on a small set of features representing content similarity of different
kinds, significantly outperforms an information retrieval baseline and other
learning models, providing a suitable framework for the sponsored search task. Keywords: online learning, perceptrons, ranking, sponsored search | |||
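To illustrate "learning directly on click data encoding pairwise preferences", here is a plain online pairwise perceptron sketch. The paper's best model is a multilayer perceptron; our linear version, names, and learning rate are illustrative only:

```python
import numpy as np

def pairwise_perceptron(pairs, dim, epochs=5, lr=0.1):
    """pairs: iterable of (x_clicked, x_skipped) feature vectors, each pair
    encoding the conservative preference 'the clicked ad should outrank the
    skipped ad shown near it'."""
    pairs = list(pairs)            # allow multiple passes
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            x_pos = np.asarray(x_pos, dtype=float)
            x_neg = np.asarray(x_neg, dtype=float)
            if w @ x_pos <= w @ x_neg:   # ranking mistake: update weights
                w += lr * (x_pos - x_neg)
    return w
```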
| Supporting anonymous location queries in mobile environments with privacygrid | | BIBAK | Full-Text | 237-246 | |
| Bhuvan Bamba; Ling Liu; Peter Pesti; Ting Wang | |||
| This paper presents PrivacyGrid -- a framework for supporting anonymous
location-based queries in mobile information delivery systems. The PrivacyGrid
framework offers three unique capabilities. First, it provides a location
privacy protection preference profile model, called location P3P, which allows
mobile users to explicitly define their preferred location privacy requirements
in terms of both location hiding measures (e.g., location k-anonymity and
location l-diversity) and location service quality measures (e.g., maximum
spatial resolution and maximum temporal resolution). Second, it provides fast
and effective location cloaking algorithms for location k-anonymity and
location l-diversity in a mobile environment. We develop dynamic bottom-up and
top-down grid cloaking algorithms with the goal of achieving high anonymization
success rate and efficiency in terms of both time complexity and maintenance
cost. A hybrid approach that carefully combines the strengths of both bottom-up
and top-down cloaking approaches to further reduce the average anonymization
time is also developed. Last but not least, PrivacyGrid incorporates
temporal cloaking into the location cloaking process to further increase the
success rate of location anonymization. We also discuss PrivacyGrid mechanisms
for supporting anonymous location queries. Experimental evaluation shows that
the PrivacyGrid approach can provide close to optimal location k-anonymity as
defined by per user location P3P without introducing significant performance
penalties. Keywords: k-anonymity, l-diversity, location privacy | |||
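A simplified picture of bottom-up grid cloaking: grow the cloaking box around the user's cell until it covers at least k users. The actual algorithms expand selectively to minimize the cloaked area and also handle l-diversity and temporal cloaking; this uniform-ring version, with assumed data structures, is only a sketch:

```python
def bottom_up_cloak(grid_counts, cell, k):
    """grid_counts: 2-D list of user counts per grid cell; cell: (row, col) of
    the querying user; k: required anonymity level. Grows the cloaking box one
    ring at a time until it covers at least k users; None if unsatisfiable."""
    rows, cols = len(grid_counts), len(grid_counts[0])
    r0 = r1 = cell[0]
    c0 = c1 = cell[1]

    def covered_users():
        return sum(grid_counts[r][c]
                   for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))

    while covered_users() < k:
        if (r0, c0) == (0, 0) and (r1, c1) == (rows - 1, cols - 1):
            return None                      # whole grid has fewer than k users
        r0, c0 = max(r0 - 1, 0), max(c0 - 1, 0)
        r1, c1 = min(r1 + 1, rows - 1), min(c1 + 1, cols - 1)
    return (r0, c0, r1, c1)                  # the cloaked spatial region
```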
| Learning transportation mode from raw gps data for geographic applications on the web | | BIBAK | Full-Text | 247-256 | |
| Yu Zheng; Like Liu; Longhao Wang; Xing Xie | |||
| Geographic information has spawned many novel Web applications where global
positioning system (GPS) plays important roles in bridging the applications and
end users. Learning knowledge from users' raw GPS data can provide rich context
information for both geographic and mobile applications. However, so far, raw
GPS data are still used directly without much understanding. In this paper, an
approach based on supervised learning is proposed to automatically infer
transportation mode from raw GPS data. The transportation mode, such as
walking, driving, etc., implied in a user's GPS data can provide us with
valuable knowledge to understand the user. It also enables context-aware
computing based on a user's present transportation mode and the design of
innovative user interfaces for Web users. Our approach consists of three parts: a change
point-based segmentation method, an inference model and a post-processing
algorithm based on conditional probability. The change point-based segmentation
method was compared with two baselines including uniform duration based and
uniform length based methods. Meanwhile, four different inference models
including Decision Tree, Bayesian Net, Support Vector Machine (SVM) and
Conditional Random Field (CRF) are studied in the experiments. We evaluated the
approach using GPS data collected by 45 users over a six-month period. The
change-point-based method achieved higher accuracy than the two baseline
segmentation methods in predicting transportation modes and detecting
transitions between them. Decision Tree outperformed the other inference
models over the change point based segmentation method. Keywords: GPS, classification, geographic application, machine learning,
transportation mode | |||
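For a flavor of the change-point idea, here is a toy segmenter that splits a GPS track wherever the point-to-point speed crosses a walking threshold. The paper's change-point detection and feature set (velocity, acceleration, and so on) are richer; the threshold value and all names here are assumptions:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def segment_by_walk_points(track, walk_speed=2.5):
    """track: chronologically ordered list of (lat, lon, unix_time) fixes.
    Starts a new segment whenever the speed crosses the walking threshold
    (m/s), i.e., at candidate mode-change points."""
    if len(track) < 2:
        return [track]
    segments, current, prev_walking = [], [track[0]], None
    for a, b in zip(track, track[1:]):
        dt = max(b[2] - a[2], 1e-6)
        walking = haversine_m(a[:2], b[:2]) / dt <= walk_speed
        if prev_walking is not None and walking != prev_walking:
            segments.append(current)
            current = [a]
        current.append(b)
        prev_walking = walking
    segments.append(current)
    return segments
```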
| Deciphering mobile search patterns: a study of Yahoo! mobile search queries | | BIBAK | Full-Text | 257-266 | |
| Jeonghee Yi; Farzin Maghoul; Jan Pedersen | |||
| In this paper we study the characteristics of search queries submitted from
mobile devices using various Yahoo! oneSearch applications during a two-month
period in the second half of 2007, and report the query patterns derived from
20 million English sample queries submitted by users in the US, Canada, Europe, and
Asia. We examine the query distribution and topical categories the queries
belong to in order to find new trends. We compare and contrast the search
patterns between US and international queries, and between queries from various
search interfaces (XHTML/WAP, java widgets, and SMS). We also compare our
results with previous studies wherever possible, either to confirm previous
findings, or to find interesting differences in the query distribution and
pattern. Keywords: cell phones, mobile applications, mobile devices, mobile query analysis,
mobile search, oneSearch, personal devices, query analysis, query log analysis,
wireless devices | |||
| Service-oriented data denormalization for scalable web applications | | BIBAK | Full-Text | 267-276 | |
| Zhou Wei; Jiang Dejun; Guillaume Pierre; Chi-Hung Chi; Maarten van Steen | |||
| Many techniques have been proposed to scale web applications. However, the
data interdependencies between the database queries and transactions issued by
the applications limit their efficiency. We claim that major scalability
improvements can be gained by restructuring the web application data into
multiple independent data services with exclusive access to their private data
store. While this restructuring does not provide performance gains by itself,
the implied simplification of each database workload allows a much more
efficient use of classical techniques. We illustrate the data denormalization
process on three benchmark applications: TPC-W, RUBiS and RUBBoS. We deploy the
resulting service-oriented implementation of TPC-W across an 85-node cluster
and show that restructuring its data can provide at least an order of magnitude
improvement in the maximum sustainable throughput compared to master-slave
database replication, while preserving strong consistency and transactional
properties. Keywords: data denormalization, scalability, web applications | |||
| Anycast CDNs revisited | | BIBAK | Full-Text | 277-286 | |
| Hussein A. Alzoubi; Seungjoon Lee; Michael Rabinovich; Oliver Spatscheck; Jacobus Van der Merwe | |||
| Because it is an integral part of the Internet routing apparatus, and
because it allows multiple instances of the same service to be "naturally"
discovered, IP Anycast has many attractive features for any service that
involve the replication of multiple instances across the Internet. While
briefly considered as an enabler when content distribution networks (CDNs)
first emerged, the use of IP Anycast was deemed infeasible in that environment.
The main reasons for this decision were the lack of load awareness of IP
Anycast and unwanted side effects of Internet routing changes on the IP Anycast
mechanism. Prompted by recent developments in route control technology, as well
as a better understanding of the behavior of IP Anycast in operational
settings, we revisit this decision and propose a load-aware IP Anycast CDN
architecture that addresses these concerns while benefiting from inherent IP
Anycast features. Our architecture makes use of route control mechanisms to
take server and network load into account to realize load-aware Anycast. We
show that the resulting redirection requirements can be formulated as a
Generalized Assignment Problem and present practical algorithms that address
these requirements while at the same time limiting session disruptions that
plague regular IP Anycast. We evaluate our algorithms through trace-based
simulation using traces obtained from an operational CDN network. Keywords: CDN, anycast, autonomous system, load balancing, routing | |||
| A comparative analysis of web and peer-to-peer traffic | | BIBAK | Full-Text | 287-296 | |
| Naimul Basher; Aniket Mahanti; Anirban Mahanti; Carey Williamson; Martin Arlitt | |||
| Peer-to-Peer (P2P) applications continue to grow in popularity, and have
reportedly overtaken Web applications as the single largest contributor to
Internet traffic. Using traces collected from a large edge network, we conduct
an extensive analysis of P2P traffic, compare P2P traffic with Web traffic, and
discuss the implications of increased P2P traffic. In addition to studying the
aggregate P2P traffic, we also analyze and compare the two main constituents of
P2P traffic in our data, namely BitTorrent and Gnutella. The results presented
in the paper may be used for generating synthetic workloads, gaining insights
into the functioning of P2P applications, and developing network management
strategies. For example, our results suggest that new models are necessary for
Internet traffic. As a first step, we present flow-level distributional models
for Web and P2P traffic that may be used in network simulation and emulation
experiments. Keywords: peer-to-peer, traffic characterization, traffic models, web | |||
| Generating diverse and representative image search results for landmarks | | BIBAK | Full-Text | 297-306 | |
| Lyndon S. Kennedy; Mor Naaman | |||
| Can we leverage the community-contributed collections of rich media on the
web to automatically generate representative and diverse views of the world's
landmarks? We use a combination of context- and content-based tools to generate
representative sets of images for location-driven features and landmarks, a
common search task. To do that, we use location and other metadata, as well
as tags associated with images and the images' visual features. We present an
approach to extracting tags that represent landmarks. We show how to use
unsupervised methods to extract representative views and images for each
landmark. This approach can potentially scale to provide better search and
representation for landmarks, worldwide. We evaluate the system in the context
of image search using a real-life dataset of 110,000 images from the San
Francisco area. Keywords: geo-referenced photographs, photo collections, social media | |||
| Pagerank for product image search | | BIBAK | Full-Text | 307-316 | |
| Yushi Jing; Shumeet Baluja | |||
| In this paper, we cast the image-ranking problem into the task of
identifying "authority" nodes on an inferred visual similarity graph and
propose an algorithm to analyze the visual link structure that can be created
among a group of images. Through an iterative procedure based on the PageRank
computation, a numerical weight is assigned to each image; this measures its
relative importance to the other images being considered. The incorporation of
visual signals in this process differs from the majority of large-scale
commercial-search engines in use today. Commercial search-engines often solely
rely on the text clues of the pages in which images are embedded to rank
images, and often entirely ignore the content of the images themselves as a
ranking signal. To quantify the performance of our approach in a real-world
system, we conducted a series of experiments based on the task of retrieving
images for 2000 of the most popular product queries. Our experimental results
show significant improvement, in terms of user satisfaction and relevancy, in
comparison to the most recent Google Image Search results. Keywords: graph theory, pagerank, visual similarity | |||
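The core computation the abstract describes is PageRank-style power iteration over an inferred visual-similarity graph. A generic NumPy sketch (the similarity matrix S and the damping factor are inputs we assume; this is not the authors' implementation):

```python
import numpy as np

def image_rank(S, d=0.85, tol=1e-9):
    """S: nonnegative visual-similarity matrix (n x n); d: damping factor.
    Returns each image's stationary score under a PageRank-style walk, which
    serves as its relative importance among the candidate images."""
    n = S.shape[0]
    col_sums = S.sum(axis=0)
    col_sums[col_sums == 0] = 1.0     # isolated images keep a zero column
    M = S / col_sums                   # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_next = d * (M @ r) + (1.0 - d) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
```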
| Graph theoretical framework for simultaneously integrating visual and textual features for efficient web image clustering | | BIBAK | Full-Text | 317-326 | |
| Manjeet Rege; Ming Dong; Jing Hua | |||
| With the explosive growth of Web and the recent development in digital media
technology, the number of images on the Web has grown tremendously.
Consequently, Web image clustering has emerged as an important application.
Some of the initial efforts along this direction revolved around clustering Web
images based on the visual features of images or textual features by making use
of the text surrounding the images. However, not much work has been done in
using multimodal information for clustering Web images. In this paper, we
propose a graph theoretical framework for simultaneously integrating visual and
textual features for efficient Web image clustering. Specifically, we model
visual features, images and words from surrounding text using a tripartite
graph. Partitioning this graph leads to clustering of the Web images. Although
the graph partitioning approach has been adopted before, the main contribution of
this work lies in a new algorithm that we propose -- Consistent Isoperimetric
High-order Co-clustering (CIHC), for partitioning the tripartite graph.
Computationally, CIHC is very quick as it requires a simple solution to a
sparse system of linear equations. Our theoretical analysis and extensive
experiments performed on real Web images demonstrate the performance of CIHC in
terms of the quality, efficiency and scalability in partitioning the visual
feature-image-word tripartite graph. Keywords: co-clustering, consistency, isoperimetric, web images | |||
| Flickr tag recommendation based on collective knowledge | | BIBAK | Full-Text | 327-336 | |
| Börkur Sigurbjörnsson; Roelof van Zwol | |||
| Online photo services such as Flickr and Zooomr allow users to share their
photos with family, friends, and the online community at large. An important
facet of these services is that users manually annotate their photos using so
called tags, which describe the contents of the photo or provide additional
contextual and semantic information. In this paper we investigate how we can
assist users in the tagging phase. The contribution of our research is twofold.
We analyse a representative snapshot of Flickr and present the results by means
of a tag characterisation focussing on how users tag photos and what
information is contained in the tagging. Based on this analysis, we present and
evaluate tag recommendation strategies to support the user in the photo
annotation task by recommending a set of tags that can be added to the photo.
The results of the empirical evaluation show that we can effectively recommend
relevant tags for a variety of photos with different levels of exhaustiveness
of original tagging. Keywords: aggregated tag suggestion, collective knowledge, flickr, photo annotations,
tag characterisation, tag co-occurrence, tag recommendation | |||
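One family of recommendation strategies the paper evaluates is built on tag co-occurrence aggregation. A bare-bones sketch of that flavor (the paper additionally normalizes co-occurrence statistics and promotes certain candidates; all names here are ours):

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(photos):
    """photos: iterable of tag lists. Returns tag -> Counter of co-tags."""
    co = defaultdict(Counter)
    for tags in photos:
        for a, b in permutations(set(tags), 2):
            co[a][b] += 1
    return co

def recommend(co, user_tags, top_n=5):
    """Each user-assigned tag votes for its co-occurring tags; the highest
    aggregated candidates not already on the photo are recommended."""
    votes = Counter()
    for t in user_tags:
        for cand, n in co.get(t, Counter()).items():
            if cand not in user_tags:
                votes[cand] += n
    return [t for t, _ in votes.most_common(top_n)]
```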
| Modeling anchor text and classifying queries to enhance web document retrieval | | BIBAK | Full-Text | 337-346 | |
| Atsushi Fujii | |||
| Several types of queries are widely used on the World Wide Web and the
expected retrieval method can vary depending on the query type. We propose a
method for classifying queries into informational and navigational types.
Because terms in navigational queries often appear in anchor text for links to
other pages, we analyze the distribution of query terms in anchor texts on the
Web for query classification purposes. While content-based retrieval is
effective for informational queries, anchor-based retrieval is effective for
navigational queries. Our retrieval system combines the results obtained with
the content-based and anchor-based retrieval methods, in which the weight for
each retrieval result is determined automatically depending on the result of
the query classification. We also propose a method for improving anchor-based
retrieval. Our retrieval method, which computes the probability that a document
is retrieved in response to the given query, identifies synonyms of query terms
in the anchor texts on the Web and uses these synonyms for smoothing purposes
in the probability estimation. We use the NTCIR test collections and show the
effectiveness of individual methods and the entire Web retrieval system
experimentally. Keywords: anchor text, query classification, web retrieval | |||
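To make the classification signal concrete: navigational query terms should be relatively frequent in anchor text. A toy scorer comparing anchor-text frequency against body-text frequency (the smoothing and any decision threshold are our assumptions, not the paper's exact statistic):

```python
def navigational_score(query, anchor_freq, body_freq):
    """anchor_freq / body_freq: term -> corpus frequency in anchor texts and
    page bodies respectively. Larger scores suggest a navigational query;
    a threshold tuned on labeled queries would make the final call."""
    score = 1.0
    for term in query.lower().split():
        a = anchor_freq.get(term, 0) + 1      # add-one smoothing
        b = body_freq.get(term, 0) + 1
        score *= a / b
    return score
```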
| Unsupervised query segmentation using generative language models and wikipedia | | BIBAK | Full-Text | 347-356 | |
| Bin Tan; Fuchun Peng | |||
| In this paper, we propose a novel unsupervised approach to query
segmentation, an important task in Web search. We use a generative query model
to recover a query's underlying concepts that compose its original segmented
form. The model's parameters are estimated using an expectation-maximization
(EM) algorithm, optimizing the minimum description length objective function on
a partial corpus that is specific to the query. To augment this unsupervised
learning, we incorporate evidence from Wikipedia.
Experiments show that our approach dramatically improves performance over the traditional approach that is based on mutual information, and produces comparable results with a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual information based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total of 46% improvement (from 0.530 to 0.774). Keywords: concept discovery, query segmentation | |||
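Given per-concept probabilities (however estimated, e.g., by EM with Wikipedia evidence folded in), picking the best segmentation is a small dynamic program. A generic sketch under an independence assumption between segments; it is not the paper's exact model, and the fallback probability is arbitrary:

```python
import math
from functools import lru_cache

def segment(query_terms, concept_prob, max_len=4):
    """query_terms: list of words; concept_prob: dict mapping a tuple of words
    (a candidate concept) to its probability. Returns the segmentation that
    maximizes the sum of segment log-probabilities."""
    n = len(query_terms)

    @lru_cache(None)
    def best(i):
        if i == n:
            return 0.0, ()
        candidates = []
        for j in range(i + 1, min(i + max_len, n) + 1):
            seg = tuple(query_terms[i:j])
            p = concept_prob.get(seg, 1e-9)   # tiny fallback for unseen segments
            rest_score, rest_segs = best(j)
            candidates.append((math.log(p) + rest_score, (seg,) + rest_segs))
        return max(candidates)

    return best(0)[1]
```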
| Spatial variation in search engine queries | | BIBAK | Full-Text | 357-366 | |
| Lars Backstrom; Jon Kleinberg; Ravi Kumar; Jasmine Novak | |||
| Local aspects of Web search -- associating Web content and queries with
geography -- are a topic of growing interest. However, the underlying question
of how spatial variation is manifested in search queries is still not well
understood. Here we develop a probabilistic framework for quantifying such
spatial variation; on complete Yahoo! query logs, we find that our model is
able to localize large classes of queries to within a few miles of their
natural centers based only on the distribution of activity for the query. Our
model provides not only an estimate of a query's geographic center, but also a
measure of its spatial dispersion, indicating whether it has highly local
interest or broader regional or national appeal. We also show how variations on
our model can track geographically shifting topics over time, annotate a map
with each location's "distinctive queries", and delineate the "spheres of
influence" for competing queries in the same general domain. Keywords: geolocation, web search | |||
| Genealogical trees on the web: a search engine user perspective | | BIBAK | Full-Text | 367-376 | |
| Ricardo Baeza-Yates; Álvaro Pereira; Nivio Ziviani | |||
| This paper presents an extensive study about the evolution of textual
content on the Web, which shows how some new pages are created from scratch
while others are created using already existing content. We show that a
significant fraction of the Web is a byproduct of the latter case. We introduce
the concept of Web genealogical tree, in which every page in a Web snapshot is
classified into a component. We study in detail these components,
characterizing the copies and identifying the relation between a source of
content and a search engine, by comparing page relevance measures, documents
returned by real queries performed in the past, and click-through data. We
observe that sources of copies are more frequently returned by queries and more
clicked than other documents. Keywords: content evolution, search engine, text, web, web mining | |||
| A graph-theoretic approach to webpage segmentation | | BIBAK | Full-Text | 377-386 | |
| Deepayan Chakrabarti; Ravi Kumar; Kunal Punera | |||
| We consider the problem of segmenting a webpage into visually and
semantically cohesive pieces. Our approach is based on formulating an
appropriate optimization problem on weighted graphs, where the weights capture
whether two nodes in the DOM tree should be placed together or apart in the
segmentation; we present a learning framework to learn these weights from
manually labeled data in a principled manner. Our work is a significant
departure from previous heuristic and rule-based solutions to the segmentation
problem. The results of our empirical analysis bring out interesting aspects of
our framework, including variants of the optimization problem and the role of
learning. Keywords: correlation clustering, energy minimization, graph cuts, webpage sectioning,
webpage segmentation | |||
| Performance of compressed inverted list caching in search engines | | BIBAK | Full-Text | 387-396 | |
| Jiangong Zhang; Xiaohui Long; Torsten Suel | |||
| Due to the rapid growth in the size of the web, web search engines are
facing enormous performance challenges. The larger engines in particular have
to be able to process tens of thousands of queries per second on tens of
billions of documents, making query throughput a critical issue. To satisfy
this heavy workload, search engines use a variety of performance optimizations
including index compression, caching, and early termination.
We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size. Keywords: index caching, index compression, inverted index, search engines | |||
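To make the two techniques concrete, here is a minimal sketch combining one classic list-compression scheme -- variable-byte coding of docid gaps -- with an LRU cache over the compressed lists. It is purely illustrative: the paper evaluates several stronger codecs and multiple caching policies.

```python
from collections import OrderedDict

def vbyte_encode(gaps):
    """Variable-byte coding: 7 data bits per byte; the high bit marks
    the final byte of each integer."""
    out = bytearray()
    for g in gaps:
        chunks = []
        while True:
            chunks.append(g & 0x7F)
            g >>= 7
            if not g:
                break
        out.extend(reversed(chunks[1:]))   # high-order groups first
        out.append(chunks[0] | 0x80)       # terminator byte
    return bytes(out)

def vbyte_decode(data):
    nums, cur = [], 0
    for b in data:
        if b & 0x80:
            nums.append((cur << 7) | (b & 0x7F))
            cur = 0
        else:
            cur = (cur << 7) | b
    return nums

class LRUCache:
    """Least-recently-used cache holding compressed inverted lists."""
    def __init__(self, capacity):
        self.cap, self.d = capacity, OrderedDict()
    def get(self, key):
        if key not in self.d:
            return None
        self.d.move_to_end(key)
        return self.d[key]
    def put(self, key, value):
        self.d[key] = value
        self.d.move_to_end(key)
        if len(self.d) > self.cap:
            self.d.popitem(last=False)

postings = [3, 7, 9, 30, 31]                            # ascending docids
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
cache = LRUCache(capacity=1024)
cache.put("term:web", vbyte_encode(gaps))
assert vbyte_decode(cache.get("term:web")) == gaps
```

Caching the compressed form, as here, holds more lists in a given amount of memory at the price of decompression CPU -- exactly the trade-off the paper quantifies.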
| Ranking refinement and its application to information retrieval | | BIBAK | Full-Text | 397-406 | |
| Rong Jin; Hamed Valizadegan; Hang Li | |||
| We consider the problem of ranking refinement, i.e., improving the accuracy
of an existing ranking function with a small set of labeled instances. We are
particularly interested in learning a better ranking function using two
complementary sources of information: the ranking information given by the
existing ranking function (i.e., a base ranker) and that obtained from user
feedback. This problem is very important in information retrieval, where
feedback is gradually collected. The key challenge in combining the two
sources of information arises from the fact that the ranking information
presented by the base ranker tends to be imperfect, while the ranking
information obtained from user feedback tends to be noisy. We present a novel
boosting framework for ranking refinement that can effectively leverage the
two sources of information. Our empirical study shows that the proposed algorithm is effective
for ranking refinement, and furthermore significantly outperforms the baseline
algorithms that incorporate the outputs from the base ranker as an additional
feature. Keywords: background information, boosting, incremental learning, learning to rank | |||
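For contrast with the boosting framework, the baseline approach the paper outperforms can be sketched directly: treat the base ranker's score as one feature and learn a linear scoring function from preference pairs with a pairwise logistic loss. All names and data below are hypothetical.

```python
import numpy as np

def train_pairwise(X, feedback_pairs, lr=0.1, epochs=200):
    """Learn w for score(x) = w . x from pairs (i, j) meaning 'i is
    preferred over j', by gradient ascent on the pairwise logistic
    likelihood. Column 0 of X carries the base ranker's score, so
    feedback refines rather than replaces the base ranking."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i, j in feedback_pairs:
            diff = X[i] - X[j]
            p = 1.0 / (1.0 + np.exp(-w @ diff))   # P(i beats j)
            w += lr * (1.0 - p) * diff
    return w

# Column 0 = base-ranker score, column 1 = a content feature.
X = np.array([[0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.1, 0.2]])
pairs = [(1, 0), (2, 3)]     # feedback partially contradicts the base
w = train_pairwise(X, pairs)
print(np.argsort(-(X @ w)))  # refined ranking, best first
```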
| Learning to rank relational objects and its application to web search | | BIBAK | Full-Text | 407-416 | |
| Tao Qin; Tie-Yan Liu; Xu-Dong Zhang; De-Sheng Wang; Wen-Ying Xiong; Hang Li | |||
| Learning to rank is a new statistical learning technology for creating a
ranking model that sorts objects. The technology has been successfully applied
to web search, and is becoming one of the key machineries for building search
engines. Existing approaches to learning to rank, however, do not consider
cases in which relationships exist between the objects to be ranked, despite
the fact that such situations are very common in practice. For example, in web
search, certain relationships usually exist among the documents retrieved for
a given query, e.g., URL hierarchy, similarity, etc., and sometimes it is
necessary to utilize this information when ranking the documents. This paper
addresses the issue and formulates it as a novel learning problem, referred to
as 'learning to rank relational objects'. In the new learning
task, the ranking model is defined as a function of not only the contents
(features) of objects but also the relations between objects. The paper further
focuses on one setting of the learning problem in which the way of using
relation information is predetermined. It formalizes the learning task as an
optimization problem in the setting. The paper then proposes a new method to
perform the optimization task, particularly an implementation based on SVM.
Experimental results show that the proposed method outperforms the baseline
methods for two ranking tasks (Pseudo Relevance Feedback and Topic
Distillation) in web search, indicating that the proposed method can indeed
make effective use of relation information and content information in ranking. Keywords: learning to rank relational objects, pseudo relevance feedback, relational
ranking svm, topic distillation | |||
| Contextual advertising by combining relevance with click feedback | | BIBAK | Full-Text | 417-426 | |
| Deepayan Chakrabarti; Deepak Agarwal; Vanja Josifovski | |||
| Contextual advertising supports much of the Web's ecosystem today. User
experience and revenue (shared by the site publisher and the ad network) depend
on the relevance of the displayed ads to the page content. As with other
document retrieval systems, relevance is provided by scoring the match between
individual ads (documents) and the content of the page where the ads are shown
(query). In this paper we show how this match can be improved significantly by
augmenting the ad-page scoring function with extra parameters from a logistic
regression model on the words in the pages and ads. A key property of the
proposed model is that it can be mapped to standard cosine similarity matching
and is suitable for efficient and scalable implementation over inverted
indexes. The model parameter values are learnt from logs containing ad
impressions and clicks, with shrinkage estimators being used to combat
sparsity. To scale our computations to train on an extremely large training
corpus consisting of several gigabytes of data, we parallelize our fitting
algorithm in a Hadoop framework [10]. Experimental evaluation is provided
showing improved click prediction over a holdout set of impression and click
events from a large scale real-world ad placement engine. Our best model
achieves a 25% lift in precision relative to a traditional information
retrieval model which is based on cosine similarity, for recalling 10% of the
clicks in our test data. Keywords: clickthrough rate, interaction effects, modeling | |||
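A minimal sketch of the shape of such a scoring function: a cosine-style page-ad match in which each shared word's contribution is rescaled by a learned parameter. The `word_weight` dict is a hypothetical stand-in for the fitted logistic-regression coefficients; the essential property, as in the paper, is that the score remains a sum over shared terms and so can still be evaluated over an inverted index.

```python
import math
from collections import Counter

def weighted_cosine(page_tf, ad_tf, word_weight):
    """Cosine-style match with per-word multipliers learned from click
    logs (here just a dict of hypothetical values)."""
    num = sum(page_tf[w] * ad_tf[w] * word_weight.get(w, 1.0)
              for w in page_tf.keys() & ad_tf.keys())
    den = math.sqrt(sum(v * v for v in page_tf.values())) * \
          math.sqrt(sum(v * v for v in ad_tf.values()))
    return num / den if den else 0.0

page = Counter("cheap flights to rome cheap hotels".split())
ad = Counter("cheap rome flights book now".split())
weights = {"cheap": 1.8, "rome": 1.3}   # hypothetical learned weights
print(weighted_cosine(page, ad, weights))
```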
| IRLbot: scaling to 6 billion pages and beyond | | BIBAK | Full-Text | 427-436 | |
| Hsin-Tsang Lee; Derek Leonard; Xiaoming Wang; Dmitri Loguinov | |||
| This paper shares our experience in designing a web crawler that can
download billions of pages using a single-server implementation and models its
performance. We show that with the quadratically increasing complexity of
verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting,
current crawling algorithms cannot effectively cope with the sheer volume of
URLs generated in large crawls, highly-branching spam, legitimate
multi-million-page blog sites, and infinite loops created by server-side
scripts. We offer a set of techniques for dealing with these issues and test
their performance in an implementation we call IRLbot. In our recent experiment
that lasted 41 days, IRLbot running on a single server successfully crawled 6.3
billion valid HTML pages (7.6 billion connection requests) and sustained an
average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments
with algorithms proposed in related work, this version of IRLbot did not
experience any bottlenecks and successfully handled content from over 117
million hosts, parsed out 394 billion links, and discovered a subset of the web
graph with 41 billion unique nodes. Keywords: IRLbot, crawling, large-scale | |||
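One of the failure modes named above -- highly-branching spam flooding the frontier -- is typically handled by capping each host's share of the queue. The sketch below shows that general idea with a fixed per-host budget; it is a deliberate simplification, not IRLbot's actual mechanism, which adjusts budgets dynamically using domain reputation.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class BudgetedFrontier:
    """Crawl frontier that caps how many URLs any single host may
    contribute, so one spammy host cannot dominate the queue."""
    def __init__(self, per_host_budget=3):
        self.budget = defaultdict(lambda: per_host_budget)
        self.queue = deque()

    def push(self, url):
        host = urlsplit(url).netloc
        if self.budget[host] > 0:
            self.budget[host] -= 1
            self.queue.append(url)   # a real crawler defers, not drops

    def pop(self):
        return self.queue.popleft() if self.queue else None

f = BudgetedFrontier(per_host_budget=2)
for u in ["http://spam.example/%d" % i for i in range(5)]:
    f.push(u)
f.push("http://good.example/page")
print(len(f.queue))   # 3: two from spam.example, one from good.example
```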
| Recrawl scheduling based on information longevity | | BIBAK | Full-Text | 437-446 | |
| Christopher Olston; Sandeep Pandey | |||
| It is crucial for a web crawler to distinguish between ephemeral and
persistent content. Ephemeral content (e.g., quote of the day) is usually not
worth crawling, because by the time it reaches the index it is no longer
representative of the web page from which it was acquired. On the other hand,
content that persists across multiple page updates (e.g., recent blog postings)
may be worth acquiring, because it matches the page's true content for a
sustained period of time.
In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches. Keywords: information longevity, recrawling | |||
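To illustrate how longevity can enter a recrawl policy, here is a toy priority of our own construction (not the paper's formula): weight a page's change rate by the fraction of changed content that persists, per unit crawl cost, and recrawl in priority order.

```python
def recrawl_priority(change_rate, persistent_fraction, crawl_cost=1.0):
    """Pages earn recrawls in proportion to how often they change and
    how much of the changed content persists; ephemeral churn (high
    change rate, low persistence) is deliberately down-weighted."""
    return change_rate * persistent_fraction / crawl_cost

pages = {
    "quote-of-the-day": (1.0, 0.05),  # changes daily, nothing persists
    "blog-front-page":  (0.5, 0.80),  # new posts persist for weeks
    "static-homepage":  (0.01, 1.00),
}
schedule = sorted(pages, key=lambda p: -recrawl_priority(*pages[p]))
print(schedule)   # blog-front-page first despite its lower change rate
```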
| iRobot: an intelligent crawler for web forums | | BIBAK | Full-Text | 447-456 | |
| Rui Cai; Jiang-Ming Yang; Wei Lai; Yida Wang; Lei Zhang | |||
| We study in this paper the Web forum crawling problem, a fundamental
step in many Web applications such as search engines and Web data
mining. As typical user-created content (UCC), Web forums have become an
important resource on the Web due to the rich information contributed by
millions of Internet users every day. However, Web forum crawling is not a
trivial problem, due to deep link structures, large numbers of
duplicate pages, and many invalid pages caused by login failures.
In this paper, we propose and build a prototype of an intelligent forum
crawler, iRobot, which understands the content and the
structure of a forum site and then decides how to choose traversal paths among
different kinds of pages. To do this, we first randomly sample (download) a few
pages from the target forum site, and use page content layout as the
characteristic by which to group those pre-sampled pages and reconstruct the
forum's sitemap. After that, we select an optimal crawling path that only traverses
informative pages and skips invalid and duplicate ones. The extensive
experimental results on several forums show the performance of our system in
the following aspects: 1) Effectiveness -- Compared to a generic crawler,
iRobot significantly decreases the duplicate and invalid pages; 2) Efficiency
-- With a small cost of pre-sampling a few pages for learning the necessary
knowledge, iRobot saves substantial network bandwidth and storage as it only
fetches informative pages from a forum site; and 3) Long threads that are
divided into multiple pages can be re-concatenated and archived as a whole
thread, which is of great help for further indexing and data mining. Keywords: forum crawler, sitemap construction, traversal path selection | |||
| Automatic online news issue construction in web environment | | BIBAK | Full-Text | 457-466 | |
| Canhui Wang; Min Zhang; Shaoping Ma; Liyun Ru | |||
| In many cases, rather than performing a keyword search, people simply want
to see what is going on via the Internet. This calls for integrated,
comprehensive information on news topics, which we call news issues, including the
background, history, current progress, different opinions and discussions, etc.
Traditionally, news issues are manually generated by website editors. This is
time-consuming, hard work, and hence real-time updates are difficult to
perform. In this paper, a three-step automatic online algorithm for news issue
construction is proposed. The first step is a topic detection process, in which
newly appearing stories are clustered into new topic candidates. The second
step is a topic tracking process, where those candidates are compared with
previous topics, either merged into old ones or generating a new one. In the
final step, news issues are constructed by the combination of related topics
and updated by the insertion of new topics. An automatic online news issue
construction process under practical Web circumstances is simulated to perform
news issue construction experiments. F-measure of the best results is either
above (topic detection) or close to (topic detection and tracking) 90%. Four
news issue construction results are successfully generated in different time
granularities: one meets the needs like "what's new", and the other three will
answer questions like "what's hot" or "what's going on". Through the proposed
algorithm, news issues can be effectively and automatically constructed with
real-time updates, freeing much human effort from tedious
manual work. Keywords: clustering, news issue, topic detection and tracking | |||
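The first two steps -- detecting new topic candidates and tracking them against previous topics -- can be illustrated with a single-pass cosine-similarity clusterer. This is a toy sketch with a made-up threshold, not the authors' algorithm.

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def assign_story(story_terms, topics, threshold=0.3):
    """Merge an incoming story into the most similar existing topic
    (tracking), or start a new topic candidate (detection)."""
    vec = Counter(story_terms)
    sims = [(cosine(vec, t), i) for i, t in enumerate(topics)]
    best_sim, best = max(sims, default=(0.0, None))
    if best is not None and best_sim >= threshold:
        topics[best].update(vec)
    else:
        topics.append(vec)

topics = []
for story in (["earthquake", "rescue", "aid"],
              ["earthquake", "aftershock", "aid"],
              ["election", "vote", "poll"]):
    assign_story(story, topics)
print(len(topics))   # 2 topics detected
```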
| Finding the right facts in the crowd: factoid question answering over social media | | BIBAK | Full-Text | 467-476 | |
| Jiang Bian; Yandong Liu; Eugene Agichtein; Hongyuan Zha | |||
| Community Question Answering has emerged as a popular and effective paradigm
for a wide range of information needs. For example, to find out an obscure
piece of trivia, it is now possible and even very effective to post a question
on a popular community QA site such as Yahoo! Answers, and to rely on other
users to provide answers, often within minutes. The importance of such
community QA sites is magnified as they create archives of millions of
questions and hundreds of millions of answers, many of which are invaluable for
the information needs of other searchers. However, to make this immense body of
knowledge accessible, effective answer retrieval is required. In particular, as
any user can contribute an answer to a question, the majority of the content
reflects personal, often unsubstantiated opinions. A ranking that combines both
relevance and quality is required to make such archives usable for factual
information retrieval. This task is challenging, as the structure and the
contents of community QA archives differ significantly from the web setting. To
address this problem we present a general ranking framework for factual
information retrieval from social media. Results of a large scale evaluation
demonstrate that our method is highly effective at retrieving well-formed,
factual answers to questions, as evaluated on a standard factoid QA benchmark.
We also show that our learning framework can be tuned with a minimum of
manual labeling. Finally, we provide result analysis to gain deeper
understanding of which features are significant for social media search and
retrieval. Our system can be used as a crucial building block for combining
results from a variety of social media content with general web search results,
and to better integrate social media content for effective information access. Keywords: community, question answering, ranking | |||
| Personalized interactive faceted search | | BIBAK | Full-Text | 477-486 | |
| Jonathan Koren; Yi Zhang; Xue Liu | |||
| Faceted search is becoming a popular method to allow users to interactively
search and navigate complex information spaces. A faceted search system
presents users with key-value metadata that is used for query refinement. While
faceted search is popular in e-commerce and digital libraries, little research has been
conducted on which metadata to present to a user in order to improve the search
experience, nor are there repeatable benchmarks for evaluating a faceted search
engine. This paper proposes the use of collaborative filtering and
personalization to customize the search interface to each user's behavior. This
paper also proposes a utility based framework to evaluate the faceted
interface. In order to demonstrate these ideas and better understand
personalized faceted search, several faceted search algorithms are proposed and
evaluated using the novel evaluation methodology. Keywords: collaborative recommendation, evaluation, faceted search, interactive
search, personalization, user modeling | |||
| Privacy-enhanced sharing of personal content on the web | | BIBAK | Full-Text | 487-496 | |
| Mohammad Mannan; Paul C. van Oorschot | |||
| Publishing personal content on the web is gaining increased popularity with
dramatic growth in social networking websites, and availability of cheap
personal domain names and hosting services. Although the Internet enables easy
publishing of any content intended to be generally accessible, restricting
personal content to a selected group of contacts is more difficult. Social
networking websites partially enable users to restrict access to a selected
group of users of the same network by explicitly creating a "friends' list."
While this limited restriction supports users' privacy on those (few) selected
websites, personal websites must still largely be protected manually by sharing
passwords or obscure links. Our focus is the general problem of privacy-enabled
web content sharing from any user-chosen web server. By leveraging the existing
"circle of trust" in popular Instant Messaging (IM) networks, we propose a
scheme called IM-based Privacy-Enhanced Content Sharing (IMPECS) for personal
web content sharing. IMPECS enables a publishing user's personal data to be
accessible only to her IM contacts. A user can put her personal web page on any
web server she wants (vs. being restricted to a specific social networking
website), and maintain privacy of her content without requiring site-specific
passwords. Our prototype of IMPECS required only minor modifications to an IM
server, and PHP scripts on a web server. The general idea behind IMPECS extends
beyond IM and IM circles of trust; any equivalent scheme, (ideally) containing
pre-arranged groups, could similarly be leveraged. Keywords: access control, circle of trust, privacy, sharing | |||
| Detecting image spam using visual features and near duplicate detection | | BIBAK | Full-Text | 497-506 | |
| Bhaskar Mehta; Saurabh Nangia; Manish Gupta; Wolfgang Nejdl | |||
| Email spam is a much studied topic, but even though current email spam
detection software has been gaining a competitive edge against text-based email
spam, new advances in spam generation have posed a new challenge: image-based
spam. Image-based spam is email that embeds the spam message in images
rather than in plain text. In this paper, we study the
characteristics of image spam and propose two solutions for detecting
image-based spam, drawing a comparison with existing techniques. The
first solution, which uses visual features for classification, offers an
accuracy of about 98%, i.e. an improvement of at least 6% compared to existing
solutions. SVMs (Support Vector Machines) are used to train classifiers using
judiciously chosen color, texture and shape features. The second solution
offers a novel approach for near duplication detection in images. It involves
clustering of image GMMs (Gaussian Mixture Models) based on the Agglomerative
Information Bottleneck (AIB) principle, using Jensen-Shannon divergence (JS) as
the distance measure. Keywords: email spam, image analysis, machine learning | |||
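A minimal sketch of the first solution's pipeline shape: extract simple visual features and train an SVM. The two features below (a coarse RGB histogram and a gradient-energy texture score) and the synthetic images are crude stand-ins for the paper's judiciously chosen features and real spam corpora.

```python
import numpy as np
from sklearn.svm import SVC

def simple_visual_features(img):
    """Coarse color histogram plus a gradient-energy 'texture' score
    from an RGB image array."""
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(4, 4, 4),
                             range=((0, 256),) * 3)
    gray = img.mean(axis=2)
    texture = np.abs(np.diff(gray, axis=0)).mean()
    return np.append(hist.ravel() / hist.sum(), texture)

# Synthetic demo: 'spam' images are flat fills with text-like strokes.
rng = np.random.default_rng(0)
ham = [rng.integers(0, 256, (32, 32, 3)).astype(float) for _ in range(20)]
spam = []
for _ in range(20):
    s = np.full((32, 32, 3), 255.0)
    s[8:10, :, :] = 0.0            # fake horizontal text strokes
    spam.append(s)
X = np.array([simple_visual_features(i) for i in ham + spam])
y = np.array([0] * 20 + [1] * 20)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```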
| Better abstractions for secure server-side scripting | | BIBAK | Full-Text | 507-516 | |
| Dachuan Yu; Ajay Chander; Hiroshi Inamura; Igor Serikov | |||
| It is notoriously difficult to program a solid web application. Besides
addressing web interactions, state maintenance, and whimsical user navigation
behaviors, programmers must also avoid a minefield of security vulnerabilities.
The problem is twofold. First, we lack a clear understanding of the new
computation model underlying web applications. Second, we lack proper
abstractions for hiding common and subtle coding details that are orthogonal to
the business functionalities of specific web applications.
This paper addresses both issues. First, we present a language BASS for declarative server-side scripting. BASS allows programmers to work in an ideal world, using new abstractions to tackle common but problematic aspects of web programming. The meta properties of BASS provide useful security guarantees. Second, we present a language MOSS reflecting realistic web programming concepts and scenarios, thus articulating the computation model behind web programming. Finally, we present a translation from BASS to MOSS, demonstrating how the ideal programming model and security guarantees of BASS can be implemented in practice. Keywords: server-side scripting, web application security | |||
| Sessionlock: securing web sessions against eavesdropping | | BIBAK | Full-Text | 517-524 | |
| Ben Adida | |||
| Typical web sessions can be hijacked easily by a network eavesdropper in
attacks that have come to be designated "sidejacking." The rise of ubiquitous
wireless networks, often unprotected at the transport layer, has significantly
aggravated this problem. While SSL can protect against eavesdropping, its
usability disadvantages often make it unsuitable when the data is not
considered highly confidential. Most web-based email services, for example, use
SSL only on their login page and are thus vulnerable to sidejacking.
We propose SessionLock, a simple approach to securing web sessions against eavesdropping without extending the use of SSL. SessionLock is easily implemented by web developers using only JavaScript and simple server-side logic. Its performance impact is negligible, and all major web browsers are supported. Interestingly, it is particularly easy to implement on single-page AJAX web applications, e.g. Gmail or Yahoo mail, with approximately 200 lines of JavaScript and 60 lines of server-side verification code. Keywords: web security | |||
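The server-side half of such a scheme can be sketched as HMAC verification: the per-session secret is delivered once over SSL, kept out of subsequent plain-HTTP traffic (SessionLock parks it in the URL fragment, which browsers never transmit), and each request carries a keyed hash instead. The message format and skew window below are illustrative assumptions, not SessionLock's exact protocol.

```python
import hashlib, hmac, time

def sign_request(secret, method, url, timestamp):
    msg = f"{method}|{url}|{timestamp}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify_request(secret, method, url, timestamp, token, max_skew=300):
    """Reject stale timestamps, then compare HMACs in constant time.
    An eavesdropper sees only tokens that expire within max_skew and
    never the session secret itself."""
    if abs(time.time() - timestamp) > max_skew:
        return False
    expected = sign_request(secret, method, url, timestamp)
    return hmac.compare_digest(expected, token)

secret = b"per-session-secret-delivered-over-ssl"
ts = int(time.time())
token = sign_request(secret, "GET", "/inbox", ts)
print(verify_request(secret, "GET", "/inbox", ts, token))   # True
```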
| Forcehttps: protecting high-security web sites from network attacks | | BIBAK | Full-Text | 525-534 | |
| Collin Jackson; Adam Barth | |||
| As wireless networks proliferate, web browsers operate in an increasingly
hostile network environment. The HTTPS protocol has the potential to protect
web users from network attackers, but real-world deployments must cope with
misconfigured servers, causing imperfect web sites and users to compromise
browsing sessions inadvertently. ForceHTTPS is a simple browser security
mechanism that web sites or users can use to opt in to stricter error
processing, improving the security of HTTPS by preventing network attacks that
leverage the browser's lax error processing. By augmenting the browser with a
database of custom URL rewrite rules, ForceHTTPS allows sophisticated users to
transparently retrofit security onto some insecure sites that support HTTPS. We
provide a prototype implementation of ForceHTTPS as a Firefox browser
extension. Keywords: HTTPS, eavesdropping, pharming, same-origin policy | |||
| SMash: secure component model for cross-domain mashups on unmodified browsers | | BIBAK | Full-Text | 535-544 | |
| Frederik De Keukelaere; Sumeer Bhola; Michael Steiner; Suresh Chari; Sachiko Yoshihama | |||
| Mashup applications mix and merge content (data and code) from multiple
content providers in a user's browser, to provide high-value web applications
that can rival the user experience provided by desktop applications. Current
browser security models were not designed to support such applications and they
are therefore implemented with insecure workarounds. In this paper, we present
a secure component model, where components are provided by different trust
domains, and can interact using a communication abstraction that allows ease of
specification of a security policy. We have developed an implementation of this
model that works currently in all major browsers, and addresses challenges of
communication integrity and frame-phishing. An evaluation of the performance of
our implementation shows that this approach is not just feasible but also
practical. Keywords: browser, component model, mashup, phishing, web 2.0 | |||
| Compoweb: a component-oriented web architecture | | BIBAK | Full-Text | 545-554 | |
| Rui Guo; Bin B. Zhu; Min Feng; Aimin Pan; Bosheng Zhou | |||
| In this paper, client-side Web mashups are studied from a component-oriented
perspective, and CompoWeb, a component-oriented Web architecture, is proposed.
In CompoWeb, a Web application is decomposed into Web components called
gadgets. A gadget is an abstraction of a functional or logical Web component. It
is isolated from other gadgets for security and reliability; contract-based
channels are the only way gadgets can interact with one another. An abstraction of
the contract-based channels supported or required by a gadget is also presented. It
enables binding of gadgets at deployment, and promotes interchangeable gadgets.
Unlike the model of a normal function call where the function logic is executed
in caller's context, CompoWeb ensures that the function logic is executed in
callee's context so that both the caller and callee are protected.
Implementation of a prototype CompoWeb system and its performance are also
presented. Keywords: browser, component, delayed-binding, encapsulation, interface, isolation,
mashup, protection, reuse, same-origin policy, security, web | |||
| Structured objects in owl: representation and reasoning | | BIBAK | Full-Text | 555-564 | |
| Boris Motik; Bernardo Cuenca Grau; Ulrike Sattler | |||
| Applications of semantic technologies often require the representation of
and reasoning with structured objects -- that is, objects composed of parts
connected in complex ways. Although OWL is a general and powerful language, its
class descriptions and axioms cannot be used to describe arbitrarily connected
structures. An OWL representation of structured objects can thus be
underconstrained, which reduces the inferences that can be drawn and causes
performance problems in reasoning. To address these problems, we extend OWL
with description graphs, which allow for the description of structured objects
in a simple and precise way. To represent conditional aspects of the domain, we
also allow for SWRL-like rules over description graphs. Based on an observation
about the nature of structured objects, we ensure decidability of our
formalism. We also present a hypertableau-based decision procedure, which we
implemented in the HermiT reasoner. To evaluate its performance, we have
extracted description graphs from the GALEN and FMA ontologies, classified them
successfully, and even detected a modeling error in GALEN. Keywords: description logics, owl, structured objects | |||
| Computing minimum cost diagnoses to repair populated DL-based ontologies | | BIBAK | Full-Text | 565-574 | |
| Jianfeng Du; Yi-Dong Shen | |||
| Ontology population is prone to cause inconsistency because the populating
process is imprecise or the populated data may conflict with the original data.
By assuming that the intensional part of the populated DL-based ontology is
fixed and each removable ABox assertion is given a removal cost, we repair the
ontology by deleting a subset of removable ABox assertions in which the sum of
removal costs is minimum. We call such subset a minimum cost diagnosis. We show
that, unless P=NP, the problem of finding a minimum cost diagnosis for a
DL-Lite ontology is not solvable in PTIME w.r.t. data complexity. In spite of
that, we present a feasible computational method for more general (i.e. SHIQ)
ontologies. It transforms a SHIQ ontology into a set of disjoint propositional
programs, thus reducing the original problem to a set of independent
subproblems. Each such subproblem computes an optimal model and is solvable
with a logarithmic number of calls to a SAT solver. Experimental results show that the method
can handle moderately complex ontologies with over thousands of ABox
assertions, where all ABox assertions can be assumed removable. Keywords: description logics, diagnosis, disjunctive datalog, ontologies | |||
| Scalable querying services over fuzzy ontologies | | BIBAK | Full-Text | 575-584 | |
| Jeff Z. Pan; Giorgos Stamou; Giorgos Stoilos; Stuart Taylor; Edward Thomas | |||
| Fuzzy ontologies are envisioned to be useful in the Semantic Web. Existing
fuzzy ontology reasoners are not scalable enough to handle the scale of data
that the Web provides. In this paper, we propose a framework of fuzzy query
languages for fuzzy ontologies, and present query answering algorithms for
these query languages over fuzzy DL-Lite ontologies. Moreover, this paper
reports on the implementation of our approach in the fuzzy DL-Lite query engine of
the ONTOSEARCH2 system, together with preliminary, but encouraging, benchmarking results.
To the best of our knowledge, this is the first ever scalable query engine for
fuzzy ontologies. Keywords: fuzzy SPARQL, fuzzy ontology, lightweight ontology language, scalable query
answering, semantic web | |||
| Networked graphs: a declarative mechanism for SPARQL rules, SPARQL views and RDF data integration on the web | | BIBAK | Full-Text | 585-594 | |
| Simon Schenk; Steffen Staab | |||
| Easy reuse and integration of declaratively described information in a
distributed setting is one of the main motivations for building the Semantic
Web. Despite this claim, reuse and recombination of RDF data today is mostly
done using data replication and procedural code. A simple declarative mechanism
for reusing and combining RDF data would help users to generate content for the
semantic web. Having such a mechanism, the Semantic Web could better benefit
from user generated content, as it is broadly present in the so called Web 2.0,
but also from better linkage of existing content.
We propose Networked Graphs, which allow users to define RDF graphs both by extensionally listing content and by using views on other graphs. These views can be used to include parts of other graphs, to transform data before including it, and to denote rules. The relationships between graphs are described declaratively using SPARQL queries and an extension of the SPARQL semantics. Networked Graphs are easily exchangeable between and interpretable on different computers. Using existing protocols, Networked Graphs can be evaluated in a distributed setting. Keywords: SPARQL, distributed rules, rules, semantic web, views, well founded
semantics | |||
| SPARQL basic graph pattern optimization using selectivity estimation | | BIBAK | Full-Text | 595-604 | |
| Markus Stocker; Andy Seaborne; Abraham Bernstein; Christoph Kiefer; Dave Reynolds | |||
| In this paper, we formalize the problem of Basic Graph Pattern (BGP)
optimization for SPARQL queries and main memory graph implementations of RDF
data. We define and analyze the characteristics of heuristics for
selectivity-based static BGP optimization. The heuristics range from simple
triple pattern variable counting to more sophisticated selectivity estimation
techniques. Customized summary statistics for RDF data enable the selectivity
estimation of joined triple patterns and the development of efficient
heuristics. Using the Lehigh University Benchmark (LUBM), we evaluate the
performance of the heuristics on the queries provided by the LUBM and discuss
some of them in more detail. Keywords: SPARQL, query optimization, selectivity estimation | |||
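The simplest heuristic in the family studied -- triple pattern variable counting -- can be sketched directly: score each pattern by which positions are unbound and execute the most selective first. The position weights below are illustrative assumptions; the paper's stronger heuristics replace such syntactic proxies with selectivity estimates from precomputed statistics.

```python
def variable_count_score(pattern):
    """Lower score = assumed more selective = executed earlier. A bound
    subject is weighted as more selective than a bound object, which in
    turn beats a bound predicate (hypothetical weights)."""
    is_var = lambda term: term.startswith("?")
    s, p, o = pattern
    return (4.0 if is_var(s) else 0.0) + \
           (2.0 if is_var(o) else 0.0) + \
           (1.0 if is_var(p) else 0.0)

bgp = [("?x", "rdf:type", "ub:Student"),
       ("?x", "ub:takesCourse", "?c"),
       ("<http://ex.org/course42>", "ub:name", "?n")]
for pat in sorted(bgp, key=variable_count_score):
    print(pat)
```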
| Scaling RDF with Time | | BIBAK | Full-Text | 605-614 | |
| Andrea Pugliese; Octavian Udrea; V. S. Subrahmanian | |||
| The World Wide Web Consortium's RDF standard primarily consists of (subject,
property, object) triples that specify the value that a given subject has for a
given property. However, it is frequently the case that even for a fixed
subject and property, the value varies with time. As a consequence, efforts
have been made to annotate RDF triples with "valid time" intervals. However, to
date, no proposals exist for efficient indexing of such temporal RDF databases.
It is clearly beneficial to store RDF data in a relational DB -- however,
standard relational indexes are inadequately equipped to handle RDF's graph
structure. In this paper, we propose the tGRIN index structure that builds a
specialized index for temporal RDF that is physically stored in an RDBMS. Past
efforts to store RDF in relational stores include Jena2 from HP, Sesame from
OpenRDF.org, and 3store from the University of Southampton. We show that even
when these efforts are augmented with well known temporal indexes like R+
trees, SR-trees, ST-index, and MAP21, the tGRIN index exhibits superior
performance. In terms of index build time, tGRIN takes two thirds or less of
the time used by any other system, and it uses a comparable amount of memory
and less disk space than Jena, Sesame and 3store. More importantly, tGRIN can
answer queries three to six times faster for average query graph patterns and
five to ten times faster for complex queries than these systems. Keywords: RDF indexing, resource description framework, temporal RDF | |||
| Wiki content templating | | BIBAK | Full-Text | 615-624 | |
| Angelo Di Iorio; Fabio Vitali; Stefano Zacchiroli | |||
| Wiki content templating enables reuse of content structures among wiki
pages. In this paper we present a thorough study of this widespread feature,
showing how its two state-of-the-art models (functional and creational
templating) are sub-optimal. We then propose a third, better model called
lightly constrained (LC) templating and show its implementation in the Moin
wiki engine. We also show how LC templating implementations are the appropriate
technologies to push forward semantically rich web pages along the lines of
(lowercase) semantic web and microformats. Keywords: microformats, semantic web, template, wiki engine | |||
| Querying for meta knowledge | | BIBAK | Full-Text | 625-634 | |
| Bernhard Schueler; Sergej Sizov; Steffen Staab; Duc Thanh Tran | |||
| The Semantic Web is based on accessing and reusing RDF data from many
different sources, to which one may assign different levels of authority and
credibility. Existing Semantic Web query languages, like SPARQL, have targeted
the retrieval, combination and reuse of facts, but have so far ignored all
aspects of meta knowledge, such as origins, authorship, recency or certainty of
data, to name but a few.
In this paper, we present an original, generic, formalized and implemented approach for managing many dimensions of meta knowledge, like source, authorship, certainty and others. The approach re-uses existing RDF modeling possibilities in order to represent meta knowledge. Then, it extends SPARQL query processing in such a way that given a SPARQL query for data, one may request meta knowledge without modifying the original query. Thus, our approach achieves highly flexible and automatically coordinated querying for data and meta knowledge, while completely separating the two areas of concern. Keywords: RDF, SPARQL, semantic web | |||
| Automatically refining the wikipedia infobox ontology | | BIBAK | Full-Text | 635-644 | |
| Fei Wu; Daniel S. Weld | |||
| The combined efforts of human volunteers have recently extracted numerous
facts from Wikipedia, storing them as machine-harvestable
object-attribute-value triples in Wikipedia infoboxes. Machine learning
systems, such as Kylin, use these infoboxes as training data, accurately
extracting even more semantic knowledge from natural language text. But in
order to realize the full power of this information, it must be situated in a
cleanly-structured ontology. This paper introduces KOG, an autonomous system
for refining Wikipedia's infobox-class ontology towards this end. We cast the
problem of ontology refinement as a machine learning problem and solve it using
both SVMs and a more powerful joint-inference approach expressed in Markov
Logic Networks. We present experiments demonstrating the superiority of the
joint-inference approach and evaluating other aspects of our system. Using
these techniques, we build a rich ontology, integrating Wikipedia's
infobox-class schemata with WordNet. We demonstrate how the resulting ontology
may be used to enhance Wikipedia with improved query processing and other
features. Keywords: markov logic networks, ontology, semantic web, wikipedia | |||
| Statistical analysis of the social network and discussion threads in slashdot | | BIBAK | Full-Text | 645-654 | |
| Vicenç Gómez; Andreas Kaltenbrunner; Vicente López | |||
| We analyze the social network emerging from the user comment activity on the
website Slashdot. The network presents common features of traditional social
networks such as a giant component, small average path length and high
clustering, but differs from them in showing moderate reciprocity and neutral
assortativity by degree. Using Kolmogorov-Smirnov statistical tests, we show
that the degree distributions are better explained by log-normal instead of
power-law distributions. We also study the structure of discussion threads
using an intuitive radial tree representation. Threads show strong
heterogeneity and self-similarity throughout the different nesting levels of a
conversation. We use these results to propose a simple measure to evaluate the
degree of controversy provoked by a post. Keywords: bulletin board, h-index, log-normal, online communities, power-law, radial
tree, social networks, thread, weblogs | |||
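The distributional comparison can be reproduced in miniature with scipy: fit both candidates to a degree sample and compare Kolmogorov-Smirnov statistics, the smaller statistic indicating the better fit. The data here are synthetic; the paper applies this to the actual Slashdot degree distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
degrees = rng.lognormal(mean=1.5, sigma=1.0, size=5000)  # stand-in data

ln_params = stats.lognorm.fit(degrees, floc=0)
pl_params = stats.pareto.fit(degrees, floc=0)
ks_ln = stats.kstest(degrees, "lognorm", args=ln_params).statistic
ks_pl = stats.kstest(degrees, "pareto", args=pl_params).statistic
print(f"log-normal KS={ks_ln:.4f}  power-law KS={ks_pl:.4f}")
```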
| Yes, there is a correlation: from social networks to personal behavior on the web | | BIBAK | Full-Text | 655-664 | |
| Parag Singla; Matthew Richardson | |||
| Characterizing the relationship that exists between a person's social group
and his/her personal behavior has been a long standing goal of social network
analysts. In this paper, we apply data mining techniques to study this
relationship for a population of over 10 million people, by turning to online
sources of data. The analysis reveals that people who chat with each other
(using instant messaging) are more likely to share interests (their Web
searches are the same or topically similar). The more time they spend talking,
the stronger this relationship is. People who chat with each other are also
more likely to share other personal characteristics, such as their age and
location (and, they are likely to be of opposite gender). Similar findings hold
for people who do not necessarily talk to each other but do have a friend in
common. Our analysis is based on a well-defined mathematical formulation of the
problem, and is the largest such study we are aware of. Keywords: demographics, instant messaging, search, social networks | |||
| Knowledge sharing and yahoo answers: everyone knows something | | BIBAK | Full-Text | 665-674 | |
| Lada A. Adamic; Jun Zhang; Eytan Bakshy; Mark S. Ackerman | |||
| Yahoo Answers (YA) is a large and diverse question-answer forum, acting not
only as a medium for sharing technical knowledge, but as a place where one can
seek advice, gather opinions, and satisfy one's curiosity about a countless
number of things. In this paper, we seek to understand YA's knowledge sharing
and activity. We analyze the forum categories and cluster them according to
content characteristics and patterns of interaction among the users. While
interactions in some categories resemble expertise sharing forums, others
incorporate discussion, everyday advice, and support. With such a diversity of
categories in which one can participate, we find that some users focus narrowly
on specific topics, while others participate across categories. This not only
allows us to map related categories, but to characterize the entropy of the
users' interests. We find that lower entropy correlates with receiving higher
answer ratings, but only for categories where factual expertise is primarily
sought after. We combine both user attributes and answer characteristics to
predict, within a given category, whether a particular answer will be chosen as
the best answer by the asker. Keywords: expertise finding, help seeking, knowledge sharing, online communities,
question answering, social network analysis | |||
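The entropy measure over a user's category participation is simple to compute; a minimal sketch with hypothetical counts:

```python
import math
from collections import Counter

def interest_entropy(category_counts):
    """Shannon entropy of participation across categories: low entropy
    marks a narrow specialist, high entropy a generalist."""
    total = sum(category_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in category_counts.values() if c)

specialist = Counter({"Programming": 95, "Maths": 5})
generalist = Counter({"Pets": 30, "Travel": 25, "Cooking": 25, "Cars": 20})
print(interest_entropy(specialist), interest_entropy(generalist))
```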
| Tag-based social interest discovery | | BIBAK | Full-Text | 675-684 | |
| Xin Li; Lei Guo; Yihong Eric Zhao | |||
| The success and popularity of social network systems, such as del.icio.us,
Facebook, MySpace, and YouTube, have generated many interesting and challenging
problems for the research community. Among others, discovering the social interests
shared by groups of users is very important because it helps to connect people
with common interests and encourages people to contribute and share more
content. The main challenge in solving this problem comes from the difficulty
of detecting and representing the interests of users. The existing
approaches are all based on the online connections of users and are thus unable
to identify the common interests of users who have no online connections.
In this paper, we propose a novel social interest discovery approach based on user-generated tags. Our approach is motivated by the key observation that in a social network, human users tend to use descriptive tags to annotate the contents that they are interested in. Our analysis of a large amount of real-world traces reveals that, in general, user-generated tags are consistent with the web content they are attached to, while being more concise and closer to the understanding and judgments of human users about that content. Thus, patterns of frequent co-occurrences of user tags can be used to characterize and capture topics of user interests. We have developed an Internet Social Interest Discovery system, ISID, to discover common user interests and cluster users and their saved URLs by different interest topics. Our evaluation shows that ISID can effectively cluster similar documents by interest topics and discover user communities with common interests regardless of whether they have any online connections. Keywords: ISID, del.icio.us, social networks, tag | |||
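The co-occurrence idea at the heart of ISID can be sketched as frequent tag-pair mining over users' bookmarks. The support threshold is a toy value; the real system operates at far larger scale and clusters the resulting topics.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_pairs(bookmarks, min_support=2):
    """Count tag pairs that co-occur on bookmarks; pairs above the
    support threshold serve as candidate interest topics."""
    pair_counts = Counter()
    for tags in bookmarks:
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

bookmarks = [{"python", "web", "tutorial"},
             {"python", "web", "django"},
             {"travel", "photos"}]
print(frequent_tag_pairs(bookmarks))   # {('python', 'web'): 2}
```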
| Facetnet: a framework for analyzing communities and their evolutions in dynamic networks | | BIBAK | Full-Text | 685-694 | |
| Yu-Ru Lin; Yun Chi; Shenghuo Zhu; Hari Sundaram; Belle L. Tseng | |||
| We discover communities from social network data, and analyze the community
evolution. These communities are inherent characteristics of human interaction
in online social networks, as well as paper citation networks. Also,
communities may evolve over time, due to changes to individuals' roles and
social status in the network as well as changes to individuals' research
interests. We present an innovative algorithm that deviates from the
traditional two-step approach to analyze community evolutions. In the
traditional approach, communities are first detected for each time slice, and
then compared to determine correspondences. We argue that this approach is
inappropriate in applications with noisy data. In this paper, we propose
FacetNet for analyzing communities and their evolutions through a robust
unified process. In this novel framework, communities not only generate
evolutions, but are also regularized by the temporal smoothness of those evolutions.
As a result, this framework will discover communities that jointly maximize the
fit to the observed data and the temporal evolution. Our approach relies on
formulating the problem in terms of non-negative matrix factorization, where
communities and their evolutions are factorized in a unified way. Then we
develop an iterative algorithm, with proven low time complexity, which is
guaranteed to converge to an optimal solution. We perform extensive
experimental studies, on both synthetic datasets and real datasets, to
demonstrate that our method discovers meaningful communities and provides
additional insights not directly obtainable from traditional methods. Keywords: community, community net, evolution, evolution net, non-negative matrix
factorization, soft membership | |||
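A simplified rendering of the unified idea: factorize the current snapshot's adjacency matrix while penalizing divergence from the previous snapshot's community matrix. The multiplicative updates below use a squared-loss objective with a Frobenius temporal penalty, which is our simplification of FacetNet's KL-divergence, soft-membership formulation.

```python
import numpy as np

def temporal_nmf(A, W_prev, k, lam=0.5, iters=200, eps=1e-9):
    """NMF for A ~ W @ H with an extra lam * ||W - W_prev||^2 term
    pulling this snapshot's communities toward the previous ones."""
    n, m = A.shape
    rng = np.random.default_rng(0)
    W = np.abs(rng.standard_normal((n, k)))
    H = np.abs(rng.standard_normal((k, m)))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T + lam * W_prev) / (W @ H @ H.T + lam * W + eps)
    return W, H

# Two communities over six nodes; W_prev biases toward the last slice.
A = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], float)
W_prev = np.array([[1,0],[1,0],[1,0],[0,1],[0,1],[0,1]], float)
W, H = temporal_nmf(A, W_prev, k=2)
print(W.argmax(axis=1))   # per-node community assignment
```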
| Statistical properties of community structure in large social and information networks | | BIBAK | Full-Text | 695-704 | |
| Jure Leskovec; Kevin J. Lang; Anirban Dasgupta; Michael W. Mahoney | |||
| A large body of work has been devoted to identifying community structure in
networks. A community is often thought of as a set of nodes that has more
connections between its members than to the remainder of the network. In this
paper, we characterize as a function of size the statistical and structural
properties of such sets of nodes. We define the network community profile plot,
which characterizes the "best" possible community -- according to the
conductance measure -- over a wide range of size scales, and we study over 70
large sparse real-world networks taken from a wide range of application
domains. Our results suggest a significantly more refined picture of community
structure in large real-world networks than has been appreciated previously.
Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community structure similar to our observations. Keywords: community structure, conductance, graph partitioning, random walks, social
networks | |||
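The network community profile can be approximated crudely: grow a set from a seed and record its conductance at each size. The sketch below uses BFS balls, a much weaker set-finder than the local spectral methods the paper relies on, but it illustrates the conductance-versus-size curve being plotted.

```python
from collections import deque

def conductance(adj, S):
    """cut(S, rest) / min(vol(S), vol(rest)); lower = more
    community-like."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else 1.0

def ncp_by_bfs(adj, seed):
    """Record the conductance of growing BFS balls around a seed."""
    order, seen, q = [], {seed}, deque([seed])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return [(k, conductance(adj, order[:k]))
            for k in range(1, len(order))]

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5],
       4: [3, 5], 5: [3, 4]}
print(ncp_by_bfs(adj, 0))   # best (lowest) value at size 3
```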
| Why web 2.0 is good for learning and for research: principles and prototypes | | BIBAK | Full-Text | 705-714 | |
| Carsten Ullrich; Kerstin Borau; Heng Luo; Xiaohong Tan; Liping Shen; Ruimin Shen | |||
| The term "Web 2.0" is used to describe applications that distinguish
themselves from previous generations of software by a number of principles.
Existing work shows that Web 2.0 applications can be successfully exploited for
technology-enhanced learning. However, in-depth analyses of the relationship
between Web 2.0 technology on the one hand and teaching and learning on the
other hand are still rare. In this article, we analyze the technological
principles of Web 2.0 and describe their pedagogical implications for
learning. We furthermore show that Web 2.0 is well suited not only for
learning but also for research on learning: the wealth of services that is
available, and their openness regarding APIs and data, allows prototypes of
technology-supported learning applications to be assembled in an amazingly
small amount of time. These prototypes can be used to evaluate research
hypotheses quickly. We present two example prototypes and discuss the lessons
we learned from building and using them. Keywords: education, research, web 2.0 | |||
| Exploring social annotations for information retrieval | | BIBAK | Full-Text | 715-724 | |
| Ding Zhou; Jiang Bian; Shuyi Zheng; Hongyuan Zha; C. Lee Giles | |||
| Social annotation has gained increasing popularity in many Web-based
applications, leading to an emerging research area in text analysis and
information retrieval. This paper is concerned with developing probabilistic
models and computational algorithms for social annotations. We propose a
unified framework to combine the modeling of social annotations with the
language modeling-based methods for information retrieval. The proposed
approach consists of two steps: (1) discovering topics in the contents and
annotations of documents while categorizing the users by domains; and (2)
enhancing document and query language models by incorporating user domain
interests as well as topical background models. In particular, we propose a new
general generative model for social annotations, which is then simplified to a
computationally tractable hierarchical Bayesian network. Then we apply
smoothing techniques in a risk minimization framework to incorporate the
topical information to language models. Experiments are carried out on a
real-world annotation data set sampled from del.icio.us. Our results
demonstrate significant improvements over the traditional approaches. Keywords: folksonomy, information retrieval, language modeling, social annotations | |||
| Lock-free consistency control for web 2.0 applications | | BIBAK | Full-Text | 725-734 | |
| Jiangming Yang; Haixun Wang; Ning Gu; Yiming Liu; Chunsong Wang; Qiwei Zhang | |||
| Online collaboration and sharing is the central theme of many web-based
services that create the so-called Web 2.0 phenomena. Using the Internet as a
computing platform, many Web 2.0 applications set up mirror sites to provide
large-scale availability and to achieve load balance. However, in the age of
Web 2.0, where every user is also a writer and publisher, the deployment of
mirror sites makes consistency maintenance a Web scale problem. Traditional
concurrency control methods (e.g. two-phase locking, serialization, etc.) are not
up to the task for several reasons. First, the large network latency between mirror
sites makes two-phase locking a throughput bottleneck. Second, locking will
block a large portion of concurrent operations, which makes it impossible to
provide large-scale availability. On the other hand, most Web 2.0 operations do
not need strict serializability -- it is not the intention of a user who is
correcting a typo in a shared document to block another who is adding a
comment, as long as consistency can still be achieved. Thus, in order to enable
maximal online collaboration and sharing, we need a lock-free mechanism that
can maintain consistency among mirror sites on the Web. In this paper, we
propose a flexible and efficient method to achieve consistency maintenance in
the Web 2.0 world. Our experiments show its good performance improvement
compared with existing methods based on distributed lock. Keywords: concurrency control, consistency maintenance, xml | |||
| Mining, indexing, and searching for textual chemical molecule information on the web | | BIBAK | Full-Text | 735-744 | |
| Bingjun Sun; Prasenjit Mitra; C. Lee Giles | |||
| Current search engines do not support user searches for chemical entities
(chemical names and formulae) beyond simple keyword searches. Usually a
chemical molecule can be represented in multiple textual ways. A simple keyword
search would retrieve only the exact match and not the others. We show how to
build a search engine that enables searches for chemical entities and
demonstrate empirically that it improves the relevance of returned documents.
Our search engine first extracts chemical entities from text, performs novel
indexing suitable for chemical names and formulae, and supports different query
models that a scientist may require. We propose a model of hierarchical
conditional random fields for chemical formula tagging that considers long-term
dependencies at the sentence level. To support substring searches of chemical names, a
search engine must index substrings of chemical names. Indexing all possible
sub-sequences is not feasible in practice. We propose an algorithm for
independent frequent subsequence mining to discover sub-terms of chemical names
with their probabilities. We then propose an unsupervised hierarchical text
segmentation (HTS) method to represent a sequence with a tree structure based
on discovered independent frequent subsequences, so that sub-terms on the HTS
tree should be indexed. Query models with corresponding ranking functions are
introduced for chemical name searches. Experiments show that our approaches to
chemical entity tagging perform well. Furthermore, we show that index pruning
can reduce the index size and query time without changing the returned ranked
results significantly. Finally, experiments show that our approaches
out-perform traditional methods for document search with ambiguous chemical
terms. Keywords: conditional random fields, entity extraction, hierarchical text
segmentation, independent frequent subsequence, index pruning, ranking,
similarity search, substring search | |||
| Value-driven design for "infosuasive" web applications | | BIBAK | Full-Text | 745-754 | |
| Davide Bolchini; Franca Garzotto; Paolo Paolini | |||
| An infosuasive web application is mainly intended to be at the same time
informative and persuasive, i.e., it aims at supporting knowledge needs and it
has also the (declared or not declared) goal of influencing user's opinions,
attitudes and behaviors. Most web applications, in fact, are infosuasive
(except those whose aim is mainly operational). In this paper, we investigate
the complex set of elements that informs the very early design of infosuasive
web applications. We propose a conceptual framework aimed at supporting the
actors involved in this process to integrate their different viewpoints, to
organize the variety of issues that need to be analyzed, to find a direction in
the numerous design options, and to represent the results of this activity in
an effective way.
Our approach is value-driven since it is centered around the concept of communication value, regarded as a vehicle to fulfill communication goals on specific communication targets. We place the analysis of these aspects in the wider context of web requirements analysis, highlighting their relationships with business values analysis and user needs analysis. We pinpoint how values and communication goals impact on various design dimensions of infosuasive web applications -- contents, information architecture, interaction, operations, and lay-out. Our approach is multidisciplinary, and was inspired by goal-based and value-based requirements engineering (often used in web engineering), by brand design (often used in marketing), and by value-centered design "frameworks" (as proposed by the HCI community). A case study exemplifies our methodological proposal, discussing a large project in which we are currently involved. Keywords: brand, communication goals, conceptual model, globalization, persuasive
design, requirements, value, value-driven design, web design, web engineering | |||
| Organizing and sharing distributed personal web-service data | | BIBAK | Full-Text | 755-764 | |
| Roxana Geambasu; Cherie Cheung; Alexander Moshchuk; Steven D. Gribble; Henry M. Levy | |||
| The migration from desktop applications to Web-based services is scattering
personal data across a myriad of Web sites, such as Google, Flickr, YouTube,
and Amazon S3. This dispersal poses new challenges for users, making it more
difficult for them to: (1) organize, search, and archive their data, much of
which is now hosted by Web sites; (2) create heterogeneous, multi-Web-service
object collections and share them in a protected way; and (3) manipulate their
data with standard applications or scripts.
In this paper, we show that a Web-service interface supporting standardized naming, protection, and object-access services can solve these problems and can greatly simplify the creation of a new generation of object-management services for the Web. We describe the implementation of Menagerie, a proof-of-concept prototype that provides these services for Web-based applications. At a high level, Menagerie creates an integrated file and object system from heterogeneous, personal Web-service objects dispersed across the Internet. We present several object-management applications we developed on Menagerie to show the practicality and benefits of our approach. Keywords: data organization, data sharing, menagerie, web services | |||
| Matching independent global constraints for composite web services | | BIBAK | Full-Text | 765-774 | |
| Nalaka Gooneratne; Zahir Tari | |||
| Service discovery employs matching techniques to select services by
comparing their descriptions against user constraints. Semantic-based matching
approaches achieve higher recall than syntactic-based ones (as they employ
ontological reasoning mechanisms to match syntactically heterogeneous
descriptions). However, semantic-based approaches still have problems (e.g. a
lack of scalability, as an exhaustive search is often performed to locate
services conforming to constraints). This paper proposes two approaches that
deal with the problem of scalability/performance for composite service
location. First, services are indexed based on the values they assign to their
restricted attributes (the attributes restricted by a given constraint). Then,
services that assign "conforming values" to those attributes are combined to
form composite services. The first proposed approach extends a local
optimisation technique to perform the latter, since identifying such values is
NP-hard. However, this approach returns false negatives since the local
optimisation technique does not consider all the values. Hence, a second
approach that derives conforming values using domain rules is defined. The used
rules are returned with each composite service so that a user can understand
the context in which it is retrieved. Results obtained from the experiments
that varied the number of available services demonstrate that the performance
of the local optimisation-based approach is 76% better than existing
semantic-based approaches and recall is 98% higher than syntactic-based
approaches. Keywords: composite matching, service discovery, service matching | |||
| Wishful search: interactive composition of data mashups | | BIBAK | Full-Text | 775-784 | |
| Anton V. Riabov; Eric Boillet; Mark D. Feblowitz; Zhen Liu; Anand Ranganathan | |||
| With the emergence of Yahoo Pipes and several similar services, data mashup
tools have started to gain the interest of business users. Making these tools
simple and accessible to users with little or no programming experience has
become a pressing issue. In this paper we introduce MARIO (Mashup Automation
with Runtime Orchestration and Invocation), a new tool that radically
simplifies data mashup composition. We have developed an intelligent automatic
composition engine in MARIO together with a simple user interface using an
intuitive "wishful search" abstraction. It thus allows users to explore the
space of potentially composable data mashups and preview composition results as
they iteratively refine their "wishes", i.e. mashup composition goals. It also
lets users discover and make use of system capabilities without having to
understand the capabilities of individual components, and instantly reflects
changes made to the components by presenting an aggregate view of changed
capabilities of the entire system. We describe our experience with using MARIO
to compose flows of Yahoo Pipes components. Keywords: composition, programmable web, tag cloud | |||
| Extending the compatibility notion for abstract WS-BPEL processes | | BIBAK | Full-Text | 785-794 | |
| Dieter König; Niels Lohmann; Simon Moser; Christian Stahl; Karsten Wolf | |||
| WS-BPEL defines a standard for executable processes. Executable processes
are business processes which can be automated through an IT infrastructure. The
WS-BPEL specification also introduces the concept of abstract processes: In
contrast to their executable siblings, abstract processes are not executable
and can have parts where business logic is disguised. Nevertheless, the WS-BPEL
specification introduces a notion of compatibility between such an
under-specified abstract process and a fully specified executable one.
Basically, this compatibility notion defines a set of syntactical rules that
can be augmented or restricted by profiles. So far, there exist two such
profiles: the Abstract Process Profile for Observable Behavior and the Abstract
Process Profile for Templates. None of these profiles defines a concept of
behavioral equivalence. Therefore, both profiles are too strict with respect to
the rules they impose when deciding whether an executable process is compatible
with an abstract one. In this paper, we propose a novel profile that extends the
existing Abstract Process Profile for Observable Behavior by defining a
behavioral relationship. We also show that our novel profile allows for more
flexibility when deciding whether an executable and an abstract process are
compatible. Keywords: WS-BPEL, abstract profile, compliance, petri nets | |||
| Investigating web services on the world wide web | | BIBAK | Full-Text | 795-804 | |
| Eyhab Al-Masri; Qusay H. Mahmoud | |||
| Searching for Web service access points is no longer tied to service
registries as Web search engines have become a new major source for discovering
Web services. In this work, we conduct a thorough analytical investigation of
the plurality of Web service interfaces that exist on the Web today. Using our
Web Service Crawler Engine (WSCE), we collect metadata service information on
retrieved interfaces through accessible UBRs, service portals and search
engines. We use this data to determine Web service statistics and distribution
based on object sizes, types of technologies employed, and the number of
functioning services. This statistical data can be used to help determine the
current status of Web services. We find the intriguing result that 63% of
the Web services available on the Web are considered active. We further
use our findings to provide insights on improving the service retrieval
process. Keywords: UDDI, UDDI business registries, WSCE, WSDL, crawler, crawling, interface,
searching, service portals, web services | |||
| Restful web services vs. "big" web services: making the right architectural decision | | BIBAK | Full-Text | 805-814 | |
| Cesare Pautasso; Olaf Zimmermann; Frank Leymann | |||
| Recent technology trends in the Web Services (WS) domain indicate that a
solution eliminating the presumed complexity of the WS-* standards may be in
sight: advocates of REpresentational State Transfer (REST) have come to believe
that their ideas explaining why the World Wide Web works are just as applicable
to solve enterprise application integration problems and to simplify the
plumbing required to build service-oriented architectures. In this paper we
objectify the WS-* vs. REST debate by giving a quantitative technical
comparison based on architectural principles and decisions. We show that the
two approaches differ in the number of architectural decisions that must be
made and in the number of available alternatives. This discrepancy between
freedom-from-choice and freedom-of-choice explains the complexity difference
perceived. However, we also show that there are significant differences in the
consequences of certain decisions in terms of resulting development and
maintenance costs. Our comparison helps technical decision makers to assess the
two integration styles and technologies more objectively and select the one
that best fits their needs: REST is well suited for basic, ad hoc integration
scenarios, WS-* is more flexible and addresses advanced quality of service
requirements commonly occurring in enterprise computing. Keywords: HTTP, REST, SOAP, WS-* vs. REST, WSDL, architectural decision modeling,
resource oriented architectures, service oriented architectures, technology
comparison, web services | |||
| Non-intrusive monitoring and service adaptation for WS-BPEL | | BIBAK | Full-Text | 815-824 | |
| Oliver Moser; Florian Rosenberg; Schahram Dustdar | |||
| Web service processes currently lack monitoring and dynamic (runtime)
adaptation mechanisms. In highly dynamic processes, services frequently need to
be exchanged due to a variety of reasons. In this paper we present VieDAME, a
system which allows monitoring of BPEL processes according to Quality of
Service (QoS) attributes and replacement of existing partner services based on
various (pluggable) replacement strategies. The chosen replacement services can
be syntactically or semantically equivalent to the BPEL interface. Services can
be automatically replaced during runtime without any downtime of the overall
system. We implemented our solution with an aspect-oriented approach by
intercepting SOAP messages to allow services to be exchanged at runtime
with a small performance penalty, as shown in our experiments, thereby
making our approach suitable for high-availability BPEL environments. Keywords: BPEL, message mediation, monitoring, quality of service, service selection | |||
| Learning deterministic regular expressions for the inference of schemas from XML data | | BIBAK | Full-Text | 825-834 | |
| Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren | |||
| Inferring an appropriate DTD or XML Schema Definition (XSD) for a given
collection of XML documents essentially reduces to learning deterministic
regular expressions from sets of positive example words. Unfortunately, there
is no algorithm capable of learning the complete class of deterministic regular
expressions from positive examples only, as we will show. The regular
expressions occurring in practical DTDs and XSDs, however, are such that every
alphabet symbol occurs only a small number of times. As such, in practice it
suffices to learn the subclass of regular expressions in which each alphabet
symbol occurs at most k times, for some small k. We refer to such expressions
as k-occurrence regular expressions (k-OREs for short). Motivated by this
observation, we provide a probabilistic algorithm that learns k-OREs for
increasing values of k, and selects the one that best describes the sample
based on a Minimum Description Length argument. The effectiveness of the method
is empirically validated both on real world and synthetic data. Furthermore,
the method is shown to be conservative over the simpler classes of expressions
considered in previous work. Keywords: XML, regular expressions, schema inference | |||
| Efficient evaluation of generalized path pattern queries on XML data | | BIBAK | Full-Text | 835-844 | |
| Xiaoying Wu; Stefanos Souldatos; Dimitri Theodoratos; Theodore Dalamagas; Timos Sellis | |||
| Finding the occurrences of structural patterns in XML data is a key
operation in XML query processing. Existing algorithms for this operation focus
almost exclusively on path-patterns or tree-patterns. Requirements in flexible
querying of XML data have recently motivated the introduction of query
languages that allow a partial specification of path-patterns in a query. In
this paper, we focus on the efficient evaluation of partial path queries, a
generalization of path pattern queries. Our approach explicitly deals with
repeated labels (that is, multiple occurrences of the same label in a query).
We show that partial path queries can be represented as rooted dags for which a topological ordering of the nodes exists. We present three algorithms for the efficient evaluation of these queries under the indexed streaming evaluation model. The first one exploits a structural summary of data to generate a set of path-patterns that together are equivalent to a partial path query. To evaluate these path-patterns, we extend PathStack so that it can work on path-patterns with repeated labels. The second one extracts a spanning tree from the query dag, uses a stack-based algorithm to find the matches of the root-to-leaf paths in the tree, and merge-joins the matches to compute the answer. Finally, the third one exploits multiple pointers of stack entries and a topological ordering of the query dag to apply a stack-based holistic technique. An analysis of the algorithms and an extensive experimental evaluation show that the holistic algorithm outperforms the other two. Keywords: XML, XPath query evaluation | |||
| On incremental maintenance of 2-hop labeling of graphs | | BIBAK | Full-Text | 845-854 | |
| Ramadhana Bramandia; Byron Choi; Wee Keong Ng | |||
| Recent interest in XML, the Semantic Web, and Web ontologies, among other topics,
has sparked renewed interest in graph-structured databases. A fundamental
query on graphs is the reachability test of nodes. Recently, 2-hop labeling has
been proposed to index large collections of XML and/or graphs for efficient
reachability tests. However, there has been little work on updates of 2-hop
labeling. This is compounded by the fact that Web data changes over time. In
response, this paper studies the incremental maintenance of 2-hop
labeling. We identify the main reason for the inefficiency of updates of
existing 2-hop labels. We propose two updatable 2-hop labelings, hybrids of
2-hop labeling, and their incremental maintenance algorithms. The proposed
2-hop labeling is derived from graph connectivities, as opposed to set cover
which is used by all previous work. Our experimental evaluation illustrates the
space efficiency and update performance of various kinds of 2-hop labeling. The
main conclusion is that there is a natural way to trade some index size for
update performance in 2-hop labeling. Keywords: 2-hop, graph indexing, incremental maintenance, reachability test | |||
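The reachability test that 2-hop labeling supports is worth making concrete: node u reaches node v exactly when the out-label of u and the in-label of v share a hop center. Below is a minimal Python sketch with labels hand-constructed for a toy DAG; computing such labels, and (as this paper studies) maintaining them under updates, is the hard part and is not shown.

```python
# Toy DAG: a -> b -> d, a -> c -> d. Each node stores L_out (hop centers it
# can reach) and L_in (hop centers that can reach it). These labels are
# hand-constructed for illustration only.
L_out = {"a": {"a", "d"}, "b": {"b", "d"}, "c": {"c", "d"}, "d": {"d"}}
L_in = {"a": {"a"}, "b": {"a", "b"}, "c": {"a", "c"}, "d": {"d"}}

def reaches(u: str, v: str) -> bool:
    # u reaches v iff some hop center lies on a u-to-v path.
    return bool(L_out[u] & L_in[v])

assert reaches("a", "d") and reaches("b", "d")
assert not reaches("b", "c") and not reaches("d", "a")
```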
| Utility-driven load shedding for xml stream processing | | BIBAK | Full-Text | 855-864 | |
| Mingzhu Wei; Elke A. Rundensteiner; Murali Mani | |||
| Because of the high volume and unpredictable arrival rate, stream processing
systems may not always be able to keep up with the input data streams --
resulting in buffer overflow and uncontrolled loss of data. Load shedding, the
prevalent strategy for solving this overflow problem, has so far only been
considered for relational stream processing, but not for XML. Shedding applied
to XML stream processing brings new opportunities and challenges due to the complex
nested nature of XML structures. In this paper, we tackle this unsolved XML
shedding problem using a three-pronged approach. First, we develop an XQuery
preference model that enables users to specify the relative importance of
preserving different subpatterns in the XML result structure. This transforms
shedding into the problem of rewriting the user query into shed queries that
return approximate query answers with utility as measured by the given user
preference model. Second, we develop a cost model to compare the performance of
alternate shed queries. Third, we develop two shedding algorithms, OptShed and
FastShed. OptShed is guaranteed to find an optimal solution, albeit at the cost of
exponential complexity. FastShed, as confirmed by our experiments, achieves a
close-to-optimal result in a wide range of test cases. Finally, we describe the
in-automaton shedding mechanism for XQuery stream engines. The experiments show
that our proposed utility-driven shedding solutions consistently achieve higher
utility results compared to the existing relational shedding techniques. Keywords: load shedding, preference model, xml query processing, xml streams | |||
| Xml data dissemination using automata on top of structured overlay networks | | BIBAK | Full-Text | 865-874 | |
| Iris Miliaraki; Zoi Kaoudi; Manolis Koubarakis | |||
| We present a novel approach for filtering XML documents using
nondeterministic finite automata and distributed hash tables. Our approach
differs architecturally from recent proposals that deal with distributed XML
filtering; they assume an XML broker architecture, whereas our solution is
built on top of distributed hash tables. The essence of our work is a
distributed implementation of YFilter, a state-of-the-art automata-based XML
filtering system on top of Chord. We experimentally evaluate our approach and
demonstrate that our algorithms can scale to millions of XPath queries under
various filtering scenarios, and also exhibit very good load balancing
properties. Keywords: automata, load balancing, structured overlay networks, xml data
dissemination | |||
| Sewnet: a framework for creating services utilizing telecom functionality | | BIBAK | Full-Text | 875-884 | |
| Sumit Mittal; Dipanjan Chakraborty; Sunil Goyal; Sougata Mukherjea | |||
| With Telecom market reaching saturation in many geographies and revenues
from voice calls decreasing, Telecom operators are trying to identify new
sources of revenue. For this purpose, these operators can take advantage of
their core functionalities like Location, Call Control, etc. by exposing them
as services to be composed by developers with third party offerings available
over the Web. To hide the complexity of underlying Telecom protocols from
application developers, the operators are steadily adopting Service Oriented
Architecture (SOA) and reference standards like Parlay-X and IMS. However, a
number of challenges still remain in rapid utilization of Telecom
functionalities for creating new applications -- existence of multiple
protocols, different classes of developers, and the need to coordinate and
manage usage of these functionalities. In this paper, we present SewNet, a
framework for creating applications exploiting Telecom functionality exposed
over a (converged) IP network. More specifically, SewNet a) provides an
abstraction model for encapsulating invocation, coordination and enrichment of
the Telecom functionalities, b) renders a service creation environment on top
of this model, and c) caters to different categories of developers.
With the help of two use-case scenarios, we demonstrate how SewNet can create
services utilizing rich Telecom functionality. Keywords: abstraction model, service composition, telecom, web 2.0 | |||
| Characterizing typical and atypical user sessions in clickstreams | | BIBAK | Full-Text | 885-894 | |
| Narayanan Sadagopan; Jie Li | |||
| Millions of users retrieve information from the Internet using search
engines. Mining these user sessions can provide valuable information about the
quality of user experience and the perceived quality of search results. Often
search engines rely on accurate estimates of Click Through Rate (CTR) to
evaluate the quality of user experience. The vast heterogeneity in the user
population and the presence of automated software programs (bots) can result in
high variance in the estimates of CTR. To improve the estimation accuracy of
user experience metrics like CTR, we argue that it is important to identify
typical and atypical user sessions in clickstreams. Our approach to identify
these sessions is based on detecting outliers using Mahalanobis distance in the
user session space. Our user session model incorporates several key clickstream
characteristics including a novel conformance score obtained by Markov Chain
analysis. Editorial results show that our approach of identifying typical and
atypical sessions has a precision of about 89%. Filtering out these atypical
sessions reduces the uncertainty (95% confidence interval) of the mean CTR by
about 40%. These results demonstrate that our approach of identifying typical
and atypical user sessions is extremely valuable for cleaning "noisy" user
session data for increased accuracy in evaluating user experience. Keywords: clickstream analysis, outlier detection, web search | |||
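The outlier test at the heart of this abstract is easy to sketch. Below is a minimal, hypothetical version: each session is reduced to a numeric feature vector (the paper's model also includes a Markov-chain conformance score, omitted here), and sessions whose Mahalanobis distance from the mean exceeds a chosen threshold are flagged as atypical.

```python
import numpy as np

# Hypothetical session features (e.g., clicks per query, dwell time,
# conformance score); each row is one user session.
rng = np.random.default_rng(0)
sessions = rng.normal(loc=[5.0, 30.0, 0.8], scale=[1.0, 5.0, 0.1],
                      size=(1000, 3))
sessions[:10] *= 4  # inject a few atypical (bot-like) sessions

mu = sessions.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(sessions, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Flag sessions whose distance exceeds the threshold as atypical.
threshold = 3.5
atypical = [i for i, s in enumerate(sessions) if mahalanobis(s) > threshold]
print(f"{len(atypical)} atypical sessions flagged")
```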
| Video suggestion and discovery for youtube: taking random walks through the view graph | | BIBAK | Full-Text | 895-904 | |
| Shumeet Baluja; Rohan Seth; D. Sivakumar; Yushi Jing; Jay Yagnik; Shankar Kumar; Deepak Ravichandran; Mohamed Aly | |||
| The rapid growth of the number of videos in YouTube provides enormous
potential for users to find content of interest to them. Unfortunately, given
the difficulty of searching videos, the size of the video repository also makes
the discovery of new content a daunting task. In this paper, we present a novel
method based upon the analysis of the entire user-video graph to provide
personalized video suggestions for users. The resulting algorithm, termed
Adsorption, provides a simple method to efficiently propagate preference
information through a variety of graphs. We extensively test the results of the
recommendations on a three-month snapshot of live data from YouTube. Keywords: collaborative filtering, label propagation, random walks, recommendation
systems, video search | |||
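The core of propagating preferences through a view graph can be sketched compactly. The toy code below is a simplification of the Adsorption idea described above (injection and abandonment probabilities are reduced to a fixed self-injection at video nodes): video labels diffuse through user-video edges, and a user's top-weighted unseen video becomes the suggestion.

```python
from collections import defaultdict

# Hypothetical user-video view log; edges are views.
views = {"u1": ["v1", "v2"], "u2": ["v2", "v3"], "u3": ["v3"]}

# Build the undirected bipartite adjacency.
adj = defaultdict(set)
for user, vids in views.items():
    for v in vids:
        adj[user].add(v)
        adj[v].add(user)

# Each video node carries its own label; propagate for a few rounds.
labels = {n: ({n: 1.0} if n.startswith("v") else {}) for n in adj}
for _ in range(5):
    new = {}
    for n in adj:
        dist = defaultdict(float)
        for nb in adj[n]:
            for lab, w in labels[nb].items():
                dist[lab] += w / len(adj[n])  # average over neighbors
        if n.startswith("v"):
            dist[n] += 1.0  # simplified self-injection at video nodes
        new[n] = dict(dist)
    labels = new

# Suggest to u1 the unseen video with the highest propagated weight.
scores = {v: w for v, w in labels["u1"].items() if v not in adj["u1"]}
print(max(scores, key=scores.get) if scores else "no suggestion")  # -> v3
```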
| How people use the web on mobile devices | | BIBAK | Full-Text | 905-914 | |
| Yanqing Cui; Virpi Roto | |||
| This paper describes a series of user studies on how people use the Web via
mobile devices. The data primarily comes from contextual inquiries with 47
participants between 2004 and 2007, and is complemented with a phone log
analysis of 577 panelists in 2007. We report four key contextual factors in
using the Web on mobile devices and propose a mobile Web activity taxonomy. The
framework contains three user activity categories identical to previous
stationary Web studies: information seeking, communication, and transaction,
and a new category: personal space extension. The new category refers to the
practice of people putting their content on the Web for personal access,
thereby extending their personal information space. Keywords: activity taxonomy, content object handling, information seeking, mobile web,
personal space extension | |||
| Planetary-scale views on a large instant-messaging network | | BIBAK | Full-Text | 915-924 | |
| Jure Leskovec; Eric Horvitz | |||
| We present a study of anonymized data capturing a month of high-level
communication activities within the whole of the Microsoft Messenger
instant-messaging system. We examine characteristics and patterns that emerge
from the collective dynamics of large numbers of people, rather than the
actions and characteristics of individuals. The dataset contains summary
properties of 30 billion conversations among 240 million people. From the data,
we construct a communication graph with 180 million nodes and 1.3 billion
undirected edges, creating the largest social network constructed and analyzed
to date. We report on multiple aspects of the dataset and synthesized graph. We
find that the graph is well-connected and robust to node removal. We
investigate, on a planetary scale, the oft-cited report that people are separated
by "six degrees of separation" and find that the average path length among
Messenger users is 6.6. We find that people tend to communicate more with each
other when they have similar age, language, and location, and that cross-gender
conversations are both more frequent and of longer duration than same-gender
conversations. Keywords: communication networks, large data, online communication, social networks,
user demographics | |||
| Online auctions efficiency: a survey of ebay auctions | | BIBAK | Full-Text | 925-934 | |
| Wenyan Hu; Alvaro Bolivar | |||
| Online auctions have become a pervasive transaction mechanism for
e-commerce. As the largest online marketplace in the world, eBay is an
attractive case study that enables the study of online auctions utilizing data
involving real people and transactions.
In this paper, we present a detailed investigation and analysis of multiple online auction properties, including consumer surplus, sniping, bidding strategy, and their cross-relationships. Our goal is to evaluate the theoretical foundations of online auctions and discover patterns and behaviors previously hidden due to the lack of real and extensive transaction data. Among our findings, we uncover an important correlation between sniping and high surplus ratios, which implies the uncertainty of true value in a competitive environment. The key issue is the wrong assumption that bidders' valuations are independent of each other, which leads to inefficient auctions. In order to address the inefficiencies of current online formats, we introduce a declining price auction model customized for online transactions. Conceptually, this model ought to deal with the complexities of competition in an online environment while maximizing social welfare. Keywords: auction efficiency, auction theory, online auctions | |||
| Organizing the unorganized -- employing IT to empower the under-privileged | | BIBAK | Full-Text | 935-944 | |
| Arun Kumar; Nitendra Rajput; Sheetal Agarwal; Dipanjan Chakraborty; Amit Anit Nanavati | |||
| Various sectors in developing countries are typically dominated by the
presence of a large number of small and micro-businesses that operate in an
informal, unorganized manner. Many of these are single-person-run
micro-businesses and cannot afford to buy and maintain their own IT
infrastructure. For others, easy availability of cheap labour provides a
convenient alternative even though it results in inefficiency, as little or no
records are maintained, and only manual, paper-based processes are followed.
This results in high response times for customers, no formal accountability and
higher charges. For the businesses this translates to lower earnings and losses
due to inefficiencies. In this paper, we look at a few such micro-business
segments and explore their current models of operation, while identifying
existing inefficiencies and pain points. We build upon the findings and propose
an approach for delivering benefits of IT solutions to such micro-business
segments. Finally, we present technology that realizes the proposed approach in
the specific context of two such segments. Keywords: developing regions, virtual communities, voice applications | |||
| Dtwiki: a disconnection and intermittency tolerant wiki | | BIBAK | Full-Text | 945-952 | |
| Bowei Du; Eric A. Brewer | |||
| Wikis have proven to be a valuable tool for collaboration and content
generation on the web. Simple semantics and ease-of-use make wiki systems well
suited to meeting many needs of emerging regions in the areas of education,
collaboration and local content generation. Despite their usefulness, current
wiki software does not work well in the network environments found in emerging
regions. For example, it is common to have long-lasting network partitions due
to cost, power and poor connectivity. Network partitions make a traditional
centralized wiki architecture unusable due to the unavailability of the central
server. Existing solutions for addressing connectivity problems include
web-caching proxies and snapshot distribution. While proxies and snapshots
allow wiki data to be read while disconnected, they prevent users from
contributing updates back to the wiki.
In this paper we detail the design and implementation of DTWiki, a wiki system which explicitly addresses the problem of operating a wiki system in an intermittent environment. The DTWiki system is able to cope with long-lasting partitions and bad connectivity while providing the functionality of popular wiki software such as MediaWiki and TWiki. Keywords: delay-tolerant networking, developing regions, wiki | |||
| Action science approach to nonprofit housing services using web 2.0 mapping tools | | BIBAK | Full-Text | 953-958 | |
| Yao-Jen Chang; Hsin-Yu Hsu; Tsen-Yung Wang | |||
| The study follows an action science approach to the problem of nonprofit
housing services. After 4 months of action science-based activities, such as
organized participant observations, in-depth interviews, field work, and focus
group studies, the main findings are (1) Web 2.0 suits nonprofit organizations
better than traditional Web sites in terms of maintenance cost and usability, (2)
mapping tools make better GUI with respect to web-based housing services, and
(3) context-aware personalization can translate to better user experiences as
an RDFS-based working prototype has been built and tested. A user survey shows
a high level of user satisfaction. Although the case study was carried out in a
nonprofit housing organization, the practices in action research approach can
be applied to other NPOs as well. Keywords: RDFS, action research, context-aware services, mapping tools, web 2.0 | |||
| Hidden sentiment association in Chinese web opinion mining | | BIBAK | Full-Text | 959-968 | |
| Qi Su; Xinying Xu; Honglei Guo; Zhili Guo; Xian Wu; Xiaoxun Zhang; Bin Swen; Zhong Su | |||
| The boom of product review websites, blogs and forums on the web has
attracted many research efforts on opinion mining. Recently, there has been
growing interest in finer-grained opinion mining, which detects opinions on
different review features as opposed to the whole review level. Research
on feature-level opinion mining mainly relies on identifying the explicit
relatedness between product feature words and opinion words in reviews.
However, the sentiment relatedness between the two objects is usually
complicated. For many cases, product feature words are implied by the opinion
words in reviews. The detection of such hidden sentiment association is still a
big challenge in opinion mining. Feature-level opinion mining on Chinese
reviews is an even harder task due to the nature of the Chinese language. In
this paper, we propose a novel mutual reinforcement approach to deal with
the feature-level opinion mining problem. More specifically, 1) the
approach clusters product features and opinion words simultaneously and
iteratively by fusing both their content information and sentiment link
information. 2) under the same framework, based on the product feature
categories and opinion word groups, we construct the sentiment association set
between the two groups of data objects by identifying their strongest n
sentiment links. Moreover, knowledge from multiple sources is incorporated to
enhance clustering in the procedure. Based on the pre-constructed association
set, our approach can largely predict opinions relating to different product
features, even for the case without the explicit appearance of product feature
words in reviews. Thus it provides a more accurate opinion evaluation. The
experimental results demonstrate that our method outperforms state-of-the-art
algorithms. Keywords: association, mutual reinforcement, opinion mining, opinion word, product
feature | |||
| Can Chinese web pages be classified with English data source? | | BIBAK | Full-Text | 969-978 | |
| Xiao Ling; Gui-Rong Xue; Wenyuan Dai; Yun Jiang; Qiang Yang; Yong Yu | |||
| As the World Wide Web in China grows rapidly, mining knowledge in Chinese
Web pages becomes more and more important. Mining Web information usually
relies on the machine learning techniques which require a large amount of
labeled data to train credible models. Although the number of Chinese Web pages
is increasing quite fast, labeled Chinese data remain scarce. However, there are
relatively sufficient labeled English Web pages. These labeled data, though in
different linguistic representations, share a substantial amount of semantic
information with Chinese ones, and can be utilized to help classify Chinese Web
pages. In this paper, we propose an information bottleneck based approach to
address this cross-language classification problem. Our algorithm first
translates all the Chinese Web pages to English. Then, all the Web pages,
including Chinese and English ones, are encoded through an information
bottleneck which can allow only limited information to pass. Therefore, in
order to retain as much useful information as possible, the common part between
Chinese and English Web pages is inclined to be encoded to the same code (i.e.
class label), which makes the cross-language classification accurate. We
evaluated our approach using the Web pages collected from Open Directory
Project (ODP). The experimental results show that our method significantly
improves several existing supervised and semi-supervised classifiers. Keywords: cross-language classification, information bottleneck | |||
| Substructure similarity measurement in Chinese recipes | | BIBAK | Full-Text | 979-988 | |
| Liping Wang; Qing Li; Na Li; Guozhu Dong; Yu Yang | |||
| Improving the precision of information retrieval has been a challenging
issue on Chinese Web. As exemplified by Chinese recipes on the Web, it is not
easy/natural for people to use keywords (e.g. recipe names) to search recipes,
since the names can be literally so abstract that they do not bear much, if
any, information on the underlying ingredients or cooking methods. In this
paper, we investigate the underlying features of Chinese recipes, and based on
workflow-like cooking procedures, we model recipes as graphs. We further
propose a novel similarity measurement based on the frequent patterns, and
devise an effective filtering algorithm to prune unrelated data so as to
support efficient on-line searching. Benefiting from the characteristics of
graphs, frequent common patterns can be mined from a cooking graph database. So
in our prototype system called RecipeView, we extend the subgraph mining
algorithm FSG to cooking graphs and combine it with our proposed similarity
measurement, resulting in an approach that well caters for specific users'
needs. Our initial experimental studies show that the filtering algorithm can
efficiently prune unrelated cooking graphs without affecting the retrieval
performance, and that the similarity measurement achieves relatively higher
precision/recall than its counterparts. Keywords: cooking graph, filtering, recipes, similarity measurement, subgraph mining | |||
| Efficient multi-keyword search over p2p web | | BIBAK | Full-Text | 989-998 | |
| Hanhua Chen; Hai Jin; Jiliang Wang; Lei Chen; Yunhao Liu; Lionel M. Ni | |||
| Current search mechanisms of DHT-based P2P systems can handle a single
keyword search problem well. Beyond single keyword search, multi-keyword search
is quite popular and useful in many real applications. Simply using the
solution for single keyword search will require distributed intersection/union
operations in wide area networks, leading to unacceptable traffic costs. Since
the Bloom Filter (BF) is well known to be effective in reducing traffic, we
use BF encoding to handle multi-keyword search.
Applying BF is not difficult, but how to get optimal results is not trivial. In this study we show, through mathematical proof, that the optimal setting of BF in terms of traffic cost is determined by the global statistical information of keywords, not the minimized false positive rate as claimed by previous methods. Through extensive experiments, we demonstrate how to obtain optimal settings. We further argue that the intersection order between sets is important for multi-keyword search. Thus, we design optimal order strategies based on BF for both "and" and "or" queries. To better evaluate the performance of this design, we conduct extensive simulations on the TREC WT10G test collection and the query log of a commercial search engine. Results show that our design significantly reduces the search traffic of the existing approach by 73%. Keywords: DHT, P2P, bloom filter, multi-keyword search | |||
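The traffic-saving idea is easy to illustrate: for "web AND p2p", the node holding the posting list for one keyword ships a Bloom filter of its list rather than the list itself, and the other node intersects locally, at the cost of a few false positives. A minimal sketch follows; the paper's actual contribution, tuning the filter parameters from global keyword statistics, is not reproduced here.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for illustrating BF-encoded intersection."""
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, item: str):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._hashes(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._hashes(item))

# Node A holds docs for "web"; node B holds docs for "p2p". Instead of
# shipping its full posting list, A ships a compact BF; B intersects locally.
docs_web = {f"doc{i}" for i in range(0, 100, 2)}
docs_p2p = {f"doc{i}" for i in range(0, 100, 3)}

bf = BloomFilter()
for d in docs_web:
    bf.add(d)

candidates = {d for d in docs_p2p if d in bf}  # may hold false positives
print(len(candidates), "candidate docs for 'web AND p2p'")
```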
| Towards a global schema for web entities | | BIBAK | Full-Text | 999-1008 | |
| Conglei Yao; Yongjian Yu; Sicong Shou; Xiaoming Li | |||
| Popular entities often have thousands of instances on the Web. In this
paper, we focus on the case where they are presented in table-like format,
namely appearing with their attribute names. It is observed that, on one hand,
for the same entity, different web pages often incorporate different
attributes; on the other, for the same attribute, different web pages often use
different attribute names (labels). Therefore, it is understandably difficult to
produce a global attribute schema for all the web entities of a given entity
type based on their web instances, although the global attribute schema is
usually highly desired in web entity instance integration and web object
extraction. To this end, we propose a novel framework of automatically learning
a global attribute schema for all web entities of one specific entity type.
Under this framework, an iterative instances extraction procedure is first
employed to extract sufficient web entity instances to discover enough
attribute labels. Next, based on the labels, entity instances, and related web
pages, a maximum entropy-based schema discovery approach is adopted to learn
the global attribute schema for the target entity type. Experimental results on
the Chinese Web achieve weighted average F-scores of 0.7122 and 0.7123 on two
global attribute schemas for person-type and movie-type web entities,
respectively. These results show that our framework is general, efficient and
effective. Keywords: attribute labels, entity instances, global attribute schema, web entities | |||
| Web video topic discovery and tracking via bipartite graph reinforcement model | | BIBAK | Full-Text | 1009-1018 | |
| Lu Liu; Lifeng Sun; Yong Rui; Yao Shi; Shiqiang Yang | |||
| Automatic topic discovery and tracking on web-shared videos can greatly
benefit both web service providers and end users. Most current solutions for
topic detection and tracking were developed for news and cannot be directly
applied to web videos, because web videos carry much less semantic information
than news videos. In this paper, we propose a bipartite graph model to
address this issue. The bipartite graph represents the correlation between web
videos and their keywords, and automatic topic discovery is achieved through
two steps -- coarse topic filtering and fine topic re-ranking. First, a
weight-updating co-clustering algorithm is employed to filter out topic
candidates at a coarse level. Then the videos on each topic are re-ranked by
analyzing the link structures of the corresponding bipartite graph. After the
topics are discovered, the interesting ones can also be tracked over a period
of time using the same bipartite graph model. The key is to propagate the
relevance scores and keywords from the videos of interest to other relevant
ones through the bipartite graph links. Experimental results on real web videos
from YouKu, a YouTube counterpart in China, demonstrate the effectiveness of
the proposed methods. We report very promising results. Keywords: bipartite graph model, co-clustering, reinforcement, topic discovery, topic
tracking, web videos | |||
| The seamless browser: enhancing the speed of web browsing by zooming and preview thumbnails | | BIBAK | Full-Text | 1019-1020 | |
| ByungIn Yoo; JongHo Lea; YeunBae Kim | |||
| In this paper, we present a new web browsing system, Seamless Browser, for
fast link traversal on a large screen such as a TV. In navigating the Web, users
mainly suffer from the cognitive overhead of determining whether or not to follow links.
This overhead can be reduced by providing preview information of the
destination of links, and also by providing semantic cues on the nearest
location in relation to the anchor. In order to reduce disorientation and
annoyance from the preview information, we propose that users focus on a
small area around the pointer, and that a small number of hyperlink previews
in the focused area appear differently depending on the distance between
the pointer and the hyperlinks: the nearer the distance, the richer the
content of the information scent. We also propose that users can navigate
the link paths by controlling the pointer and the zooming interface, so that
users may go backward and forward seamlessly along several possible link paths.
We found that combining the pointer and a zoom significantly improved
performance for navigational tasks. Keywords: preview thumbnails, seamless browser, zoomable interaction | |||
| What do they think?: aggregating local views about news events and topics | | BIBAK | Full-Text | 1021-1022 | |
| Jiahui Liu; Larry Birnbaum | |||
| The web has become an important medium for news delivery and consumption.
Fresh content about a variety of topics and events is constantly being created
and published on the web by many sources. As intuitively understood by readers,
and studied in journalism, news articles produced by different social groups
present different attitudes towards and interpretations of the same news
issues. In this paper, we propose a new paradigm for aggregating news articles
according to the news sources related to the stakeholders of the news issues.
We implement this paradigm in a prototype system called LocalSavvy. The system
provides users with the capability to aggregate and browse various local views about
the news issues in which they are interested. Keywords: local opinion, news aggregation | |||
| Personalized view-based search and visualization as a means for deep/semantic web data access | | BIBAK | Full-Text | 1023-1024 | |
| Michal Tvarozek; Maria Bielikova | |||
| Effective access to and navigation in information stored in deep Web
ontological repositories or relational databases has yet to be realized due to
issues with the usability of user interfaces, the overall scope and complexity
of the information, and the nature of exploratory user tasks. We propose the
integration and adaptation of novel navigation and visualization approaches to
faceted browsing such as visual depiction of facets and restrictions, visual
navigation in (clusters of) search results, and graph-like exploration of
individual search results' properties. Keywords: graph visualization, navigation, personalized faceted browsing, view-based
search, visualization | |||
| Personalized multimedia web summarizer for tourist | | BIBAK | Full-Text | 1025-1026 | |
| Xiao Wu; Jintao Li; Yongdong Zhang; Sheng Tang; Shi-Yong Neo | |||
| In this paper, we highlight the use of multimedia technology in generating
intrinsic summaries of tourism related information. The system utilizes an
automated process to gather, filter and classify information on various tourist
spots on the Web. The end result presented to the user is a personalized
multimedia summary, generated with respect to the user's queries, filled with
text, images, video and real-time news, and made retrievable on mobile devices.
Preliminary experiments demonstrate the superiority of our presentation scheme
to traditional methods. Keywords: classification, summarization, video and image analysis | |||
| A generic framework for collaborative multi-perspective ontology acquisition | | BIBAK | Full-Text | 1027-1028 | |
| Maayan Zhitomirsky-Geffet; Judit Bar-Ilan; Yitzchak Miller; Snunith Shoham | |||
| The research objective of this work is to develop a general framework that
incorporates collaborative social tagging with a novel ontology scheme
conveying multiple perspectives. We propose a framework where multiple users
tag the same object (an image in our case), and an ontology is extended based
on these tags while remaining tolerant of different points of view. We are not
aware of any other work that attempted to devise such an environment and to
study its dynamics. The proposed framework characterizes the underlying
processes for controlled collaborative development of a multi-perspective
ontology and its application to improve image annotation, searching and
browsing. Our case study experiment with a set of selected annotated images
indicates the soundness of the proposed ontological model. Keywords: collaborative multi-perspective ontology | |||
| As we may perceive: finding the boundaries of compound documents on the web | | BIBAK | Full-Text | 1029-1030 | |
| Pavel Dmitriev | |||
| This paper considers the problem of identifying compound documents (cDocs)
on the Web -- groups of web pages that in aggregate constitute
semantically coherent information entities. Examples of cDocs are a news
article consisting of several html pages, or a set of pages describing
specifications, price, and reviews of a digital camera. Being able to identify
cDocs would be useful in many applications including web and intranet search,
user navigation, automated collection generation, and information extraction.
In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Based on our previous work [4], this paper proposes a different scenario for discovering cDocs, and compares in this new setting the local machine-learned clustering algorithm from [4] to a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes. Keywords: compound documents, machine learning, www | |||
| Personalized search and exploration with mytag | | BIBK | Full-Text | 1031-1032 | |
| Max Braun; Klaas Dellschaft; Thomas Franz; Dominik Hering; Peter Jungen; Hagen Metzler; Eugen Müller; Alexander Rostilov; Carsten Saathoff | |||
Keywords: personomy, searching, tagging | |||
| Emergence of terminological conventions as an author-searcher coordination game | | BIBAK | Full-Text | 1033-1034 | |
| David Bodoff; Sheizaf Rafaeli | |||
| All information exchange on the Internet -- whether through full text,
controlled vocabularies, ontologies, or other mechanisms -- ultimately requires
that an information provider and seeker use the same word or symbol. In this
paper, we investigate what happens when both searchers and authors are
dynamically choosing terms to match the other side. With each side trying to
anticipate the other, does a terminological convention ever emerge, or do
searchers and providers continue to miss potential partners through mismatch
of terms? We use a game-theoretic setup to frame questions, and learning theory
to make predictions about whether and which term will emerge as a convention. Keywords: game theory, keyword advertising, learning, manual indexing | |||
| System II: a hypergraph based native rdf repository | | BIBAK | Full-Text | 1035-1036 | |
| Gang Wu; Juanzi Li; Kehong Wang | |||
| To manage the increasing amount of RDF data, an RDF repository should
provide not only necessary scalability and efficiency, but also sufficient
inference capabilities. In this paper, we propose a native RDF repository,
System II, to pursue a better tradeoff among the above requirements. System II takes
the hypergraph representation for RDF as the data model for its persistent
storage, which effectively avoids the costs of data model transformation when
accessing RDF data. In addition, a set of efficient semantic query processing
techniques is designed. The results of the performance evaluation on the LUBM
benchmark show that System II has a better combined metric value than the other
comparable systems. Keywords: hypergraph, rdf, repository | |||
| Efficient vectorial operators for processing xml twig queries | | BIBAK | Full-Text | 1037-1038 | |
| Guoliang Li; Jianhua Feng; Jianyong Wang; Feng Lin; Lizhu Zhou | |||
| This paper proposes several vectorial operators for processing XML twig
queries, which are easy to perform and inherently efficient for both
Ancestor-Descendant (A-D) and Parent-Child (P-C) relationships. We develop
optimizations on the vectorial operators to improve the efficiency of answering
twig queries holistically. We propose an algorithm to answer GTP queries based
on our vectorial operators. Keywords: holistic twig join, subsequence match, vectorial operators | |||
| User behavior oriented web spam detection | | BIBAK | Full-Text | 1039-1040 | |
| Yiqun Liu; Min Zhang; Shaoping Ma; Liyun Ru | |||
| Combating Web spam has become one of the top challenges for Web search
engines. State-of-the-art spam detection techniques are usually designed for
specific known types of Web spam and are ineffective and inefficient against
newly appearing spam. By analyzing user behavior in Web access logs, we
propose a spam page detection algorithm based on Bayes learning. Preliminary
experiments on Web access data collected by a commercial Web site (containing
over 2.74 billion user clicks in 2 months) show the effectiveness of the
proposed detection framework and algorithm. Keywords: spam detection, user behavior analysis, web search engine | |||
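A minimal naive-Bayes sketch in the spirit of this abstract, with entirely hypothetical behavior features (the paper's actual feature set is derived from Web access logs and is not reproduced here): pages are scored by the log-odds of spam given binary user-behavior indicators.

```python
import math
from collections import defaultdict

# Hypothetical training data: per-page binary user-behavior indicators
# (e.g. "short_dwell" = most visits bounce quickly) with spam/ham labels.
train = [
    ({"short_dwell", "few_return_visits"}, "spam"),
    ({"short_dwell", "search_engine_only_traffic"}, "spam"),
    ({"long_dwell", "many_return_visits"}, "ham"),
    ({"long_dwell", "bookmark_traffic"}, "ham"),
]

prior = defaultdict(int)
cond = defaultdict(lambda: defaultdict(int))
for feats, label in train:
    prior[label] += 1
    for f in feats:
        cond[label][f] += 1

def spam_log_odds(feats):
    # Log-odds of spam vs. ham with add-one smoothing.
    score = math.log(prior["spam"] / prior["ham"])
    for f in feats:
        p_spam = (cond["spam"][f] + 1) / (prior["spam"] + 2)
        p_ham = (cond["ham"][f] + 1) / (prior["ham"] + 2)
        score += math.log(p_spam / p_ham)
    return score

print(spam_log_odds({"short_dwell", "few_return_visits"}))  # > 0: likely spam
```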
| Feature weighting in content based recommendation system using social network analysis | | BIBAK | Full-Text | 1041-1042 | |
| Souvik Debnath; Niloy Ganguly; Pabitra Mitra | |||
| We propose a hybridization of collaborative filtering and content-based
recommendation systems. Attributes used for content-based recommendations are
assigned weights depending on their importance to users. The weight values are
estimated from a set of linear regression equations obtained from a social
network graph which captures human judgment about similarity of items. Keywords: feature similarity, recommender system, social network | |||
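A compact sketch of the estimation step described above, with hypothetical data: pairwise human-judged similarities (which the paper derives from a social network graph) are regressed on per-attribute similarities, and the fitted coefficients become the attribute weights for content-based scoring.

```python
import numpy as np

# Rows: item pairs; columns: similarity on hypothetical attributes
# (genre, price, brand). Targets: human-judged pair similarity.
attr_sim = np.array([
    [1.0, 0.2, 1.0],
    [0.0, 0.9, 0.0],
    [1.0, 0.5, 0.0],
    [0.0, 0.1, 1.0],
])
judged_sim = np.array([0.9, 0.3, 0.6, 0.4])  # invented judgments

# Least-squares solution of the regression equations -> attribute weights.
weights, *_ = np.linalg.lstsq(attr_sim, judged_sim, rcond=None)
print(dict(zip(["genre", "price", "brand"], weights.round(2))))

# Weighted content-based similarity of a new item pair:
new_pair = np.array([1.0, 0.8, 1.0])
print("weighted similarity:", float(new_pair @ weights))
```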
| Extracting spam blogs with co-citation clusters | | BIBAK | Full-Text | 1043-1044 | |
| Kazunari Ishida | |||
| This paper reports the estimated number of spam blogs in order to assess
their current state in the blogosphere. To extract spam blogs, I developed a
traversal method among co-citation clusters of blogs from a spam seed. Spam
seeds were collected based on high out-degree and spam keywords. According to
the experiment, a mixed seed set composed of high out-degree and spam keyword
seeds is more effective than individual seed sets in terms of F-Measure. In
conclusion, mixed seeds from different methods are effective in improving the
F-Measure results of spam extraction with co-citation clusters. Keywords: advertisement link, co-citation cluster, spam blog extraction | |||
| Race: finding and ranking compact connected trees for keyword proximity search over xml documents | | BIBAK | Full-Text | 1045-1046 | |
| Guoliang Li; Jianhua Feng; Jianyong Wang; Bei Yu; Yukai He | |||
| In this paper, we study the problem of keyword proximity search over XML
documents, balancing efficiency and effectiveness. We take the
disjunctive semantics among input keywords into consideration and identify
meaningful compact connected trees as the answers of keyword proximity queries.
We introduce the notions of Compact Lowest Common Ancestor (CLCA) and Maximal
CLCA (MCLCA) and propose Compact Connected Trees (CCTrees) and Maximal CCTrees
(MCCTrees) to efficiently and effectively answer keyword queries. We propose a
novel ranking mechanism, RACE, to Rank compAct Connected trEes, by taking into
consideration both the structural similarity and the textual similarity. Our
extensive experimental study shows that our method achieves both high search
efficiency and effectiveness, and outperforms existing approaches
significantly. Keywords: compact lca (clca), lowest common ancestor (lca), maximal clca (mclca) | |||
| The scale-free nature of semantic web ontology | | BIBAK | Full-Text | 1047-1048 | |
| Hongyu Zhang | |||
| Semantic web ontology languages, such as OWL, have been widely used for
knowledge representation. Through empirical analysis of real-world ontologies
we discover that, like many natural and social phenomena, the semantic web
ontology is also "scale-free". Keywords: ontology, power-law, scale-free, semantic web | |||
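A quick way to check such a claim on one's own data is a log-log fit of the degree distribution; a roughly linear relation with negative slope is consistent with a power law (rigorous power-law fitting uses maximum likelihood, so treat this as a sanity check). The degree counts below are invented.

```python
import math
from collections import Counter

# Hypothetical node degrees (e.g., term-usage counts in an ontology).
degrees = [1] * 500 + [2] * 120 + [3] * 50 + [4] * 25 + [8] * 6 + [16] * 2

counts = Counter(degrees)
xs = [math.log(d) for d in counts]
ys = [math.log(c) for c in counts.values()]

# Least-squares slope on the log-log plot ~ -gamma, the power-law exponent.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
print(f"estimated exponent gamma ~ {-slope:.2f}")
```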
| Asymmetrical query recommendation method based on bipartite network resource allocation | | BIBAK | Full-Text | 1049-1050 | |
| Zhiyuan Liu; Maosong Sun | |||
| This paper presents a new query recommendation method that generates
a recommended query list by mining large-scale user logs. Starting from the user
logs of click-through data, we construct a bipartite network where the nodes on
one side correspond to unique queries, on the other side to unique URLs.
Inspired by the bipartite network based resource allocation method, we try to
extract the hidden information from the Query-URL bipartite network. The
recommended queries generated by the method are asymmetrical, which means two
related queries may have different strengths in recommending each other. To
evaluate the method, we use one week user logs from Chinese search engine
Sogou. The method is not only 'content ignorant', but can also be easily
implemented in a parallel manner, which makes it feasible for commercial search
engines to handle large-scale user logs. Keywords: asymmetrical query recommendation, bipartite network, network resource
allocation, user log analysis | |||
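The resource-allocation step on a query-URL bipartite network can be sketched in a few lines, using a hypothetical click log: a query spreads unit resource to its clicked URLs, each URL spreads its share back to the queries that clicked it, and because spreading normalizes by each node's own degree, the resulting query-to-query strengths are asymmetric.

```python
from collections import defaultdict

clicks = {  # hypothetical click-through log: query -> clicked URLs
    "python tutorial": {"u1", "u2"},
    "learn python": {"u1", "u3"},
    "python download": {"u2"},
}

url_queries = defaultdict(set)
for q, urls in clicks.items():
    for u in urls:
        url_queries[u].add(q)

def recommend(src):
    """Strength of recommending other queries for `src` (asymmetric)."""
    strength = defaultdict(float)
    for u in clicks[src]:
        share = 1.0 / len(clicks[src])      # src spreads resource to its URLs
        for q in url_queries[u]:
            if q != src:
                strength[q] += share / len(url_queries[u])  # URL spreads back
    return sorted(strength.items(), key=lambda kv: -kv[1])

print(recommend("python tutorial"))
print(recommend("python download"))  # note the asymmetric strengths
```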
| Efficient mining of frequent sequence generators | | BIBAK | Full-Text | 1051-1052 | |
| Chuancong Gao; Jianyong Wang; Yukai He; Lizhu Zhou | |||
| Sequential pattern mining has raised great interest in the data mining research
field in recent years. However, to the best of our knowledge, no existing work studies
the problem of frequent sequence generator mining. In this paper we present a
novel algorithm, FEAT (abbr. Frequent sEquence generATor miner), to perform
this task. Experimental results show that FEAT is more efficient than
traditional sequential pattern mining algorithms while generating a more concise
result set, and is very effective for classifying Web product reviews. Keywords: sequence, sequence generators, web mining | |||
| Efficiently querying rdf data in triple stores | | BIBAK | Full-Text | 1053-1054 | |
| Ying Yan; Chen Wang; Aoying Zhou; Weining Qian; Li Ma; Yue Pan | |||
| Efficiently querying RDF data is becoming an important factor in applying
Semantic Web technologies to real-world applications. In this context, many
efforts have been made to store and query RDF data in relational databases using
particular schemas. In this paper, we propose a new scheme to store, index, and
query RDF data in triple stores. The graph features of RDF data are taken into
consideration, which may help reduce the join costs on the vertical database
structure. We partition RDF triples into overlapping groups, store them in
a triple table with one extra column for group identity, and build a signature
tree to index them. Based on this infrastructure, a complex RDF query is
decomposed into multiple sub-queries that can be easily filtered into
RDF groups using the signature tree index, and is finally evaluated with a
composed and optimized SQL query with specific constraints. We compare the
performance of our method with prior art on typical queries over large-scale
LUBM and UOBM benchmark data (more than 10 million triples). In some extreme
cases, our method improves performance by 3 to 4 orders of magnitude. Keywords: graph partitioning, indexing, rdf, signature | |||
| Collaborative knowledge semantic graph image search | | BIBAK | Full-Text | 1055-1056 | |
| Jyh-Ren Shieh; Yang-Ting Yeh; Chih-Hung Lin; Ching-Yung Lin; Ja-Ling Wu | |||
| In this paper, we propose a Collaborative Knowledge Semantic Graphs Image
Search (CKSGIS) system. It provides a novel way to conduct image search by
utilizing the collaborative nature of Wikipedia and by performing network
analysis to form semantic graphs for search-term expansion. The collaborative
article editing process used by Wikipedia's contributors is formalized as
bipartite graphs that are folded into networks between terms. When a user types
in a search term, CKSGIS automatically retrieves an interactive semantic graph
of related terms that allows users to easily find related images not limited to
a specific search term. The interactive semantic graph then serves as an interface
to retrieve images through existing commercial search engines. This method
significantly saves users' time by avoiding multiple search keywords that are
usually required in generic search engines. It benefits both naïve users
who do not possess a large vocabulary and professionals who look for images on
a regular basis. In our experiments, 85% of the participants favored the CKSGIS
system over commercial search engines. Keywords: keyword expansion, re-ranking, social network | |||
| A logical framework for modeling and reasoning about semantic web services contract | | BIBAK | Full-Text | 1057-1058 | |
| Hai Liu; Qing Li; Naijie Gu; An Liu | |||
| In this paper, we incorporate concrete domains and action theory into a very
expressive Description Logic (DL), called ALCQO. Notably, this extension can
significantly augment the expressive power for modeling and reasoning about
dynamic aspects of services contracting. Meanwhile, the original nature and
advantages of classical DLs are also preserved to the extent possible. Keywords: dls, semantic web services, services contract | |||
| Dissemination of heterogeneous xml data | | BIBAK | Full-Text | 1059-1060 | |
| Yuan Ni; Chee-Yong Chan | |||
| A lot of recent research has focused on the content-based dissemination of
XML data. However, due to the heterogeneous data schemas used by different data
publishers even for data in the same domain, an important challenge is how to
efficiently and effectively disseminate relevant data to subscribers whose
subscriptions might be specified based on schemas that are different from those
used by the data publishers. This paper examines the options to resolve this
schema heterogeneity problem in XML data dissemination, and proposes a novel
paradigm that is based on data rewriting. Our experimental results demonstrate
the effectiveness of the data rewriting paradigm and identify the tradeoffs
of the various approaches. Keywords: data rewriting, dissemination, heterogeneous, xml | |||
| Sailer: an effective search engine for unified retrieval of heterogeneous xml and web documents | | BIBAK | Full-Text | 1061-1062 | |
| Guoliang Li; Jianhua Feng; Jianyong Wang; Xiaoming Song; Lizhu Zhou | |||
| This paper studies the problem of unified ranked retrieval of heterogeneous
XML documents and Web data. We propose an effective search engine called Sailer
to adaptively and versatilely answer keyword queries over the heterogenous
data. We model the Web pages and XML documents as graphs. We propose the
concept of pivotal trees to effectively answer keyword queries and present an
effective method to identify the top-k pivotal trees with the highest ranks
from the graphs. Moreover, we propose effective indexes to facilitate the
effective unified ranked retrieval. We have conducted an extensive experimental
study using real datasets, and the experimental results show that Sailer
achieves both high search efficiency and accuracy, and outperforms the existing
approaches significantly. Keywords: keyword search, unified keyword search, web pages, xml | |||
| Personalized tag suggestion for flickr | | BIBAK | Full-Text | 1063-1064 | |
| Nikhil Garg; Ingmar Weber | |||
| We present a system for personalized tag suggestion for Flickr: While the
user is entering/selecting new tags for a particular picture, the system
suggests related tags to her, based on the tags that she or other people have
used in the past along with (some of) the tags already entered. The suggested
tags are dynamically updated with every additional tag entered/selected. We
describe three algorithms which can be applied to this problem. In experiments,
our best-performing method yields an improvement in precision of 10-15% over a
baseline method very similar to the system currently used by Flickr. Our system
is accessible at http://ltaa5.epfl.ch/flickr-tags/.
To the best of our knowledge, this is the first study on tag suggestion in a setting where (i) no full text information is available, such as for blogs, (ii) no item has been tagged by more than one person, such as for social bookmarking sites, and (iii) suggestions are dynamically updated, requiring efficient yet effective algorithms. Keywords: experiments, flickr, tag suggestions, tagging systems | |||
| Larger is better: seed selection in link-based anti-spamming algorithms | | BIBAK | Full-Text | 1065-1066 | |
| Qiancheng Jiang; Lei Zhang; Yizhen Zhu; Yan Zhang | |||
| Seed selection is of significant importance for the biased PageRank
algorithms such as TrustRank to combat link spamming. Previous work usually
uses a small seed set, which causes the top-ranked results to be strongly
biased towards the seeds. In this paper, we analyze the relationship
between the result bias and the number of seeds. Furthermore, we experimentally
show that an automatically selected large seed set can work better than a
carefully selected small seed set. Keywords: biased pagerank, link spamming, seed selection | |||
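For readers unfamiliar with biased PageRank, the following minimal sketch shows the mechanism TrustRank-style algorithms share: the teleport vector is concentrated on a seed set rather than uniform, so trust mass flows out from the seeds. The graph, seed set, damping factor, and iteration count are illustrative assumptions.

```python
import numpy as np

def biased_pagerank(adj, seeds, damping=0.85, iters=100):
    """adj[i, j] = 1 if page i links to page j; teleport mass goes to seeds."""
    n = len(adj)
    out = adj.sum(axis=1)
    d = np.zeros(n)
    d[list(seeds)] = 1.0 / len(seeds)  # teleport only to seed pages
    r = d.copy()
    for _ in range(iters):
        spread = np.zeros(n)
        for i in range(n):
            if out[i] > 0:
                spread += r[i] * adj[i] / out[i]
            else:               # dangling page: return its mass to the seeds
                spread += r[i] * d
        r = (1 - damping) * d + damping * spread
    return r

adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
print(biased_pagerank(adj, seeds={0, 1}))  # pages near the seeds score highest
```

Enlarging the seed set simply spreads the teleport mass over more pages, which is the lever the paper studies.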
| Using subspace analysis for event detection from web click-through data | | BIBAK | Full-Text | 1067-1068 | |
| Ling Chen; Yiqun Hu; Wolfgang Nejdl | |||
| Although most existing research detects events by analyzing the
content or structural information of Web documents, a recent direction is to
study the usage data. In this paper, we focus on detecting events from Web
click-through data generated by Web search engines. We propose a novel approach
which effectively detects events from click-through data based on robust
subspace analysis. We first transform click-through data to the 2D polar space.
Next, an algorithm based on Generalized Principal Component Analysis (GPCA) is
used to estimate subspaces of transformed data such that each subspace contains
query sessions of similar topics. Then, we prune uninteresting subspaces which
do not contain query sessions corresponding to real events by considering both
the semantic certainty and the temporal certainty of query sessions in each
subspace. Finally, various events are detected from interesting subspaces by
utilizing a nonparametric clustering technique. Our experimental
results on real-life click-through data show that, compared with existing
approaches, the proposed approach is more accurate in detecting real events and
more effective in determining the number of events. Keywords: click-through data, event detection, gpca, subspace estimation | |||
| A domain-specific language for the model-driven construction of advanced web-based dialogs | | BIBAK | Full-Text | 1069-1070 | |
| Patrick Freudenstein; Martin Nussbaumer; Florian Allerding; Martin Gaedke | |||
| Complex dialogs with comprehensive underlying data models are gaining
increasing importance in today's Web applications. This in turn increases the
need for highly dynamic dialogs that offer guidance to users and thus
reduce cognitive overload. Beyond that, requirements from the fields of
aesthetics, Web accessibility, platform-independence, and Web service
integration arise. To this end, we present an evolutionary, extensible approach
for the model-driven construction of advanced dialogs. It is based on a
Domain-specific Language (DSL) focusing on simplicity and fostering
collaboration with stakeholders. Keywords: dsl, model-driven, user interaction, web engineering | |||
| Web people search: results of the first evaluation and the plan for the second | | BIBAK | Full-Text | 1071-1072 | |
| Javier Artiles; Satoshi Sekine; Julio Gonzalo | |||
| This paper presents the motivation, resources and results for the first Web
People Search task, which was organized as part of the SemEval-2007 evaluation
exercise. We also describe a survey and a proposal for a new task,
"attribute extraction", which is planned for inclusion in the second
evaluation, scheduled for autumn 2008. Keywords: attributes of people, disambiguation, information extraction, person names | |||
| Context-based page unit recommendation for web-based sensemaking tasks | | BIBAK | Full-Text | 1073-1074 | |
| Wen-Huang Cheng; David Gotz | |||
| Sensemaking tasks require users to perform complex research behaviors to
gather and comprehend information from many sources. Such tasks are common and
include, for example, researching vacation destinations or deciding how to
invest. In this paper, we present an algorithm and interface that provides
context-based page unit recommendation to assist in connection discovery during
sensemaking tasks. We exploit the natural note-taking activity common to
sensemaking behavior as the basis for a task-specific context model. Each web
page visited by a user is dynamically analyzed to determine the most relevant
content fragments which are then recommended to the user. Our initial
evaluations indicate that our approach improves user performance. Keywords: recommendation, search, sensemaking, www | |||
| Web page rank prediction with markov models | | BIBAK | Full-Text | 1075-1076 | |
| Michalis Vazirgiannis; Dimitris Drosos; Pierre Senellart; Akrivi Vlachou | |||
| In this paper we propose a method for predicting the ranking position of a
Web page. Assuming a set of successive past top-k rankings, we study the
evolution of Web pages in terms of ranking trend sequences that are used to
train Markov models, which in turn predict future rankings. The
predictions are highly accurate for all experimental setups and similarity
measures. Keywords: markov models, ranking prediction | |||
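A minimal sketch of the trend-sequence idea described above, under illustrative assumptions: past rank trajectories are encoded as up/down/stable symbols, a first-order Markov model is trained on the transitions, and the most likely next trend is predicted. The symbol alphabet and toy rankings are not from the paper.

```python
from collections import Counter, defaultdict

def to_trends(ranks):
    # Lower rank number = better position, so a drop in number is "U"p.
    return ["U" if b < a else "D" if b > a else "S"
            for a, b in zip(ranks, ranks[1:])]

def train(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def predict(counts, last):
    nxt = counts[last]
    return nxt.most_common(1)[0][0] if nxt else None

# Past top-k positions of some pages over successive snapshots (toy data).
histories = [[5, 4, 4, 3], [10, 8, 8, 9], [2, 2, 1, 1]]
model = train([to_trends(h) for h in histories])
print(predict(model, "U"))  # most likely trend after an upward move: 'S'
```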
| Ajax for mobility: mobileweaver ajax framework | | BIBAK | Full-Text | 1077-1078 | |
| Louenas Hamdi; Huaigu Wu; Serhan Dagtas; Abdel Benharref | |||
| This paper provides a brief description of a research project on using Ajax
to enhance mobile Web applications for enterprise use. The project, known as
MobileWeaver Ajax Framework, leverages enterprise SOA (Service Oriented
Architecture) and the latest web technologies on mobile devices. Keywords: ajax, enterprise soa, mobility, rest, soap, web 2.0 | |||
| Cm-pmi: improved web-based association measure with contextual label matching | | BIBAK | Full-Text | 1079-1080 | |
| Xiaojun Wan | |||
| WebPMI is a popular web-based association measure to evaluate the semantic
similarity between two queries (i.e. words or entities) by leveraging search
results returned by search engines. This paper proposes a novel measure named
CM-PMI to evaluate query similarity at a finer granularity than WebPMI, under
the assumption that a query is usually associated with more than one aspect and
two queries are deemed semantically related if their associated aspect sets are
highly consistent with each other. CM-PMI first extracts contextual labels from
search results to represent the aspects of a query, and then uses the optimal
matching method to assess the consistency between the aspects of two queries.
Experimental results on the benchmark Miller-Charles dataset demonstrate the
effectiveness of the proposed CM-PMI measure. Moreover, we further fuse
WebPMI and CM-PMI to obtain improved results. Keywords: association measure, cm-pmi, web mining | |||
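The web-based PMI that both WebPMI and CM-PMI build on reduces to a one-line formula over search-engine page counts. The sketch below is a hedged illustration: the hit counts are made up, `N` (the assumed index size) is a stand-in constant, and real systems obtain these counts by querying a search engine.

```python
import math

N = 10_000_000_000  # assumed total number of indexed pages

def web_pmi(hits_p, hits_q, hits_pq, n=N):
    """Pointwise mutual information from page counts:
    log2( P(p, q) / (P(p) * P(q)) ) with probabilities hits / n."""
    if hits_pq == 0:
        return 0.0
    return math.log2((hits_pq / n) / ((hits_p / n) * (hits_q / n)))

# Toy counts: pages containing "car", "automobile", and both.
print(web_pmi(hits_p=1_500_000, hits_q=400_000, hits_pq=120_000))  # ~10.97
```

CM-PMI, per the abstract, applies this kind of association at the finer level of contextual labels (query aspects) and then optimally matches the two aspect sets.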
| Web user de-identification in personalization | | BIBAK | Full-Text | 1081-1082 | |
| Jiaqian Zheng; Jing Yao; Junyu Niu | |||
| It is a privacy infraction in personalized web services if the user
profile submitted to one web site is transferred to another site without the
user's permission. The second web site can then easily re-identify to whom
these personal data belong, whether the transfer is authorized or the result
of hacking.
This paper presents a portable solution for users to bind their sensitive web data to an appointed domain. Such data, including query logs, user accounts, click streams, etc., could be used to identify sensitive information about a particular user. With our domain stretching de-identification method, if personal data leak from domain A to B, the web user still cannot be identified even if he logs in to sites under domain B using the same name and password. In an experiment implemented in JavaScript, we show the flexibility and efficiency of our de-identification approach. Keywords: de-identification, domain, user profile | |||
| A systematic approach for cell-phone worm containment | | BIBAK | Full-Text | 1083-1084 | |
| Liang Xie; Hui Song; Trent Jaeger; Sencun Zhu | |||
| Cell phones are increasingly becoming attractive targets of various worms,
which cause leakage of private user data, extra service charges and depletion of
battery power. In this work, we study propagation of cell-phone worms, which
exploit Multimedia Messaging Service (MMS) and/or Bluetooth for spreading. We
then propose a systematic countermeasure against the worms. At the terminal
level, we adopt graphic Turing tests and identity-based signatures to block
unauthorized messages from leaving compromised phones; at the network level, we
propose a push-based automated patching scheme for cleansing compromised
phones. Through experiments on phone devices and a wide variety of networks, we
show that cellular systems taking advantage of our defense can achieve a low
infection rate (e.g., less than 3% within 30 hours) even under severe attacks. Keywords: cell phone, patch, turing test | |||
| Information retrieval and knowledge discovery on the semantic web of traditional Chinese medicine | | BIBAK | Full-Text | 1085-1086 | |
| Zhaohui Wu; Tong Yu; Huajun Chen; Xiaohong Jiang; Yi Feng; Yuxin Mao; Heng Wang; Jingming Tang; Chunying Zhou | |||
| We conduct the first systematic adoption of the Semantic Web solution in
the integration, management, and utilization of TCM information and knowledge
resources. As a result, the largest TCM Semantic Web ontology is engineered
as the uniform knowledge representation mechanism; the ontology-based query and
search engine is deployed, mapping legacy and heterogeneous relational
databases to the Semantic Web layer for query and search across database
boundaries; the first global herb-drug interaction network is mapped through
semantic integration, and the semantic graph mining methodology is implemented
for discovering and interpreting interesting patterns from this network. The
platform and underlying methodology have proved effective in TCM-related drug
usage, discovery, and safety analysis. Keywords: information retrieval, knowledge discovery, semantic web | |||
| Topigraphy: visualization for large-scale tag clouds | | BIBAK | Full-Text | 1087-1088 | |
| Ko Fujimura; Shigeru Fujimura; Tatsushi Matsubayashi; Takeshi Yamada; Hidenori Okuda | |||
| This paper proposes a new method for displaying large-scale tag clouds. We
use a topographical image as a background to the tag clouds, helping users
intuitively grasp the relationships among tags. We apply this interface to a
blog navigation system and show that the proposed method enables users to find
the desired tags easily even in very large tag clouds of 5,000 tags and above.
Our approach is also effective for understanding the overall structure of
a large amount of tagged documents. Keywords: blog, clustering, search, tag clouds, topigraphy, topography | |||
| Gsp-exr: gsp protocol with an exclusive right for keyword auctions | | BIBAK | Full-Text | 1089-1090 | |
| Yuko Sakurai; Atsushi Iwasaki; Yasumasa Saito; Makoto Yokoo | |||
| We propose a keyword auction protocol called the Generalized Second Price
with an Exclusive Right (GSP-ExR). In existing keyword auctions, the number of
displayed advertisements is determined in advance. Thus, we consider adjusting
the number of advertisements dynamically based on bids. In the GSP-ExR, the
number of slots can be either 1 or K. When K slots are displayed, the protocol
is identical to the GSP. If the value per click of the highest ranked bidder is
large enough, then this bidder can exclusively display her advertisement by
paying a premium. Thus, this pricing scheme is relatively simple and seller
revenue is at least as good as in the GSP. Also, in the GSP-ExR, the highest
ranked bidder has no incentive to change the number of slots by
over/under-bidding as long as she retains the top position. Keywords: electronic commerce, keyword auction, mechanism design, web advertising | |||
| Finding similar pages in a social tagging repository | | BIBAK | Full-Text | 1091-1092 | |
| Alex Penev; Raymond K. Wong | |||
| Social tagging describes a community of users labeling web content with
tags. It is a simple activity that enriches our knowledge about resources on
the web. For a computer to help users search the tagged repository, it must
know when tags are good or bad. We describe TagScore, a scoring function that
rates the goodness of tags. The tags and their ratings give us a succinct
synopsis for a page. We 'find similar' pages in Del.icio.us by comparing
synopses. Our approach gives good correlation to the full cosine similarity but
is hundreds of times faster. Keywords: del.icio.us, social bookmarking, tagging, web | |||
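A minimal sketch of the synopsis-comparison idea above. Tag "goodness" is approximated here by raw frequency as a stand-in for TagScore (whose actual scoring function is defined in the paper); each page keeps only its top-m tags, and pages are compared by cosine similarity over these small sparse vectors.

```python
import math
from collections import Counter

def synopsis(tag_counts, m=3):
    """Keep only the m best-rated tags as a small dict."""
    return dict(Counter(tag_counts).most_common(m))

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

page_a = {"python": 12, "tutorial": 9, "code": 7, "blog": 1}
page_b = {"python": 20, "code": 10, "snake": 2, "zoo": 1}
print(cosine(synopsis(page_a), synopsis(page_b)))  # ~0.83 on the toy data
```

The speedup reported in the abstract comes from comparing these fixed-size synopses instead of full tag vectors.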
| Folksoviz: a subsumption-based folksonomy visualization using wikipedia texts | | BIBAK | Full-Text | 1093-1094 | |
| Kangpyo Lee; Hyunwoo Kim; Chungsu Jang; Hyoung-Joo Kim | |||
| In this paper, targeting del.icio.us tag data, we propose a method,
FolksoViz, for deriving subsumption relationships between tags by using
Wikipedia texts, and visualizing a folksonomy. To realize this method, we
propose a statistical model for deriving subsumption relationships based on the
frequency of each tag in the Wikipedia texts, as well as a TSD (Tag Sense
Disambiguation) method for mapping each tag to a corresponding Wikipedia text.
The derived subsumption pairs are visualized effectively on the screen. The
experiment shows that FolksoViz finds the correct subsumption
pairs with high accuracy. Keywords: collaborative tagging, folksonomy, subsumption, visualization, web 2.0,
wikipedia | |||
| Size matters: word count as a measure of quality on wikipedia | | BIBAK | Full-Text | 1095-1096 | |
| Joshua E. Blumenstock | |||
| Wikipedia, "the free encyclopedia", now contains over two million English
articles, and is widely regarded as a high-quality, authoritative encyclopedia.
Some Wikipedia articles, however, are of questionable quality, and it is not
always apparent to the visitor which articles are good and which are bad. We
propose a simple metric -- word count -- for measuring article quality. In
spite of its striking simplicity, we show that this metric significantly
outperforms the more complex methods described in related work. Keywords: information quality, wikipedia, word count | |||
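The metric is simple enough to state in full. The sketch below ranks articles by raw word count; the quality threshold is an illustrative assumption, not a value from the paper.

```python
def word_count(article_text):
    return len(article_text.split())

def looks_high_quality(article_text, threshold=2000):
    """Longer articles tend to be rated higher; the cutoff here is made up."""
    return word_count(article_text) >= threshold

stub = "Apple Inc. is a company. " * 10
print(word_count(stub), looks_high_quality(stub))  # 50 False
```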
| Understanding internet video sharing site workload: a view from data center design | | BIBAK | Full-Text | 1097-1098 | |
| Xiaozhu Kang; Hui Zhang; Guofei Jiang; Haifeng Chen; Xiaoqiao Meng; Kenji Yoshihira | |||
| In this paper we measured and analyzed the workload on Yahoo! Video, the 2nd
largest U.S. video sharing site, to understand its nature and the impact on
online video data center design. We discovered interesting statistical
properties on both static and temporal dimensions of the workload including
file duration and popularity distributions, arrival rate dynamics and
predictability, and workload stationarity and burstiness. Complemented with
queueing-theoretic techniques, we further extended our understanding of the
measurement data with a virtual design of the workload and capacity management
components of a data center assuming the same workload as measured, which
reveals key results regarding the impact of Service Level Agreements (SLAs) and
workload scheduling schemes on the design and operations of such large-scale
video distribution systems. Keywords: capacity planning, data center design, measurement, online video, queueing
model, sla, workload management | |||
| Representing a web page as sets of named entities of multiple types: a model and some preliminary applications | | BIBAK | Full-Text | 1099-1100 | |
| Nan Di; Conglei Yao; Mengcheng Duan; Jonathan Zhu; Xiaoming Li | |||
| As opposed to representing a document as a "bag of words" in most
information retrieval applications, we propose a model of representing a web
page as sets of named entities of multiple types. Specifically, four types of
named entities are extracted, namely person, geographic location, organization,
and time. Moreover, the relations among these entities are also extracted,
weighted, classified and marked by labels. On top of this model, some
interesting applications are demonstrated. In particular, we introduce a notion
of person-activity, which contains four different elements: person, location,
time and activity. With this notion and based on a reasonably large set of web
pages, we are able to show how one person's activities can be characterized by
time and location, which gives a good picture of the mobility of the person in
question. Keywords: named entity, web content mining, web page model | |||
| Falcons: searching and browsing entities on the semantic web | | BIBAK | Full-Text | 1101-1102 | |
| Gong Cheng; Weiyi Ge; Yuzhong Qu | |||
| As of today, the amount of data on the Semantic Web has grown considerably.
The services for searching and browsing entities on the Semantic Web are in
demand. To provide such services, we developed the Falcons system. In this
poster, we present the features of the Falcons system. Keywords: indexing, search engine, semantic web, summarization | |||
| Influencers and their barriers to technology | | BIBAK | Full-Text | 1103-1104 | |
| Ann Hsieh; Todd Hausman; Nerija Titus | |||
| Modern Day Marco Polos, young influencers in emerging markets who have more
access to technology than their peers, often act as the gateway to new websites
and technology in their respective countries. However, as they influence their
peers through account gifting, they are often met with barriers such as
language. Keywords: emerging markets, ethnography, international | |||
| A semantic layer for publishing and localizing xml data for a p2p xquery mediator | | BIBAK | Full-Text | 1105-1106 | |
| Florin Dragan; Georges Gardarin; Laurent Yeh | |||
| In this poster, we present the P2P XML data localization layer of XLive, an
XQuery mediation system developed at University of Versailles [2]. A major
challenge in the evaluation of XQuery over a P2P network in the context of
multiple XML sources is to abstract from structural heterogeneity. Most
existing approaches have mainly exploited varied types of 1:1 semantic mappings
between peer schemas and/or ontologies. Complex query algorithms are required
to exploit these semantic links between peers. In contrast, our approach
focuses on a simple semantic layer. It is built on a Chord [4] DHT for indexing
semantic descriptions of mediated data sources. A mapping process transforms an
XML view of a mediator (each peer is equipped with a mediator) to a semantic
form that is published into the P2P network and stored in the DHT. For a given
user query, a search in the DHT retrieves all relevant data sources able to
contribute to the result elaboration. Then, every peer that contains relevant
data is directly queried to collect data for elaborating the final answer. Keywords: p2p, semantic mediation, xquery | |||
| Mining for personal name aliases on the web | | BIBAK | Full-Text | 1107-1108 | |
| Danushka Bollegala; Taiki Honma; Yutaka Matsuo; Mitsuru Ishizuka | |||
| We propose a novel approach to find aliases of a given name from the web. We
exploit a set of known names and their aliases as training data and extract
lexical patterns that convey information related to aliases of names from text
snippets returned by a web search engine. The patterns are then used to find
candidate aliases of a given name. We use anchor texts and hyperlinks to design
a word co-occurrence model and define numerous ranking scores to evaluate the
association between a name and its candidate aliases. The proposed method
outperforms numerous baselines and previous work on alias extraction on a
dataset of personal names, achieving a statistically significant mean
reciprocal rank of 0.6718. Moreover, the aliases extracted using the proposed
method improve recall by 20% in a relation-detection task. Keywords: name alias extraction, semantic web, web mining | |||
| Application of bitmap index to information retrieval | | BIBAK | Full-Text | 1109-1110 | |
| Kengo Fujioka; Yukio Uematsu; Makoto Onizuka | |||
| We developed the HS-bitmap index for efficient information retrieval. The
HS-bitmap index is a hierarchical document-term matrix: the original
document-term matrix is called the leaf matrix and an upper matrix is the
summary of its lower matrix. Our experimental results show that the HS-bitmap index
performs better than the inverted index with a minor space overhead. Keywords: bitmap index, information retrieval | |||
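A minimal sketch of a two-level hierarchical bitmap in the spirit of the index described above: a summary matrix ORs together blocks of rows of the leaf document-term matrix, so a conjunctive query can skip whole blocks whose summary bits are not all set. Block size and data are illustrative assumptions, not the HS-bitmap design parameters.

```python
import numpy as np

leaf = np.array([  # rows = documents, cols = terms (True = term occurs)
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
], dtype=bool)
BLOCK = 2
summary = np.array([leaf[i:i + BLOCK].any(axis=0)
                    for i in range(0, len(leaf), BLOCK)])

def search(term_ids):
    """Documents containing all query terms; blocks pruned via the summary."""
    hits = []
    for b, srow in enumerate(summary):
        if not all(srow[t] for t in term_ids):
            continue  # no document in this block can match all terms
        for i in range(b * BLOCK, min((b + 1) * BLOCK, len(leaf))):
            if all(leaf[i, t] for t in term_ids):
                hits.append(i)
    return hits

print(search([0, 2]))  # documents containing terms 0 and 2 -> [4]
```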
| Pivotbrowser: a tag-space image searching prototype | | BIBAK | Full-Text | 1111-1112 | |
| Xiaoyan Li; Lidan Shou; Gang Chen; Xiaolong Zhang; Tianlei Hu; Jinxiang Dong | |||
| We propose a novel iterative searching and refining prototype for tagged
images. This prototype, named PivotBrowser, captures semantically similar tag
sets in a structure called a pivot. By constructing a pivot for a textual query,
PivotBrowser first selects candidate images possibly relevant to the query. The
tags contained in these candidate images are then selected according to their
relevance to the pivot. The shortlisted tags are clustered and one of the
tag clusters is used to select the results from the candidate images. Ranking
of the images in each partition is based on their relevance to the tag cluster.
With the guidance of the tag clusters presented, a user is able to perform
searching and iterative query refinement. Keywords: ambiguity, inconsistency, relevance, tag | |||
| Measuring extremal dependencies in web graphs | | BIBAK | Full-Text | 1113-1114 | |
| Yana Volkovich; Nelly Litvak; Bert Zwart | |||
| We analyze dependencies in power law graph data (Web sample, Wikipedia
sample and a preferential attachment graph) using statistical inference for
multivariate regular variation. The well-developed theory of regular variation
is widely applied in extreme value theory, telecommunications and mathematical
finance, and it provides a natural mathematical formalism for analyzing
dependencies between variables with power laws. However, most of the proposed
methods have never been used in Web graph data mining. The present work
fills this gap. The new insights this yields are striking: the three
above-mentioned data sets are shown to have a totally different dependence
structure between different graph parameters, such as in-degree and PageRank. Keywords: pagerank, preferential attachment, regular variation, web, wikipedia | |||
| Determining user's interest in real time | | BIBAK | Full-Text | 1115-1116 | |
| Ranbir S. Sanasam; Hema A. Murthy; Timothy A. Gonsalves | |||
| Most search engine optimization techniques attempt to predict users'
interest by learning from past information collected from different
sources. However, a user's current interest often depends on many factors that
are not captured in that past information. In this paper, we attempt to identify
a user's current interest in real time from the information provided by the user
in the current query session. By identifying the user's interest in real time, the
engine could adapt differently to different users in real time. Experimental
verification indicates that our approach is encouraging for short queries. Keywords: real time, search engine, users clicks, users' interest | |||
| How to influence my customers?: the impact of electronic market design | | BIBAK | Full-Text | 1117-1118 | |
| Nan Hu; Ling Liu; Bin Chen; Jialie Shen | |||
| This paper investigates the strategic decisions of online vendors in
offering different mechanisms, such as sampling and online reviews of
information products, to increase their online sales. Focusing on measuring the
effectiveness of electronic market design (offering reviews, sampling, or
both), our study shows that online markets behave as communication markets,
and consumers learn product quality information both passively (reading online
reviews) and actively but subjectively (listening to music samples). Using
data from Amazon, we first show that sampling alone is a strong product quality
signal that reduces product uncertainty after controlling for the halo effect.
In general, products with a sampling option enjoy a higher conversion rate (which
leads to better sales) than those without sampling, because sampling decreases
the uncertainty of consuming experience goods. Second, the impact of online
reviews on the sales conversion rate is lower for experience goods with a sampling
option than for those without. Third, when the uncertainty of the societal reviews
is higher, sampling plays a more important role because it mitigates the
uncertainty introduced by online reviews. Keywords: conversion rate, electronic market design, halo effect, online product
reviews, sampling | |||
| Improving web spam detection with re-extracted features | | BIBAK | Full-Text | 1119-1120 | |
| Guanggang Geng; Chunheng Wang; Qiudan Li | |||
| Web spam detection has become one of the top challenges for the Internet
search industry. Instead of using some heuristic rules, we propose a feature
re-extraction strategy to optimize the detection result. Based on the predicted
spamicity obtained in a preliminary detection pass, three types of features are
extracted through the host-level web graph. Experiments on WEBSPAM-UK2006
benchmark show that with this strategy, the performance of web spam detection
can be improved markedly. Keywords: content spam, link spam, machine learning, web spam | |||
| The world wide telecom web browser | | BIBAK | Full-Text | 1121-1122 | |
| Sheetal Agarwal; Arun Kumar; Amit Anil Nanavati; Nitendra Rajput | |||
| As the number of telephony voice applications grow, there will be a need for
a browser to surf the Web of interconnected voice applications (called
VoiceSites). These VoiceSites are accessed through a telephone over an audio
channel. We present the concept and architecture of T-Web Browser, a World Wide
Telecom Web browser that enables browsing the Web of voice applications through
an ordinary phone. This browser will support rich browsing features such as
history and bookmarking. Keywords: developing regions, hstp, voice browser, wwtw | |||
| VoiKiosk: increasing reachability of kiosks in developing regions | | BIBAK | Full-Text | 1123-1124 | |
| Sheetal Agarwal; Arun Kumar; Amit Anil Nanavati; Nitendra Rajput | |||
| One of the several initiatives to bridge the digital divide in developing
countries has been the deployment of information kiosks or knowledge centers in
villages in rural parts of the country. These kiosks provide services ranging
from email, chat and browsing to distance education programs, agricultural
services and eGovernance services. A kiosk typically comprises a computer
with printer, web cam, multimedia system and Internet connectivity and is owned
by a local entrepreneur. Moving away from the PC based kiosk model, we present
an alternative platform to create and host such information kiosks in the
telephony network. We call these VoiKiosks; they are accessible through
voice interaction over an ordinary phone call. Keywords: developing regions, kiosks, voigen | |||
| Semantic similarity based on compact concept ontology | | BIBAK | Full-Text | 1125-1126 | |
| Ce Zhang; Yu-Jing Wang; Bin Cui; Gao Cong | |||
| This paper presents a new method of calculating the semantic similarity
between two articles based on WordNet. To further improve the performance of
the proposed method, we build a new Compact Concept Ontology (CCO) from WordNet
by combining the words with similar semantic meanings. The experimental results
show that our approach significantly outperforms a recent proposal of computing
semantic similarity, and demonstrate the superiority of the proposed CCO
method. Keywords: concept, ontology, similarity | |||
| Layman tuning of websites: facing change resilience | | BIBAK | Full-Text | 1127-1128 | |
| Oscar Díaz; Cristóbal Arellano; Jon Iturrioz | |||
| Client scripting permits end users to customize content, layout or style of
their favourite websites. But current scripting suffers from a tight coupling
with the website. If the page changes, all the scripting can fall apart. The
problem is that websites are known to evolve frequently, and this can
jeopardize all the scripting efforts. To avoid this situation, this work
enriches websites with a "modding interface" in an attempt to decouple layman's
script from website upgrades. From the website viewpoint, this interface
ensures safe scripting, i.e. scripts that do not break the page. From a
scripter perspective, this interface limits tuning but increases change
resilience. The approach tries to find a balance between openness (free
inspection by scripters) and modularity (isolation of scripters from website
design decisions) that permits scripting to scale up as a mature software practice.
The approach is realized for Greasemonkey scripts. Keywords: personalization, semantic web, web 2.0 | |||
| Model bloggers' interests based on forgetting mechanism | | BIBAK | Full-Text | 1129-1130 | |
| Yuan Cheng; Guang Qiu; Jiajun Bu; Kangmiao Liu; Ye Han; Can Wang; Chun Chen | |||
| Blogs have expanded at an incredible speed in recent years. Plentiful
personal information makes blogs a popular source for mining user profiles. In this
paper, we propose a novel approach to modeling bloggers' interests based on a
forgetting mechanism. A new forgetting function is introduced to track interest
drift. Based on it, Short Term Interest Models (STIM) and Long Term
Interest Models (LTIM) are constructed to describe bloggers' short-term and
long-term interests. The experiments show that both models identify
bloggers' respective preferences well. Keywords: blog, interest forgetting, long term interests, short term interests | |||
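A minimal sketch of interest modeling with a forgetting function. Exponential decay is a common choice used here as a stand-in for the paper's own forgetting function, and the half-lives separating the short- and long-term models are illustrative assumptions.

```python
import math

def decayed_weight(days_ago, half_life):
    return math.exp(-math.log(2) * days_ago / half_life)

def interest_profile(posts, half_life):
    """posts: list of (topic, days_ago). Recent posts count for more."""
    profile = {}
    for topic, days_ago in posts:
        profile[topic] = profile.get(topic, 0.0) + decayed_weight(days_ago, half_life)
    return profile

posts = [("travel", 2), ("python", 40), ("python", 300), ("travel", 1)]
print("STIM:", interest_profile(posts, half_life=7))    # short-term interests
print("LTIM:", interest_profile(posts, half_life=180))  # long-term interests
```

With a short half-life only recent topics survive; with a long one, persistent topics dominate, mirroring the STIM/LTIM split.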
| Temporal views over rdf data | | BIBAK | Full-Text | 1131-1132 | |
| Geetha Manjunath; R. Badrinath; Craig Sayers; K. S. Venugopal | |||
| Supporting fast access to large RDF stores has been one of the key challenges
in enabling use of the Semantic Web in real-life applications, more so in
sensor-based systems where large amounts of historic data need to be stored.
We propose a semantics-based temporal view mechanism that enables faster access
to time-varying data by caching into memory only the required subset of RDF
triples. We describe our experience of implementing such a framework in the
context of a wide area network monitoring system. Our preliminary results show
that our solution significantly improves client access time and scales well for
moderate data sets. Keywords: owl, rdf, semantic web, sensors, time varying data, views | |||
| A teapot graph and its hierarchical structure of the Chinese web | | BIBAK | Full-Text | 1133-1134 | |
| Jonathan J. H. Zhu; Tao Meng; Zhengmao Xie; Geng Li; Xiaoming Li | |||
| The shape of the Web in terms of its graph structure has been a topic of
wide interest. Two graphs, Bow Tie and Daisy, have stood out from previous
research. In this work, we take a different approach, by viewing the Web as a
hierarchy of three levels, namely page level, host level, and domain level.
Such structures are analyzed and compared using a snapshot of the Chinese Web from
early 2006, involving 830 million pages, 17 million hosts, and 0.8 million
domains. Some interesting results have emerged. For example, the Chinese Web
appears more like a teapot (with a large size of SCC, a medium size of IN and a
small size of OUT) at page level than the classic bow tie or daisy shape. Some
challenging phenomena are also observed. For example, the INs become much
smaller than OUTs at host and domain levels. Future work will tackle these
puzzles. Keywords: bow tie graph, daisy graph, self similarity, teapot graph | |||
| Collaborative filtering on skewed datasets | | BIBAK | Full-Text | 1135-1136 | |
| Somnath Banerjee; Krishnan Ramanathan | |||
| Many real-life datasets have skewed distributions of events, where the
probability of observing a few events far exceeds that of the others. In this paper, we
observe that on skewed datasets the state-of-the-art collaborative filtering
methods perform worse than a simple probabilistic model. Our test bench
includes a real ad click stream dataset which is naturally skewed. The same
conclusion is obtained even from the popular movie rating dataset when we pose
a binary prediction problem of whether a user will give maximum rating to a
a binary prediction problem of whether a user will give the maximum rating to a movie or not. Keywords: collaborative filtering, plsa, skewed dataset | |||
| 2lip: the step towards the web3d | | BIBAK | Full-Text | 1137-1138 | |
| Jacek Jankowski; Sebastian Ryszard Kruk | |||
| The World Wide Web allows users to create and publish a variety of
resources, including multimedia ones. Most of the contemporary best practices
for designing web interfaces, however, do not take 3D techniques into
account.
In this paper we present a novel approach for designing interactive web applications -- 2-Layer Interface Paradigm (2LIP). The background layer of the 2LIP-type user interface is a 3D scene, which a user cannot directly interact with. The foreground layer is HTML content. Only taking an action on this content (e.g. pressing a hyperlink, scrolling a page) can affect the 3D scene. We introduce a reference implementation of 2LIP: Copernicus -- The Virtual 3D Encyclopedia, which shows one of the potential paths of the evolution of Wikipedia towards Web 3.0. Based on the evaluation of Copernicus we prove that designing web interfaces according to 2LIP provides users with a better browsing experience, without harming the interaction. Keywords: design, web3d, wikipedia, wpf | |||
| Protecting web services from remote exploit code: a static analysis approach | | BIBAK | Full-Text | 1139-1140 | |
| Xinran Wang; Yoon-chan Jhi; Sencun Zhu; Peng Liu | |||
| We propose STILL, a signature-free remote exploit binary code injection
attack blocker to protect web servers and web applications. STILL is robust to
almost all anti-signature, anti-static-analysis and anti-emulation obfuscation. Keywords: code injection attack, http, static analysis | |||
| Composing and optimizing data providing web services | | BIBAK | Full-Text | 1141-1142 | |
| Mahmoud Barhamgi; Djamal Benslimane; Aris M. Ouksel | |||
| In this paper, we propose a new approach to automatically compose data
providing Web services. Our approach exploits mature work done on
data integration systems. Specifically, data providing services are modeled as
RDF parameterized views over mediated ontologies. An RDF-oriented
algorithm is then devised to compose services based on query rewriting techniques.
We also apply an optimization algorithm to the generated composition to speed
up its execution. The results of our experiments show that the algorithms scale
very well up to a large number of services, thus covering most realistic
applications. Keywords: optimization, rdf, web service composition | |||
| Psst: a web-based system for tracking political statements | | BIBAK | Full-Text | 1143-1144 | |
| Samantha Kleinberg; Bud Mishra | |||
| Determining candidates' views on important issues is critical in deciding
whom to support and vote for; but finding their statements and votes on an
issue can be laborious. In this paper we present PSST, (Political Statement and
Support Tracker), a search engine to facilitate analysis of political
statements and votes over time. We show that prior tools for text analysis can
be combined with minimal manual processing to provide a first step in the full
automation of this process. Keywords: elections, politics, search | |||
| Exploiting semantic web technologies to model web form interactions | | BIBAK | Full-Text | 1145-1146 | |
| Wolfgang Holzinger; Bernhard Kruepl; Robert Baumgartner | |||
| Form mapping is the key problem that needs to be solved in order to get
access to the hidden web. Currently available solutions for fully automatic
mapping are not ready for commercial meta-search engines, which still have to
rely on hand-crafted code and are hard to maintain.
We believe that a thorough formal description of the problem with semantic web technologies provides a promising perspective to develop a new class of vertical search engines that is more robust and easier to maintain than existing solutions. In this paper, instead of trying to tackle the mapping problem, we model the interaction necessary to fill out a web form. First, during a user-assisted phase, the connection from the visible elements on the form to the domain concepts is established. Then, with help from background knowledge about the possible interaction steps, a plan for filling out the form is derived. Keywords: hidden web, web data extraction, web form mapping | |||
| Groupme! | | BIBAK | Full-Text | 1147-1148 | |
| Fabian Abel; Nicola Henze; Daniel Krause | |||
| This paper presents the GroupMe! system, a resource sharing system with
advanced tagging functionality. GroupMe! provides a novel user interface, which
enables users to organize and arrange arbitrary Web resources into groups. The
content of such groups can be overlooked and inspected immediately as resources
are visualized in a multimedia-based fashion. In this paper, we furthermore
introduce new folksonomy-based ranking strategies that exploit the group
structure shipped with GroupMe! folksonomies. Experiments show that those
strategies significantly improve the performance of such ranking algorithms. Keywords: folksonomies, groupme!, ranking, search, social media | |||
| Microscale evolution of web pages | | BIBAK | Full-Text | 1149-1150 | |
| Carrie Grimes | |||
| We track a large set of "rapidly" changing web pages and examine the
assumption that the arrival of content changes follows a Poisson process on a
microscale. We demonstrate that there are significant differences in the
behavior of pages that can be exploited to maintain freshness in a web corpus. Keywords: change detection, rate of change, web evolution | |||
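One simple way to probe the Poisson assumption discussed above: under a Poisson process, inter-change gaps are exponentially distributed, so their coefficient of variation should sit near 1. The timestamps below are toy data and the diagnostic is an illustrative assumption, not the paper's methodology.

```python
import statistics

def gap_cv(change_times):
    """Coefficient of variation of inter-change gaps (~1.0 for Poisson)."""
    gaps = [b - a for a, b in zip(change_times, change_times[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else float("nan")

regular = [0, 10, 20, 30, 40, 50]   # clock-like updates: CV near 0
bursty = [0, 1, 2, 3, 100, 101]     # bursts and silences: CV well above 1
print(gap_cv(regular), gap_cv(bursty))
```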
| Web page sectioning using regex-based template | | BIBAK | Full-Text | 1151-1152 | |
| Rupesh R. Mehta; Amit Madaan | |||
| This work aims to provide a novel, site-specific web page segmentation and
section importance detection algorithm, which leverages structural, content,
and visual information. The structural and content information is leveraged via
a template, a generalized regular expression learnt over a set of pages. The
template along with visual information results in high sectioning accuracy.
The experimental results demonstrate the effectiveness of the approach. Keywords: site-specific noise elimination, site-specific segmentation, tree-based
reg-ex | |||
| Towards a programming language for services computing | | BIBAK | Full-Text | 1153-1154 | |
| Arun Kumar; D. Janakiram | |||
| Services Computing is emerging as a new discipline. The acceptance of web
services technology stems from the fact that services enable easy integration
and interoperation of enterprise level distributed systems. However, currently
software developers are forced to translate business-level service requirements
and encode them into programs using low-level abstractions such as objects. We
propose to introduce language constructs for Service Oriented Programming that
would enable raising programming abstractions from objects to services. Keywords: object orientation, programming languages, semantic web services | |||
| Histrace: building a search engine of historical events | | BIBAK | Full-Text | 1155-1156 | |
| Lian'en Huang; Jonathan J. H. Zhu; Xiaoming Li | |||
| In this paper, we describe an experimental search engine on our Chinese web
archive since 2001. The original data set contains nearly 3 billion Chinese web
pages crawled over the past 5 years. From the collection, 430 million
"article-like" pages are selected and then partitioned into 68 million sets of
similar pages. The titles and publication dates are determined for the pages,
and an index is built. When searching, the system returns related pages in
chronological order. This way, a user interested in news reports or
commentaries on a certain past event can conveniently find
a rich set of highly related pages. Keywords: replica detection, text mining, web archive | |||
| Online change detection in individual web user behaviour | | BIBAK | Full-Text | 1157-1158 | |
| Peter I. Hofgesang; Jan Peter Patist | |||
| Based on our field studies and consultations with field experts, we
identified three main problems that are of key importance to online web
personalization and customer relationship management: 1) detecting changes in
individual behaviour, 2) reporting on user actions that may need special care,
and 3) detecting changes in visitation frequency.
We propose solutions to these problems and experiment on real-world data from an investment bank collected over 1.5 years of web traffic. These solutions can be applied on any domains where individuals tend to revisit the website and can be identified accurately. Keywords: change detection, incremental web mining, user profiling | |||
| Webanywhere: enabling a screen reading interface for the web on any computer | | BIBAK | Full-Text | 1159-1160 | |
| Jeffrey P. Bigham; Craig M. Prince; Richard E. Ladner | |||
| People often use computers other than their own to access web content, but
blind users are restricted to using computers equipped with expensive,
special-purpose screen reading programs that they use to access the web.
WebAnywhere is a web-based, self-voicing web application that enables blind web
users to access the web from almost any computer that can produce sound without
installing new software. WebAnywhere could serve as a convenient, low-cost
solution for blind users on-the-go, for blind users unable to afford another
screen reader and for web developers targeting accessible design. This paper
describes the implementation of WebAnywhere, overviews an evaluation of it by
blind web users, and summarizes a survey of public terminals that shows it can
run on most public computers. Keywords: blind users, screen reader, web accessibility | |||
| Guanxi in the Chinese web -- a study of mutual linking | | BIBAK | Full-Text | 1161-1162 | |
| Valerie King; Louis Lei Yu; Yan Zhuang | |||
| Guanxi is a type of dyadic social interaction based on feelings ("qing") and
trust ("xin"). Long studied by scholars of Chinese origin, it has recently
drawn the attention of researchers outside of China. We define the concept of
guanxi as applied to the interaction between web sites. We explore methods to
identify guanxi in the Chinese web, show the unique characteristics of the
Chinese web which result from it, and introduce a mechanism for simulating
guanxi in a web graph model. Keywords: social networks, stochastic graph modeling, web graph, web measurement | |||
| Towards robust trust establishment in web-based social networks with socialtrust | | BIBAK | Full-Text | 1163-1164 | |
| James Caverlee; Ling Liu; Steve Webb | |||
| We propose the SocialTrust framework for tamper-resilient trust
establishment in online social networks. Two of the salient features of
SocialTrust are its dynamic revision of trust by (i) distinguishing
relationship quality from trust; and (ii) incorporating a personalized feedback
mechanism for adapting as the social network evolves. Keywords: social networks, trust | |||
| Plurality: a context-aware personalized tagging system | | BIBAK | Full-Text | 1165-1166 | |
| Robert Graham; Brian Eoff; James Caverlee | |||
| We present the design of Plurality, an interactive tagging system.
Plurality's modular architecture allows users to automatically generate
high-quality tags over Web content, as well as over archival and personal
content typically beyond the reach of existing Web 2.0 social tagging systems.
Three of the salient features of Plurality are: (i) its self-learning and
feedback-sensitive capabilities based on a user's personalized tagging style;
(ii) its leveraging of the collective intelligence of existing social tagging
services; and (iii) its context-awareness for optimizing tag suggestions, e.g.,
based on spatial or temporal features. Keywords: context-sensitive, personalization, social annotation, tags | |||
| Web graph similarity for anomaly detection (poster) | | BIBAK | Full-Text | 1167-1168 | |
| Panagiotis Papadimitriou; Ali Dasdan; Hector Garcia-Molina | |||
| Web graphs are approximate snapshots of the web, created by search engines.
Their creation is an error-prone procedure that relies on the availability of
Internet nodes and the faultless operation of multiple software and hardware
units. Checking the validity of a web graph requires a notion of graph
similarity. Web graph similarity helps measure the amount and significance of
changes in consecutive web graphs. These measurements validate how well search
engines acquire content from the web. In this paper we study five similarity
schemes: three of them adapted from existing graph similarity measures and two
adapted from well-known document and vector similarity methods. We compare and
evaluate all five schemes using a sequence of web graphs for Yahoo! and study
whether the schemes can identify anomalies that may occur due to hardware or other
problems. Keywords: anomaly detection, graph similarity, web graph lsh | |||
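A minimal sketch of the vector-similarity flavor of graph comparison mentioned above: each snapshot is reduced to a feature vector and consecutive snapshots are compared with cosine similarity, flagging an anomaly when similarity drops. The features, toy graphs, and any alert threshold are illustrative assumptions, not the paper's five schemes.

```python
import math

def features(edges):
    """Tiny feature vector: node count, edge count, max and mean out-degree."""
    nodes = {u for e in edges for u in e}
    deg = {n: 0 for n in nodes}
    for u, _ in edges:
        deg[u] += 1
    degs = list(deg.values())
    return [len(nodes), len(edges), max(degs), sum(degs) / len(degs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

g_monday = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)]
g_tuesday = [(1, 2), (1, 3), (2, 3), (3, 4)]   # small drift: similarity ~0.99
g_broken = [(1, 2)]                            # crawler lost most of the graph
print(cosine(features(g_monday), features(g_tuesday)))
print(cosine(features(g_monday), features(g_broken)))  # noticeably lower
```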
| Static query result caching revisited | | BIBAK | Full-Text | 1169-1170 | |
| Rifat Ozcan; Ismail Sengor Altingovde; Özgür Ulusoy | |||
| Query result caching is an important mechanism for search engine efficiency.
In this study, we first review several query features that are used to
determine the contents of a static result cache. Next, we introduce a new
feature that more accurately represents the popularity of a query by measuring
the stability of query frequency over a set of time intervals. Experimental
results show that this new feature achieves hit ratios better than those of the
previously proposed features. Keywords: query result caching, search engine, static caching | |||
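A minimal sketch of the stability idea described above: a query's cache-worthiness reflects not just total frequency but how steady that frequency is across time intervals. The mean-over-deviation score below is an illustrative stand-in for the paper's exact feature.

```python
import statistics

def stability_score(interval_freqs):
    """Higher when the query is requested steadily across time intervals."""
    mean = statistics.mean(interval_freqs)
    sd = statistics.pstdev(interval_freqs)
    return mean / (1.0 + sd)

# Weekly frequencies of two queries with the same total volume.
freqs = {"steady": [100, 105, 95, 100],  # e.g. "weather"
         "flash": [390, 5, 3, 2]}        # e.g. a one-off news event

# Fill the static cache with the highest-scoring queries.
cache = sorted(freqs, key=lambda q: -stability_score(freqs[q]))[:1]
print(cache)  # ['steady'] despite equal total frequency
```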
| A larger scale study of robots.txt | | BIBAK | Full-Text | 1171-1172 | |
| Santanu Kolay; Paolo D'Alberto; Ali Dasdan; Arnab Bhattacharjee | |||
| A website can regulate search engine crawler access to its content using the
robots exclusion protocol, specified in its robots.txt file. The rules in the
protocol enable the site to allow or disallow part or all of its content to
certain crawlers, resulting in a favorable or unfavorable bias towards some of
them. A 2007 survey of the robots.txt usage of 7,593 sites found some
evidence of such biases, the news of which led to widespread discussions on the
web. In this paper, we report on our survey of about 6 million sites. Our
survey tries to correct the shortcomings of the previous survey and shows the
lack of any significant preferences towards any particular search engine. Keywords: crawler, robots exclusion, robots.txt, search engine | |||
| Defection detection: predicting search engine switching | | BIBAK | Full-Text | 1173-1174 | |
| Allison P. Heath; Ryen W. White | |||
| Searchers have a choice about which Web search engine they use when looking
for information online. If they are unsuccessful on one engine, users may
switch to a different engine to continue their search. By predicting when
switches are likely to occur, the search experience can be modified to retain
searchers or ensure a quality experience for incoming searchers. In this
poster, we present research on a technique for predicting search engine
switches. Our findings show that prediction is possible at a reasonable level
of accuracy, particularly when personalization or user grouping is employed.
These findings have implications for the design of applications to support more
effective online searching. Keywords: search engine switching | |||
| Algorithm for stochastic multiple-choice knapsack problem and application to keywords bidding | | BIBAK | Full-Text | 1175-1176 | |
| Yunhong Zhou; Victor Naroditskiy | |||
| We model budget-constrained keyword bidding in sponsored search auctions as
a stochastic multiple-choice knapsack problem (S-MCKP) and design an algorithm
to solve S-MCKP and the corresponding bidding optimization problem. Our
algorithm selects items online based on a threshold function which can be
built/updated using historical data. Our algorithm achieved about 99%
performance compared to the offline optimum when applied to a real bidding
dataset. With synthetic datasets and i.i.d. item-sets, its performance ratio
against the offline optimum converges to one empirically with an increasing number
of periods. Keywords: keyword bidding, multiple-choice knapsack problem, sponsored search auction,
stochastic optimization | |||
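A minimal sketch of threshold-based online selection for the bidding problem above: an arriving item (bidding opportunity) is taken only if its value-per-cost efficiency clears a threshold derived from historical data. The quantile-based threshold and toy numbers are illustrative assumptions about how such a function might be built.

```python
def build_threshold(history, quantile=0.7):
    """Efficiency level that only the best past items exceeded."""
    effs = sorted(v / c for v, c in history)
    return effs[int(quantile * (len(effs) - 1))]

def run_bidding(items, budget, threshold):
    taken, value = [], 0.0
    for v, c in items:  # items arrive online, one per period
        if c <= budget and v / c >= threshold:
            taken.append((v, c))
            budget -= c
            value += v
    return taken, value

history = [(8, 4), (3, 3), (10, 2), (4, 4), (6, 3)]   # past (value, cost) pairs
threshold = build_threshold(history)                  # -> 2.0 on this data
items = [(5, 5), (9, 3), (2, 1), (7, 2)]              # arriving opportunities
print(run_bidding(items, budget=6, threshold=threshold))
```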
| Simrank++: query rewriting through link analysis of the clickgraph (poster) | | BIBAK | Full-Text | 1177-1178 | |
| Ioannis Antonellis; Hector Garcia-Molina; Chi-Chao Chang | |||
| We focus on the problem of query rewriting for sponsored search. We base
rewrites on a historical click graph that records the ads that have been
clicked on in response to past user queries. Given a query q, we first consider
Simrank [2] as a way to identify queries similar to q, i.e., queries whose ads
a user may be interested in. We argue that Simrank fails to properly identify
query similarities in our application, and we present two enhanced versions of
Simrank: one that exploits weights on click graph edges and another that
exploits "evidence." We experimentally evaluate our new schemes against Simrank,
using actual click graphs and queries from Yahoo!, and using a variety of
metrics. Our results show that the enhanced methods can yield more and better
query rewrites. Keywords: click graph, link analysis, similarity metric, sponsored search | |||
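As context for the enhancements, the sketch below implements plain SimRank on a bipartite query-ad click graph, the baseline the paper starts from. The toy graph, decay constant C, and iteration count are conventional illustrative choices, not values from the paper.

```python
from itertools import product

C, ITERS = 0.8, 10
clicks = {  # query -> ads clicked for it
    "camera": {"ad1", "ad2"},
    "digital camera": {"ad1", "ad2"},
    "tv": {"ad2", "ad3"},
}
ads = {a for s in clicks.values() for a in s}
neighbors = dict(clicks)
for a in ads:  # ad -> queries that clicked it
    neighbors[a] = {q for q, s in clicks.items() if a in s}

nodes = list(neighbors)
sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
for _ in range(ITERS):
    new = {}
    for u, v in product(nodes, nodes):
        if u == v:
            new[(u, v)] = 1.0
        elif neighbors[u] and neighbors[v]:
            total = sum(sim[(x, y)] for x in neighbors[u] for y in neighbors[v])
            new[(u, v)] = C * total / (len(neighbors[u]) * len(neighbors[v]))
        else:
            new[(u, v)] = 0.0
    sim = new

print(round(sim[("camera", "digital camera")], 3))  # high: same ads clicked
print(round(sim[("camera", "tv")], 3))              # lower: one shared ad
```

Per the abstract, one enhanced variant exploits edge weights (click counts) and the other an "evidence" factor; both modify this basic recurrence.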
| An initial investigation on evaluating semantic web instance data | | BIBAK | Full-Text | 1179-1180 | |
| Li Ding; Jiao Tao; Deborah L. McGuinness | |||
| Many emerging semantic web applications include ontologies from one set of
authors and instance data from another (often much larger) set of authors.
Often ontologies are reused and instance data is integrated in manners
unanticipated by their authors. Not surprisingly, many instance data rich
applications encounter instance data that is not compatible with the
expectations of the original ontology author(s). This line of work focuses on
issues related to semantic expectation mismatches in instance data. Our initial
results include a customizable and extensible service-oriented evaluation
architecture, and a domain implementation called PmlValidator, which checks
instance data using the corresponding ontologies and additional style
requirements. Keywords: architecture, instance data evaluation, semantic web | |||
| Mining tag clouds and emoticons behind community feedback | | BIBAK | Full-Text | 1181-1182 | |
| Kavita A. Ganesan; Neelakantan Sundaresan; Harshal Deo | |||
| In this paper we describe our mining system which automatically mines tags
from feedback text in an eCommerce scenario. It renders these tags in a
visually appealing manner. Further, emoticons are attached to mined tags to add
sentiment to the visual aspect. Keywords: algorithms, experimentation, feedback, measurement, search, web 2.0 | |||
| Investigation of partial query proximity in web search | | BIBAK | Full-Text | 1183-1184 | |
| Jing Bai; Yi Chang; Hang Cui; Zhaohui Zheng; Gordon Sun; Xin Li | |||
| Proximity of query terms in a document is an important criterion in IR.
However, no investigation has been made to determine the most useful term
sequences for which proximity should be considered. In this study, we test the
effectiveness of using proximity of partial term sequences (n-grams) for Web
search. We observe that the proximity of sequences of 3 to 5 terms is most
effective for long queries, while shorter or longer sequences appear less
useful. This suggests that combinations of 3 to 5 terms can best capture the
intention in user queries. In addition, we also experiment with weighting the
importance of query sub-sequences using query log frequencies. Our preliminary
tests show promising empirical results. Keywords: information retrieval, term proximity | |||
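A minimal sketch of scoring the proximity of one query n-gram in a document: find the smallest window that contains the n-gram's terms in order and reward tight windows. The scoring form and tokenization are illustrative assumptions, not the paper's retrieval model.

```python
def min_ordered_window(doc_tokens, ngram):
    """Smallest span containing the ngram's terms in order, or None."""
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok != ngram[0]:
            continue
        j, need = i, 1
        while need < len(ngram) and j + 1 < len(doc_tokens):
            j += 1
            if doc_tokens[j] == ngram[need]:
                need += 1
        if need == len(ngram):
            span = j - i + 1
            best = span if best is None else min(best, span)
    return best

def proximity_score(doc_tokens, ngram):
    span = min_ordered_window(doc_tokens, ngram)
    return 0.0 if span is None else len(ngram) / span

doc = "cheap flights from new york to paris in may".split()
print(proximity_score(doc, ["new", "york", "paris"]))  # 3 terms, 4-token window -> 0.75
```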
| Identifying regional sensitive queries in web search | | BIBAK | Full-Text | 1185-1186 | |
| Srinivas Vadrevu; Ya Zhang; Belle Tseng; Gordon Sun; Xin Li | |||
| In Web search ranking, the expected results for some queries could vary
greatly depending upon the location of the user. We name such queries regional
sensitive queries. Identifying regional sensitivity of queries is important to
meet users' needs. The objective of this work is to identify whether a user
expects only regional results for a query. We present three novel features
generated from search logs and build a meta query classifier to identify
regional sensitive query. Experimental results show that the proposed method
achieves high accuracy in identifying regional sensitive queries. Keywords: query classification, regional sensitive, search | |||
| Offline matching approximation algorithms in exchange markets | | BIBAK | Full-Text | 1187-1188 | |
| Zeinab Abbassi; Laks V. S. Lakshmanan | |||
| Motivated by several marketplace applications on rapidly growing online
social networks, we study the problem of efficient offline matching algorithms
for online exchange markets. We consider two main models of one-shot markets
and exchange markets over time. For one-shot markets, we study three main
variants of the problem: one-to-one exchange market problem, exchange market
problem with short cycles, and probabilistic exchange market problem. We show
that all the above problems are NP-hard, and propose heuristics and
approximation algorithms for these problems. Experiments show that the number
of items exchanged will increase when exchanges through cycles are allowed.
Exploring algorithms for markets over time is an interesting direction for
future work. Keywords: edge partitioning, exchange markets, recommendation algorithms, set packing,
social networks | |||
| An efficient two-phase service discovery mechanism | | BIBAK | Full-Text | 1189-1190 | |
| Shuiguang Deng; Zhaohui Wu; Jian Wu; Ying Li | |||
| We put forward a two-phase semantic service discovery mechanism that
supports both operation matchmaking and operation-composition matchmaking.
A series of experiments on a service management framework shows that the
mechanism achieves better performance in both discovery recall and precision
than a traditional matchmaker. Keywords: bipartite graph matching, composition, service discovery, service
matchmaking, web service | |||
| User oriented link function classification | | BIBAK | Full-Text | 1191-1192 | |
| Mingliang Zhu; Weiming Hu; Ou Wu; Xi Li; Xiaoqin Zhang | |||
| Currently, most link-related applications treat all links in the same web
page as identical. A link-related application usually requires one particular
property of hyperlinks, but in fact not all links have this property, or they
have it to different degrees. Based on a study of how human users
judge the links, the idea of the link function classification (LFC) is
introduced in this paper. The link functions reflect the purpose that links are
created by web page designers and the way they are used by viewers. Links in a
certain function class imply one certain relationship between the adjacent
pages, and thus they can be assumed to have similar properties. An algorithm is
proposed to analyze the link functions based on both vision and structure
features, simulating how human users react to links. Current
applications can be enhanced by LFC with a more accurate modeling of the web
graph. New mining methods can be also developed by making more and stronger
assumptions on links within each function class due to the purer property set
they share. Keywords: link function classification, structure features, vision features, web
mining | |||
| Extraction and mining of an academic social network | | BIBAK | Full-Text | 1193-1194 | |
| Jie Tang; Jing Zhang; Limin Yao; Juanzi Li | |||
| This paper addresses several key issues in extraction and mining of an
academic social network: 1) extraction of a researcher social network from the
existing Web; 2) integration of the publications from existing digital
libraries; 3) expertise search on a given topic; and 4) association search
between researchers. We developed a social network system, called ArnetMiner,
based on the proposed solutions to these problems. In total, 448,470 researcher
profiles and 981,599 publications were extracted/integrated after the system
had been in operation for two years. The paper describes the architecture
and main features of the system. It also briefly presents the experimental
results of the proposed methods. Keywords: expertise search, information extraction, social network | |||
| Keyword extraction for contextual advertisement | | BIBAK | Full-Text | 1195-1196 | |
| Xiaoyuan Wu; Alvaro Bolivar | |||
| As the largest online marketplace, eBay strives to promote its inventory
throughout the Web via different types of online advertisement. Contextually
relevant links to eBay assets on third-party sites are one example of such
advertising avenues. Keyword extraction is the task at the core of any
contextual advertisement system. In this paper, we explore a machine learning
approach to this problem. The proposed solution uses linear and logistic
regression models learned from human-labeled data, combined with document, text
and eBay specific features. In addition, we propose a solution to identify the
prevalent category of eBay items in order to solve the problem of keyword
ambiguity. Keywords: contextual advertisement, keyword extraction | |||
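A hedged sketch of the learning setup described: candidate phrases are scored by a logistic regression model over simple document and text features. The features, training data, and candidates below are invented placeholders; the eBay-specific signals the paper uses are not reproduced.

```python
from sklearn.linear_model import LogisticRegression

def features(phrase, doc):
    """Three toy features; the paper's feature set is much richer."""
    text = doc.lower()
    return [
        text.count(phrase),                    # frequency in the document
        1.0 if phrase in text[:100] else 0.0,  # appears near the top
        len(phrase.split()),                   # phrase length in words
    ]

doc = "apple iphone 8gb unlocked. brand new apple iphone, ships fast."
candidates = ["apple iphone", "ships fast", "brand"]

# Invented labeled examples: feature vector -> 1 if a good ad keyword.
X = [[3, 1, 2], [1, 0, 2], [1, 0, 1], [0, 0, 2]]
y = [1, 0, 0, 0]
model = LogisticRegression().fit(X, y)

ranked = sorted(candidates,
                key=lambda p: model.predict_proba([features(p, doc)])[0][1],
                reverse=True)
print(ranked[0])  # 'apple iphone' ranks highest
```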
| Which "Apple" are you talking about? | | BIBAK | Full-Text | 1197-1198 | |
| Mandar A. Rahurkar; Dan Roth; Thomas S. Huang | |||
| In higher-level tasks such as clustering of web search results or word
sense disambiguation, knowledge of all the distinct concepts that an
ambiguous word can express would be advantageous -- for instance, in
determining the number of clusters when clustering web search results. We
propose an algorithm to generate such a ranked list of distinct concepts
associated with an ambiguous word, with concepts that are more popular in
usage ranked higher. We evaluate the coverage of the concepts inferred by
our algorithm on the results retrieved by querying the ambiguous word on a
major search engine, and show a coverage of 85% for the top 30 documents,
averaged over all keywords. Keywords: clustering, wikipedia | |||
| Making BPEL flexible: adapting in the context of coordination constraints using WS-BPEL | | BIBAK | Full-Text | 1199-1200 | |
| Yunzhou Wu; Prashant Doshi | |||
| Although WS-BPEL is emerging as the prominent language for modeling
executable business processes, its support for designing flexible processes is
limited.
An important need of many adaptive processes is for concurrent activities in
the process to respect coordination constraints, which require that concurrent
activities coordinate their behaviors in response to exogenous events. We show
how coordination-inducing constraints may be represented in WS-BPEL, and use
generalized adaptation and constraint enforcement models to transform a
traditional BPEL process into an adaptive one. The final outcome is an
executable WS-BPEL process, without extensions, able to adapt to events while
respecting coordination constraints between activities. Keywords: adaptation, bpel, coordination, web services | |||
| Speeding up web service composition with volatile information | | BIBAK | Full-Text | 1201-1202 | |
| John F. Harney; Prashant Doshi | |||
| This paper introduces a novel method for composing Web services in the
presence of external volatile information. Our approach, which we call
informed-presumptive, is compared to previous state-of-the-art approaches for
Web service composition in volatile environments. We show empirically that the
informed-presumptive strategy produces compositions in significantly less time
than the other strategies, with fewer backtracks. Keywords: adaptation, expiration times, volatile environments, web services | |||
| Context-sensitive QoS model: a rule-based approach to web service composition | | BIBAK | Full-Text | 1203-1204 | |
| Tao Zhou; Xiaolin Zheng; William Wei Song; Xiaofeng Du; Deren Chen | |||
| Generally, web services are provided with different QoS values, so they can
be selected dynamically in the service composition process. However, the
conventional context-free composition QoS model does not consider the
changeability of QoS values or the context-sensitive constraints that arise
during composition. In this paper, we propose a rule-based context-sensitive
QoS model to support both. By considering context in the QoS model, web
service composition can be applied widely and flexibly to real-world
business. Keywords: QoS, context model, rule, web service composition | |||
| A unified framework for name disambiguation | | BIBAK | Full-Text | 1205-1206 | |
| Jie Tang; Jing Zhang; Duo Zhang; Juanzi Li | |||
| The name ambiguity problem has been a challenging issue for a long time. In
this paper, we make a thorough investigation of the whole problem.
Specifically, we formalize the name disambiguation problem in a unified
framework that can incorporate both attributes and relationships into a
probabilistic model. We explore a dynamic approach for automatically estimating
the number of persons K and employ an adaptive distance measure to estimate the
distance between objects. Experimental results show that our proposed framework
can significantly outperform the baseline method. Keywords: digital library, name disambiguation, probabilistic model | |||
| Low-load server crawler: design and evaluation | | BIBAK | Full-Text | 1207-1208 | |
| Katsuko T. Nakahira; Tetsuya Hoshino; Yoshiki Mikami | |||
| This paper proposes a method of crawling Web servers connected to the
Internet without imposing a high processing load. We are using the crawler for
a field survey of the digital divide, including the ability to connect to the
network. Rather than employing a normal Web 'page' crawling algorithm, which
usually collects all pages found on the target server, we have developed a
'server' crawling algorithm that collects only a minimal set of pages from each
server, achieving low-load, high-speed crawling of servers. Keywords: global digital divide, server crawler | |||
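The following sketch shows one way such a server crawler could look: fetch the front page, sample at most a couple of same-host links, record server-level facts, and move on. The page budget, politeness delay, and link-sampling rule are assumptions, not the paper's exact algorithm.

```python
import time
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl_server(base_url, page_budget=3, delay=1.0):
    record = {"url": base_url, "reachable": False, "server": None}
    try:
        with urllib.request.urlopen(base_url, timeout=10) as resp:
            record["reachable"] = True
            record["server"] = resp.headers.get("Server")
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return record                       # record the failure and move on
    parser = LinkParser()
    parser.feed(html)
    same_host = [l for l in parser.links if l.startswith("/")]
    for path in same_host[: page_budget - 1]:   # tiny fixed budget per server
        time.sleep(delay)                       # keep the server load low
        try:
            urllib.request.urlopen(base_url.rstrip("/") + path, timeout=10).close()
        except OSError:
            pass
    return record

print(crawl_server("http://example.com/"))
```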
| KC3 browser: semantic mash-up and link-free browsing | | BIBAK | Full-Text | 1209-1210 | |
| Michiaki Iwazume; Ken Kaneiwa; Koji Zettsu; Takafumi Nakanishi; Yutaka Kidawara; Yasushi Kiyoki | |||
| This paper proposes a general framework for a system with a semantic
browsing and visualization interface called the Knowledge Communication,
Collaboration and Creation Browser (KC3 Browser). KC3 Browser integrates
multimedia contents and web services on grid networks, and builds a semantic
mash-up called a knowledge workspace (k-workspace) with various visual gadgets
according to users' contexts (e.g., their interests, purposes, and
computational environments). KC3 Browser also achieves link-free browsing for
seamless knowledge access by generating semantic links based on arbitrary
knowledge models such as ontologies and vector space models. It helps users
survey and make sense of various social and natural events from web contents.
We have implemented a prototype of KC3 Browser and applied it to an
international project on risk intelligence against natural disasters. Keywords: semantic browsing, semantic mash-up, semantic web | |||
| Generating hypotheses from the web | | BIBAK | Full-Text | 1211-1212 | |
| Wei Jin; Rohini Srihari; Abhishek Singh | |||
| Hypothesis generation is a crucial initial step for making scientific
discoveries. This paper addresses the problem of automatically discovering
interesting hypotheses from the web. Given a query containing one or two
entities of interest, our algorithm automatically generates a semantic profile
describing the specified entity or provides the potential connections between
two entities of interest. We implemented a prototype on top of the Google
search engine and the experimental results demonstrate the effectiveness of our
algorithms. Keywords: hypothesis generation, link analysis, web search | |||
| Using graphics processors for high-performance IR query processing | | BIBAK | Full-Text | 1213-1214 | |
| Shuai Ding; Jinru He; Hao Yan; Torsten Suel | |||
| Web search engines are facing formidable performance challenges as they need
to process thousands of queries per second over billions of documents. To deal
with this heavy workload, current engines use massively parallel architectures
of thousands of machines that require large hardware investments.
We investigate new ways to build such high-performance IR systems based on Graphical Processing Units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains including databases and scientific computing [2,3,5], but we are not aware of previous attempts to use GPUs for large-scale web search. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, and to describe how to perform highly efficient query processing within such an architecture. Preliminary experimental results based on a prototype implementation suggest that significant gains in query processing performance might be obtainable with such an approach. Keywords: gpu, query processing | |||
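The core idea -- scoring many documents for a query in one data-parallel pass -- can be sketched with NumPy arrays standing in for GPU kernels. This is only an analogy to the data-parallel style the paper exploits; the actual system runs custom GPU code over compressed inverted indexes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, vocab = 20_000, 500
tf = rng.poisson(0.05, size=(num_docs, vocab)).astype(np.float32)

# Inverse document frequency, computed once for the collection.
df = np.maximum((tf > 0).sum(axis=0), 1)
idf = np.log(num_docs / df).astype(np.float32)

def topk(query_terms, k=10):
    """Score every document in one vectorized pass, then select top-k."""
    q = np.zeros(vocab, dtype=np.float32)
    q[query_terms] = 1.0
    scores = tf @ (q * idf)              # one big matrix-vector product
    idx = np.argpartition(scores, -k)[-k:]
    return idx[np.argsort(scores[idx])[::-1]]

print(topk([3, 17, 256]))
```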
| A framework for fast community extraction of large-scale networks | | BIBAK | Full-Text | 1215-1216 | |
| Yutaka I. Leon-Suematsu; Kikuo Yuta | |||
| Most of the faster community extraction algorithms are based on the Clauset,
Newman and Moore (CNM) algorithm, which has been employed for networks with up
to 500,000 nodes. The modification proposed by Danon, Diaz and Arenas (DDA)
obtains better modularity than CNM and its variations, but, as its authors
noted, no improvement in speed. In this paper, we identify some inefficiencies
in the data structure employed by these former algorithms. We propose a new
framework for the algorithm and a modification of DDA that makes it applicable
to large-scale networks. For instance, the community extraction of a network
with 1 million nodes and 5 million edges was performed in about 14 minutes, in
contrast to the former CNM, which required 45 hours (192 times faster than the
former CNM, with better modularity).
The scalability of our improvements is shown by applying it to networks with sizes up to 10 million nodes, obtaining the best modularity and execution time compared to the former algorithms. Keywords: clustering, community analysis, large-scale networks | |||
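The baseline being accelerated is CNM-style greedy modularity maximization: repeatedly merge the pair of communities yielding the largest modularity gain. For small graphs, NetworkX ships a reference implementation of this greedy algorithm, shown below; the paper's contribution is a data structure that makes the same process feasible at millions of nodes.

```python
import networkx as nx
from networkx.algorithms.community import (greedy_modularity_communities,
                                           modularity)

G = nx.karate_club_graph()                       # small classic test network
communities = greedy_modularity_communities(G)   # CNM greedy merging
print([sorted(c) for c in communities])
print("modularity:", modularity(G, communities))
```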
| Enabling secure digital marketplace | | BIBAK | Full-Text | 1217-1218 | |
| Hongxia Jin; Vladimir Zbarsky | |||
| The fast development of the Web provides new ways for effective distribution
of network-based digital goods. A digital marketplace provides a platform to
enable Web users to effectively acquire, share, market and distribute digital
content. However, the success of the digital marketplace business models hinges
on securely managing the digital rights and usage of the digital content. For
example, the digital content should be consumable only by paying users. This
paper describes a Web-based system that enables the secure exchange of digital
content between Web users and prevents users from illegally re-selling the
digital content. Part of our solution is based on broadcast encryption
technology. Keywords: content protection, download, drm, security | |||
| Extracting XML schema from multiple implicit XML documents based on inductive reasoning | | BIBAK | Full-Text | 1219-1220 | |
| Masaya Eki; Tadachika Ozono; Toramatsu Shintani | |||
| We propose a method of classifying XML documents and extracting XML schema
from XML by inductive inference based on constraint logic programming. The goal
of this work is to type a large collection of XML approximately but
efficiently. The method can also process XML code written in a different
schema, or even code that is schema-less. Our approach is intended to achieve
identification based on the syntax and semantics of the XML documents through
information extraction using an ontology, and to support retrieval and data
management. Our approach has three steps: the first translates XML into
predicates; the second compares predicates and classifies structures that
represent similar meanings in different forms; and the last converts
predicates into rules using the ontology and maintains the XML Schema. We
evaluate the similarity of data types and data ranges using an ontology
dictionary, and the XML Schema is built from the results of the second and
last steps. Keywords: inductive reasoning, predicate logic, xml | |||
| Visualizing historical content of web pages | | BIBAK | Full-Text | 1221-1222 | |
| Adam Jatowt; Yukiko Kawai; Katsumi Tanaka | |||
| Recently, along with the rapid growth of the Web, preservation efforts have
also increased. As a consequence, large amounts of past Web data are stored in
Web archives. This historical data can be used to better understand long-term
page topics and characteristics. In this paper, we propose an interactive
visualization system called Page History Explorer for exploring page
histories. It allows for roughly portraying the evolution of pages and
summarizing their content over time. We use a temporal term cloud as a
structure for visualizing prevailing and active terms appearing on pages in
the past. Keywords: history summarization, page history visualization, past web, web archive | |||
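A temporal term cloud reduces to per-snapshot term weighting. The sketch below counts terms in each archived version of a page and flags terms newly active in a snapshot; this plain-frequency weighting is an assumed stand-in for the system's actual scoring.

```python
from collections import Counter

snapshots = {  # date -> page text (invented archive snapshots)
    "2006-01": "conference program committee call for papers",
    "2007-01": "conference registration program keynote papers",
    "2008-01": "conference keynote workshops registration beijing",
}

seen = set()
for date, text in sorted(snapshots.items()):
    counts = Counter(text.split())
    top = counts.most_common(3)              # "prevailing" terms by frequency
    flags = [(t, n, "NEW" if t not in seen else "") for t, n in top]
    seen.update(counts)
    print(date, flags)                       # terms tagged NEW are newly active
```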
| Integrating the IAC neural network in ontology mapping | | BIBAK | Full-Text | 1223-1224 | |
| Ming Mao; Yefei Peng; Michael Spring | |||
| Ontology mapping seeks to find semantic correspondences between similar
elements of different ontologies. This paper proposes a neural network based
approach to search for a global optimal solution that best satisfies ontology
constraints. Experiments on OAEI benchmark tests show it dramatically improves
the performance of preliminary mapping results. Keywords: constraint satisfaction problem (csp), interactive activation and
competition (IAC) neural network, ontology mapping, prior+ | |||
| Fast algorithms for top-k personalized pagerank queries | | BIBAK | Full-Text | 1225-1226 | |
| Manish Gupta; Amit Pathak; Soumen Chakrabarti | |||
| In entity-relation (ER) graphs (V,E), nodes V represent typed entities and
edges E represent typed relations. For dynamic personalized PageRank queries,
nodes are ranked by their steady-state probabilities obtained using the
standard random surfer model. In this work, we propose a framework to answer
top-k graph conductance queries. Our top-k ranking technique leads to a 4X
speedup, and overall, our system executes queries 200-1600X faster than
whole-graph PageRank. Some queries might contain hard predicates, i.e.,
predicates that must be satisfied by the answer nodes. For example, we may seek
authoritative papers on public-key cryptography, but only those written during
1997. We extend our system to handle hard predicates. Our system achieves these
substantial query speedups while consuming only 10-20% of the space taken by a
regular text index. Keywords: hubrank, node-deletion, pagerank, personalized, top-k | |||
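The quantity being computed is standard personalized PageRank: a random walk that restarts at the query's preference nodes with some probability, with nodes ranked by steady-state probability. The naive power iteration below shows that definition on a toy graph; the paper's indexing and top-k pruning, which provide the reported speedups, are not reflected here.

```python
import numpy as np

adj = {0: [1, 2], 1: [2], 2: [0], 3: [2]}    # toy ER graph: node -> out-links
n, alpha = 4, 0.2                            # alpha = restart probability
P = np.zeros((n, n))
for u, outs in adj.items():
    for v in outs:
        P[u, v] = 1.0 / len(outs)            # row-stochastic transition matrix

restart = np.zeros(n)
restart[0] = 1.0                             # personalize the walk on node 0
pi = np.full(n, 1.0 / n)
for _ in range(100):                         # power iteration to steady state
    pi = alpha * restart + (1 - alpha) * pi @ P
print(np.argsort(pi)[::-1])                  # nodes ranked for this query
```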
| Reasoning about similarity queries in text retrieval tasks | | BIBAK | Full-Text | 1227-1228 | |
| Xiaohui Yu; Yang Liu | |||
| In many text retrieval tasks, it is highly desirable to obtain a "similarity
profile" of the document collection for a given query. We propose
sampling-based techniques to address this need, using calibration techniques to
improve the accuracy. Experimental results confirm the effectiveness of the
proposed approaches. Keywords: calibration, query evaluation, sampling, text retrieval | |||
| Mashups for semantic user profiles | | BIBAK | Full-Text | 1229-1230 | |
| Riddhiman Ghosh; Mohamed Dekhil | |||
| In this paper, we discuss challenges and provide solutions for capturing and
maintaining accurate models of user profiles using semantic web technologies,
by aggregating and sharing distributed fragments of user profile information
spread over multiple services. Our framework for profile management allows for
evolvable, extensible and expressive user profiles. We have implemented a
prototype, targeting the retail domain, on the HP Labs Retail Store Assistant. Keywords: information management, personalization, semantic web, user profiles | |||
| Using CEP technology to adapt messages exchanged by web services | | BIBAK | Full-Text | 1231-1232 | |
| Yehia Taher; Marie-Christine Fauvet; Marlon Dumas; Djamal Benslimane | |||
| Web services may be unable to interact with each other because of
incompatibilities between their interfaces. In this paper, we present an event
driven approach which aims at adapting messages exchanged during service
interactions. The proposed framework relies on the Complex Event Processing
(CEP) technology, which provides an environment for the development of
applications that need to continuously process, analyse and respond to event
streams. Our main contribution is a system that enables developers to design
and implement CEP-based adapters. The latter are deployed in a CEP engine,
which is responsible for continuously receiving messages and processing them
according to the rules implemented by the adapters. The resulting transformed
messages are then forwarded to their original service recipients. Keywords: cep technology, web service adaptation | |||
| Finding core members in virtual communities | | BIBAK | Full-Text | 1233-1234 | |
| Haiqiang Chen; Xueqi Cheng; Yue Liu | |||
| Finding the core members of a virtual community is an important problem in
community analysis. Here we present a simulated annealing algorithm that
solves this problem by optimizing the user-interest concentration ratio in
user groups. As an example, we test this algorithm on a virtual community site
and evaluate its results against a human-annotated "gold standard". Keywords: core members finding, simulated annealing, virtual community | |||
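A minimal simulated annealing loop for the stated objective might look as follows, with "concentration" modeled as the share of the single most common interest in a fixed-size group. The objective form, group size, and cooling schedule are assumptions for illustration.

```python
import math, random

random.seed(1)
interests = {  # member -> dominant interest tag (invented data)
    "u1": "python", "u2": "python", "u3": "python", "u4": "golf",
    "u5": "chess", "u6": "python", "u7": "golf", "u8": "chess",
}
members = list(interests)
SIZE = 4                       # assumed fixed core-group size

def concentration(group):
    """Share of the most common interest within the group."""
    tags = [interests[m] for m in group]
    return max(tags.count(t) for t in set(tags)) / len(tags)

state = set(random.sample(members, SIZE))
temp = 1.0
for _ in range(3000):
    # Propose swapping one member inside the group for one outside.
    inside = random.choice(sorted(state))
    outside = random.choice([m for m in members if m not in state])
    cand = (state - {inside}) | {outside}
    delta = concentration(cand) - concentration(state)
    if delta >= 0 or random.random() < math.exp(delta / temp):
        state = cand
    temp *= 0.998              # geometric cooling
print(sorted(state), concentration(state))   # e.g. the four python fans
```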
| Improving personalized services in mobile commerce by a novel multicriteria rating approach | | BIBAK | Full-Text | 1235-1236 | |
| Qiudan Li; Chunheng Wang; Guanggang Geng | |||
| With the rapid growth of wireless technologies and mobile devices, there is
a great demand for personalized services in m-commerce. Collaborative filtering
(CF) is one of the most successful techniques for producing personalized recommendations
for users. This paper proposes a novel approach to improve CF algorithms, where
the contextual information of a user and the multicriteria ratings of an item
are considered besides the typical information on users and items. The
multilinear singular value decomposition (MSVD) technique is utilized to
explore both explicit relations and implicit relations among user, item and
criterion. We implement the approach in an existing m-commerce platform, and
encouraging experimental results demonstrate its effectiveness. Keywords: collaborative filtering, m-commerce, personalized service | |||
| Automatic web image selection with a probabilistic latent topic model | | BIBAK | Full-Text | 1237-1238 | |
| Keiji Yanai | |||
| We propose a new method to select images relevant to given keywords from
images gathered from the Web, based on the Probabilistic Latent Semantic
Analysis (PLSA) model, a probabilistic latent topic model originally proposed
for text document analysis. The experimental results show that the proposed
method is almost equivalent to, or outperforms, existing methods. In addition,
we show that our method can select more diverse images than existing SVM-based
methods. Keywords: image recognition, web image mining | |||
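PLSA models co-occurrence counts as a mixture over latent topics, P(d,w) = sum_z P(z)P(d|z)P(w|z), fit by EM. The sketch below implements only that EM core on a toy image-by-term count matrix; the visual features and the image-selection rule built on top are the paper's and are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(8, 6)).astype(float)  # images x "words" counts
n_img, n_word, K = X.shape[0], X.shape[1], 2

# P(z), P(image|z), P(word|z), randomly initialized then normalized.
pz = np.full(K, 1.0 / K)
pi_z = rng.random((K, n_img)); pi_z /= pi_z.sum(axis=1, keepdims=True)
pw_z = rng.random((K, n_word)); pw_z /= pw_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(z|image,word), shape (K, n_img, n_word).
    joint = pz[:, None, None] * pi_z[:, :, None] * pw_z[:, None, :]
    post = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
    # M-step: re-estimate parameters from expected counts.
    nz = (post * X).sum(axis=(1, 2))
    pz = nz / nz.sum()
    pi_z = (post * X).sum(axis=2) / nz[:, None]
    pw_z = (post * X).sum(axis=1) / nz[:, None]

print("dominant topic per image:", pi_z.argmax(axis=0))
```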
| R-U-in?: doing what you like, with people whom you like | | BIBAK | Full-Text | 1239-1240 | |
| Nilanjan Banerjee; Dipanjan Chakraborty; Koustuv Dasgupta; Sumit Mittal; Seema Nagar | |||
| This paper presents R-U-In? -- a social networking application that
leverages Web 2.0 and IMS-based Converged Networks technologies to create a
rich next-generation service. R-U-In? allows a user to search (in real-time)
and solicit participation of like-minded partners for an activity of mutual
interest (e.g. a rock concert, a soccer game, or a movie). It is an example of
a situational mashup application that exploits content and capabilities of a
Telecom operator, blended with Web 2.0 technologies, to provide an enhanced,
value-added service experience. Keywords: NGN, mashups, social networking, web 2.0 | |||
| Behavioral classification on the click graph | | BIBAK | Full-Text | 1241-1242 | |
| Martin Szummer; Nick Craswell | |||
| A bipartite query-URL graph, where an edge indicates that a document was
clicked for a query, is a useful construct for finding groups of related
queries and URLs. Here we use this behavior graph for classification. We choose
a click graph sampled from two weeks of image search activity, and the task of
"adult" filtering: identifying content in the graph that is inappropriate for
minors. We show how to perform classification using random walks on this graph,
and describe two methods for estimating classifier parameters. Keywords: classification, click data | |||
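Label propagation by random walk on the bipartite click graph can be sketched as follows: a few nodes carry known labels, and unlabeled nodes repeatedly average the label mass of their click neighbors. This mirrors the style of walk described; the paper's two parameter-estimation methods are not reproduced, and the graph below is invented.

```python
import numpy as np

nodes = ["q:kittens", "q:xxx", "u:pets.com", "u:adult.example"]
edges = [(0, 2), (1, 3), (0, 3)]             # query-URL click pairs
seeds = {1: 1.0, 0: 0.0}                     # known labels: 1.0 = adult

n = len(nodes)
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0
P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

score = np.full(n, 0.5)                      # unlabeled nodes start neutral
for _ in range(50):
    score = P @ score                        # average neighbors' label mass
    for i, label in seeds.items():
        score[i] = label                     # clamp the labeled seeds
print(dict(zip(nodes, score.round(2).tolist())))
```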
| Budget constrained bidding in keyword auctions and online knapsack problems | | BIBAK | Full-Text | 1243-1244 | |
| Yunhong Zhou; Deeparnab Chakrabarty; Rajan Lukose | |||
| We consider the budget-constrained bidding optimization problem for
sponsored search auctions, and model it as an online (multiple-choice) knapsack
problem. We design both deterministic and randomized algorithms for the online
(multiple-choice) knapsack problems achieving a provably optimal competitive
ratio. This translates back to fully automatic bidding strategies maximizing
either profit or revenue for the budget-constrained advertiser. Our bidding
strategy for revenue maximization is oblivious (i.e., without knowledge) of
other bidders' prices and/or click-through-rates for those positions. We
evaluate our bidding algorithms using both synthetic data and real bidding data
gathered manually, and also discuss a sniping heuristic that strictly improves
bidding performance. With sniping and parameter tuning enabled, our bidding
algorithms can achieve a performance ratio above 90% against the optimum by the
omniscient bidder. Keywords: keyword bidding, multiple-choice knapsack problem, online knapsack problem,
sponsored search auction | |||
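The flavor of such an online knapsack strategy: maintain a threshold on value-per-dollar that rises exponentially as the budget is consumed, and take an opportunity only if it clears the threshold. The sketch below uses the exponential threshold form associated with efficiencies assumed to lie in a known range [L, U]; treat it as an idealized sketch rather than the paper's full bidding strategy.

```python
import math

def make_bidder(budget, L, U):
    """Accept an opportunity iff its efficiency (value/cost) clears a
    threshold that rises from L/e to U as the budget fills up."""
    spent = 0.0
    def consider(value, cost):
        nonlocal spent
        z = spent / budget                   # fraction of budget used
        threshold = (U * math.e / L) ** z * (L / math.e)
        if cost <= budget - spent and value / cost >= threshold:
            spent += cost
            return True                      # place this bid
        return False
    return consider

bid = make_bidder(budget=100.0, L=1.0, U=10.0)
stream = [(9.0, 2.0), (0.5, 2.0), (30.0, 5.0), (2.0, 2.0)]
print([bid(v, c) for v, c in stream])        # [True, False, True, True]
```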
| Social and semantics analysis via non-negative matrix factorization | | BIBAK | Full-Text | 1245-1246 | |
| Zhi-li Wu; Chi-Wa Cheng; Chun-hung Li | |||
| Social media such as Web forums often have dense interactions between users
and content, for which network models are appropriate for analysis. A joint
non-negative matrix factorization model of participation and content data,
which can be viewed as a bipartite graph model between users and media, is
proposed for analyzing social media. The factorizations allow simultaneous automatic
discovery of leaders and sub-communities in the Web forum as well as the core
latent topics in the forum. Results on topic detection of Web forums and
cluster analysis show that social features are highly effective for forum
analysis. Keywords: algorithm, graph theory | |||
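The building block is NMF itself: factor a nonnegative user-by-thread participation matrix as A ~ WH, where the columns of H act as latent sub-communities or topics and the rows of W give each user's loading on them. Standard Lee-Seung multiplicative updates are shown below; the paper's joint factorization couples a second content matrix, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5)) * (rng.random((6, 5)) > 0.4)   # users x threads
k = 2
W = rng.random((6, k)) + 0.1
H = rng.random((k, 5)) + 0.1

for _ in range(200):                                   # Lee-Seung updates
    H *= (W.T @ A) / np.maximum(W.T @ W @ H, 1e-12)
    W *= (A @ H.T) / np.maximum(W @ H @ H.T, 1e-12)

print("reconstruction error:", float(np.linalg.norm(A - W @ H)))
print("top user per latent community:", W.argmax(axis=0))
```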
| Incremental web page template detection | | BIBAK | Full-Text | 1247-1248 | |
| Yu Wang; Bingxing Fang; Xueqi Cheng; Li Guo; Hongbo Xu | |||
| Most template detection methods process web pages in batches, so that a
newly crawled page cannot be processed until enough pages have been collected.
This results in large storage consumption and a long delay in data refreshing.
In this paper, we present an incremental framework that detects templates by
processing each page as soon as it has been crawled. In this framework, we do
not need to cache any web pages. Experiments show that our framework consumes
less than 7% of the storage of traditional methods, and data refreshing is
also accelerated by the incremental processing. Keywords: automatic template removal, web page segmentation | |||
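One plausible instantiation of incremental, cache-free template detection: keep running counts of block fingerprints (a hash of a DOM path plus its text) and drop a block once its count crosses a support threshold. The fingerprint and threshold below are assumptions for illustration; the paper's segmentation and statistics are richer.

```python
from collections import Counter

block_counts = Counter()
SUPPORT = 3          # blocks seen on >= 3 pages are considered template

def process_page(blocks):
    """blocks: list of (dom_path, text) segments from one crawled page.
    Statistics are updated and the page is then discarded, not cached."""
    content = []
    for path, text in blocks:
        fp = hash((path, text))
        block_counts[fp] += 1
        if block_counts[fp] < SUPPORT:
            content.append(text)        # not (yet) identified as template
    return content

pages = [
    [("body/div#nav", "home | about"), ("body/div#main", f"article {i}")]
    for i in range(4)
]
for p in pages:
    print(process_page(p))  # the nav block is dropped from the third page on
```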
| Rogue access point detection using segmental TCP jitter | | BIBAK | Full-Text | 1249-1250 | |
| Gaogang Xie; Tingting He; Guangxing Zhang | |||
| Rogue Access Points (RAPs) pose serious security threats to local networks.
An analytic model of the prior probability distribution of Segmental TCP
Jitter (STJ) is deduced from the mechanism of the IEEE 802.11 MAC Distributed
Coordination Function (DCF) and used to differentiate between wired and WLAN
connections, the crucial step in detecting RAPs. As a detection metric, STJ
reflects the characteristics of the 802.11 MAC better than ACK-Pairs, since it
eliminates the delay caused by packet transmission. Experiments on an
operational network show that the average detection ratio of the STJ-based
algorithm is more than 92.8% and the average detection time is less than 1 s,
improvements of 20% and 60%, respectively, over the ACK-Pair approach.
Furthermore, no WLAN training trace is needed by the detection algorithm. Keywords: analytic model, rogue ap, segmental tcp jitter, sequential hypothesis
testing | |||
| Towards the policy-aware web: the real web 3.0? | | BIBAK | Full-Text | 1251-1252 | |
| Renato Iannella | |||
| The sharing and control of information on the Web needs to be balanced to
provide the web community with the best experience and outcomes. The
Policy-Aware Web is an emerging direction that may hold the key in getting this
balance right by allowing future policy languages to maintain this
harmonization. This panel will discuss, highlight, and debate the challenges in
moving towards the Policy-Aware Web and the paths it could take. The target
audience includes all web stakeholders as most policies, such as privacy and
rights, are intrinsic to all web users, information, services and content. Keywords: identity, policy-aware web, privacy, rights | |||
| Protecting the web: phishing, malware, and other security threats | | BIBAK | Full-Text | 1253-1254 | |
| Greg Aaron; Katharine A. Bostik; Rod Rasmussen; Edmon Chung | |||
| Web site operators, Internet users, and online service providers are
besieged by a growing array of abuses and threats. Spam leads users to online
scams and phishing Web pages, which cyber-criminals implant on innocent sites
to fool the unwary into revealing their financial data and passwords. Other
criminals use Web sites to spread malware, which can steal personal data or
take over users' computers into a botnet, which can be used to send spam or
mount cyber-attacks against Web sites and other Internet services. Together,
these abuses undermine user trust, hamper e-commerce, and cost the Internet
community huge losses in money, service and support costs, and time.
What should Web site operators and online service providers do to protect themselves and their users? What are Internet companies, organizations, and law enforcement doing (and not doing) to combat these problems? And how can the international Internet community work together on these problems? The panel brings together representatives from the chain of organizations that respond to Internet abuse problems, and promises a lively, compelling, and relevant discussion. Keywords: dns, security | |||
| The future of online social interactions: what to expect in 2020 | | BIBAK | Full-Text | 1255-1256 | |
| Ajita John; Lada Adamic; Marc Davis; Frank Nack; David A. Shamma; Doree D. Seligmann | |||
| Blogs, wikis, tagging, podcasts, and social networking websites such as
MySpace, Facebook, Flickr and YouTube have radically changed user interactions
on the World Wide Web from a static, one-way, consumption model to a dynamic,
multi-way, participation model. Broad user power and flexibility have changed
how people engage in and experience their interconnections, interests, and
collaborations. Online social interactions will evolve in the next decade to
address the growing needs of their user community and find their way into many
aspects of our lives. This evolution may very well be among the most exciting
ones of our times where the individual and collective power of people to
contribute and share content, experiences, ideas, expertise etc. may be even
more enhanced than it is today. The enhancement may be shaped through a better
understanding of user needs and behavior and it may be enabled through the
seamless convergence of multi-modal technologies, new applications, new
domains, data mining and better navigational and search capabilities. Some of
these changes will also permeate into the workplace and change the way we work.
This panel will discuss how online social interactions may evolve in the next
decade and what impact they may have on diverse dimensions of our world. Keywords: online interactions, social media, social networks | |||
| Information "uptrieval": exploring models for content assimilation and aggregation for developing regions | | BIBAK | Full-Text | 1257-1258 | |
| Sheetal K. Agarwal; Arun Kumar; Sougata Mukherjea; Amit A. Nanavati; Nitendra Rajput | |||
| Information Retrieval on the WWW is important because it is hard to find
what one is looking for. There is a plethora of information available, and
searching relevant information is a challenge. In the case of developing
regions, we have the opposite problem: (1) Information availability of global
markets is scarce. Most of the consumers and producers (of information as well
as goods) are relegated to local markets in geographical vicinity. In order to
reach wider markets, it is important for this local information to reach wider
audiences. (Local information for global consumption: the LIG model.) (2) At the
same time, locally relevant information, such as delays in bus/train timings,
mobile medical van schedule changes, electricity outage timings, is not easily
available either. (Local information for local consumption: the LIL model.) We
introduce the term Information Uptrieval to address the reverse problem of
acquiring, assimilating, aggregating and uploading global and local information
that is relevant for developing regions to a platform that improves the reach
of the information. While the WWW is an obvious example of one such platform,
given the low internet penetration in such regions, we need to explore
effective alternatives. Several innovative, but disconnected approaches have
been attempted to address the information uptrieval problem, ranging from the
use of DVDs (eSagu, http://www.esagu.in/esagu) through the use of wireless
stations on motorcycles (First Mile Solutions,
http://www.firstmilesolutions.com). Many of these have met with reasonable
success in their pilot deployments. Keywords: developing regions, information management | |||
| Rich media and web 2.0 | | BIBAK | Full-Text | 1259-1260 | |
| Edward Chang; Ken Ong; Susanne Boll; Wei-Ying Ma | |||
| Rich media data, such as video, imagery, music, and gaming, no longer play
just a supporting role to text data on the World Wide Web. Thanks to Web
2.0, rich media is the primary content on sites such as Flickr, PicasaWeb,
YouTube, and QQ. Because of massive user generated content, the volume of rich
media being transmitted on the Internet has surpassed that of text. It is vital
to properly manage these data to ensure efficient bandwidth utilization, to
support effective indexing and search, and to safeguard copyrights (just to
name a few). This panel invites both researchers and practitioners to discuss
the challenges of Web-scale media-data management. In particular, the panelists
will address issues such as leveraging Rich Media and Web 2.0, indexing,
search, and scalability. Keywords: rich media, web 2.0 | |||
| Location and the web (LocWeb 2008) | | BIBAK | Full-Text | 1261-1262 | |
| Susanne Boll; Christopher Jones; Eric Kansa; Puneet Kishor; Mor Naaman; Ross Purves; Arno Scharl; Erik Wilde | |||
| The World Wide Web has become the world's largest networked information
resource, but references to geographical locations remain unstructured and
typically implicit in nature. This lack of explicit spatial knowledge within
the Web makes it difficult to service user needs for location-specific
information. At present, spatial knowledge is hidden in many small information
fragments such as addresses on Web pages, annotated photos with GPS
co-ordinates, geographic mapping applications, and geotags in user-generated
content. Several emerging formats that primarily or secondarily include
location metadata, like GeoRSS, KML, and microformats, aim to improve this
state of affairs. However, the question remains how to extract, index, mine,
find, view, mashup, and exploit Web content using its location semantics. This
workshop brings together researchers from academia and industry labs to
discuss and present the latest results and trends in all facets of the
relationships between location concepts and Web information. Keywords: geographic data, geolocation, geospatial, location, location-based services | |||
| WS3: international workshop on context-enabled source and service selection, integration and adaptation (CSSSIA 2008) | | BIBAK | Full-Text | 1263-1264 | |
| Quan Z. Sheng; Ullas Nambiar; Amit P. Sheth; Biplav Srivastava; Zakaria Maamar; Said Elnaffar | |||
| This write-up provides a summary of the International Workshop on Context
enabled Source and Service Selection, Integration and Adaptation (CSSSIA 2008),
organized in conjunction with WWW 2008, at Beijing, China on April 22nd 2008.
We outline the motivation for organizing the workshop, briefly describe the
organizational details and program of the workshop, and summarize each of the
papers accepted by the workshop. More information about the workshop can be
found at http://www.cs.adelaide.edu.au/~csssia08/. Keywords: context awareness, service and data integration, web service | |||
| Linked data on the web (LDOW2008) | | BIBAK | Full-Text | 1265-1266 | |
| Christian Bizer; Tom Heath; Kingsley Idehen; Tim Berners-Lee | |||
| The Web is increasingly understood as a global information space consisting
not just of linked documents, but also of Linked Data. More than just a vision,
the resulting Web of Data has been brought into being by the maturing of the
Semantic Web technology stack, and by the publication of an increasing number
of datasets according to the principles of Linked Data.
The Linked Data on the Web (LDOW2008) workshop brings together researchers and practitioners working on all aspects of Linked Data. The workshop provides a forum to present the state of the art in the field and to discuss ongoing and future research challenges. In this workshop summary we will outline the technical context in which Linked Data is situated, describe developments in the past year through initiatives such as the Linking Open Data community project, and look ahead to the workshop itself. Keywords: linked data, rdf, semantic web, web of data | |||
| Fourth international workshop on adversarial information retrieval on the web (AIRWeb 2008) | | BIBAK | Full-Text | 1267-1268 | |
| Carlos Castillo; Kumar Chellapilla; Dennis Fetterly | |||
| Adversarial IR in general, and search engine spam, in particular, are
engaging research topics with a real-world impact for Web users, advertisers
and publishers. The AIRWeb workshop will bring researchers and practitioners in
these areas together, to present and discuss state-of-the-art techniques as
well as real-world experiences. Given the continued growth in search engine
spam creation and detection efforts, we expect interest in this edition to
surpass that of the previous three editions of the workshop (held jointly with
WWW 2005, SIGIR 2006, and WWW 2007 respectively). Keywords: adversarial information retrieval, search engine spam, spamdexing, web spam | |||
| First workshop on targeting and ranking for online advertising | | BIBAK | Full-Text | 1269-1270 | |
| Ewa Dominowska; Vanja Josifovski | |||
| Online advertising is a rapidly growing, multi-billion dollar industry. It
has become a significant element of the Web browsing experience. Online
advertising providers use sophisticated ad targeting and ranking algorithms
with the dual aim of maximizing revenue while providing a superior user
experience. As a result, advertising optimization is a very complex research
problem, since it combines relevance with user interaction models, advertiser
valuations, and commercial constraints. Online advertising integrates a number
of core research areas: machine learning, data mining, search, auction theory,
and user modeling.
This workshop is intended to serve as an open forum for discussion of new ideas and current research in the field of online advertising. We expect that the workshop will promote a community of researchers interested in this area and yield future collaboration and exchanges. The research included should address problems faced by advertisers, end-users, advertising platforms, and the market. Keywords: online advertising, ranking, targeting of online advertising | |||
| WS7 -- MobEA VI: personal rich social media | | BIBAK | Full-Text | 1271-1272 | |
| Rittwik Jana; Daniel Appelquist; Galit Zadok; Bin Wei | |||
| 2008 will be the "Mobile Eureka! Moment". With the Olympics just round the
corner and the mobile market experiencing a phenomenal growth, we believe that
WWW 2008 will truly be the climax and focal point in the midst of this mobile
revolution. Mobile Web Initiative spearheaded by W3C is making a strong stand
on how to realize the vision of pervasive mobile computing. Three screen
services (TV, PC, Phone) are demanding new architectures to be developed.
Furthermore, the need to go beyond technology, demands an embrace of the
human-centric aspects of mobile computing. The objective of this workshop is to
provide a single forum for researchers, sociologists, and technologists to
discuss the state-of-the-art, present their contributions, and set future
directions in personalized applications for mobile users with a focus in rich
social media.
The transition from the wired Internet to the mobile web has generated a wide array of novel services bringing in billions of dollars in revenue in the recent years. So there are a few natural questions that one may ask. "What is "the" next big thing in the mobile web as far as applications and services are concerned?", "How will operators provide these services efficiently with this next wave of applications?" and "What are the emerging business models that drive the next generation of services?". In a nutshell, to answer some of the questions above, we believe that in general there is a convergence in the fixed and mobile worlds by means of standards like IP Multimedia Subsystems (IMS) [1]. This convergence allows enterprise and consumer applications to be blended via a unified infrastructure. This unification results in a number of advantages like a) application availability from any access methods, b) enhanced user experience for combined/blended services (presence, messaging, address book etc.), c) centralized user profiles shared between applications and common provisioning, d) flexible charging and billing of multimedia services in particular, e) scalable deployments with guaranteed Quality of Services and last but not least f) secure solutions through built in identity management, authentication and authorization procedures. There are a variety of novel applications that are starting to emerge. A general shift from static content to a more dynamic nature has been observed in recent years. In particular, multimedia and social networking are the foundations of emerging applications. One such example is the delivery of IPTV to the third screen (mobile, PDA). Social networking/messaging (applications like Twitter) and peer-to-peer applications are also on the verge of breakthroughs in the mobile landscape. Interestingly, a lot of the momentum has been reinvigorated with the introduction of smart phones like Apple's iPhone [2]. Devices are becoming more powerful and standards are being written to take advantage of these smart features. Finally, we believe that the mobile web is at the forefront of technology and this workshop will discuss many of these technical and business challenges that are on the wish list of everyday practitioners. Keywords: mobile web, rich media | |||
| Report on semantic web for health care and life sciences workshop | | BIBAK | Full-Text | 1273-1274 | |
| Huajun Chen; Kei Cheung; Michel Dumontier; Eric Prudhommeaux; Alan Ruttenberg; Susie Stephens; Yimin Wang | |||
| The Semantic Web for Health Care and Life Sciences Workshop will be held in
Beijing, China, on April 22, 2008. The goal of the workshop is to foster the
development and advancement in the use of Semantic Web technologies to
facilitate collaboration, research and development, and innovation adoption in
the domains of Health Care and Life Sciences. We also encourage the
participation of all research communities in this event, with enhanced
participation from Asia due to the location of the event. The workshop consists
of two invited keynote talks, eight peer-reviewed presentations, and one panel
discussion. Keywords: biomedical informatics, data integration, health care, life sciences,
ontology, semantic web | |||
| International workshop on question answering on the web (QAWeb2008) | | BIBAK | Full-Text | 1275-1276 | |
| Liu Wenyin; Qing Li; Xuedong Huang | |||
| This half-day, single-track workshop is designed to gather academic
researchers and industrial practitioners to share ideas and know-how, and
to discuss all relevant issues including the business models, enabling
technologies, and killer applications, of Web-based question answering (QA),
especially, the user-interactive QA services and applications. The workshop
program consists of two sessions, one for academic papers, and the other for
industrial practice papers. Each session consists of a leading talk, followed
by three short presentations. Sufficient time is allocated for brainstorming
discussions in each session. Keywords: question answering | |||
| WWW 2008 workshop: NLPIX2008 summary | | BIBAK | Full-Text | 1277-1278 | |
| Hiroshi Nakagawa; Kentaro Torisawa; Masaru Kitsuregawa | |||
| The amount of information available on the Web has increased rapidly,
reaching levels that few would ever have imagined possible. We live in what
could be called the "information-explosion era," and this situation poses new
problems for computer scientists. Users demand useful and reliable information
from the Web in the shortest time possible, but the obstacles to fulfilling
this demand are many, including language barriers and the so-called "long tail."
Even worse, users may provide only vague specifications of the information that
they actually want, so that a more concrete specification must somehow be
inferred by Web access tools. Natural language processing (NLP) is one of the
key technologies for solving the above Web usability problems. Almost all Web
pages provide their essential information in the form of natural language
text, and the amount of this text is huge. In order to offer solutions to
these problems, we must search for and extract information from Web texts
using NLP technologies. The aim of this
workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008) is to
bring researchers and practitioners together in order to discuss our most
pressing needs with respect to accessing information on the Web, and to discuss
new ideas in NLP technologies that might offer viable solutions for those
issues. Keywords: natural language processing, web | |||
| Workshop on social web and knowledge management (SWKM2008) | | BIBAK | Full-Text | 1279-1280 | |
| Peter Dolog; Markus Kroetzsch; Sebastian Schaffert; Denny Vrandecic | |||
| This paper provides an overview of the synergies between the social web and
knowledge management, the workshop's topics and program committee, and a
summary of the accepted papers for the SWKM2008 workshop. Keywords: knowledge management, semantic web, social web, web 2.0 | |||
| WWW 2008 workshop on social web search and mining: SWSM2008 | | BIBK | Full-Text | 1281-1282 | |
| Juanzi Li; Gui-rong Xue; Jie Tang; Ying Ding | |||
Keywords: social web, web mining, web search | |||