| Towards rich query interpretation: walking back and forth for mining query templates | | BIBAK | Full-Text | 1-10 | |
| Ganesh Agarwal; Govind Kabra; Kevin Chen-Chuan Chang | |||
| We propose to mine structured query templates from search logs, for enabling
rich query interpretation that recognizes both query intents and associated
attributes. We formalize the notion of template as a sequence of keywords and
domain attributes, and our objective is to discover templates with high
precision and recall for matching queries in a domain of interest. Our solution
bootstraps from small seed input knowledge to discover relevant query
templates, by harnessing the wealth of information available in search logs. We
model this information in a tri-partite QueST network of queries, sites, and
templates. We propose a probabilistic inferencing framework based on the dual
metrics of precision and recall -- and we show that the dual inferences
correspond respectively to random walks in the backward and forward directions.
We deployed and tested our algorithm over a real-world search log of 15 million
queries. The algorithm achieved accuracy as high as 90% (F-measure), with
little seed knowledge and even with an incomplete domain schema. Keywords: query attributes, query intents, query templates, search log mining | |||
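To make the precision/recall duality above concrete, the toy sketch below (Python) scores templates on a tiny template-query match matrix: a forward walk from a template to the queries it matches behaves like precision, while a backward walk from in-domain queries back to templates behaves like recall. The templates, queries, and counts are invented for illustration; this is not the paper's QueST inference over the full tri-partite network.

```python
import numpy as np

# Toy data: rows are templates, columns are queries, entries are match counts.
# (Illustrative numbers only; the real system mines these from a search log.)
templates = ["flights from #city to #city", "cheap #product", "#city weather"]
queries = ["flights from boston to chicago", "cheap flights", "paris weather", "cheap shoes"]
in_domain = np.array([1, 1, 0, 0])           # 1 if the query belongs to the travel domain

M = np.array([[3, 0, 0, 0],                  # match counts between templates and queries
              [0, 5, 0, 4],
              [0, 0, 6, 0]], dtype=float)

# Forward walk: start at a template, step to the queries it matches.
# The probability mass that lands on in-domain queries behaves like precision.
forward = M / M.sum(axis=1, keepdims=True)
precision_like = forward @ in_domain

# Backward walk: start on in-domain queries, step back to templates.
# The mass a template receives behaves like its contribution to recall.
backward = M * in_domain                      # keep only in-domain columns
backward = backward / max(backward.sum(), 1e-9)
recall_like = backward.sum(axis=1)

for t, p, r in zip(templates, precision_like, recall_like):
    print(f"{t:32s} precision~{p:.2f} recall~{r:.2f}")
```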
| Selecting skyline services for QoS-based web service composition | | BIBAK | Full-Text | 11-20 | |
| Mohammad Alrifai; Dimitrios Skoutas; Thomas Risse | |||
| Web service composition enables seamless and dynamic integration of business
applications on the web. The performance of the composed application is
determined by the performance of the involved web services. Therefore,
non-functional, quality of service aspects are crucial for selecting the web
services to take part in the composition. Identifying the best candidate web
services from a set of functionally-equivalent services is a multi-criteria
decision making problem. The selected services should optimize the overall QoS
of the composed application, while satisfying all the constraints specified by
the client on individual QoS parameters. In this paper, we propose an approach
based on the notion of skyline to effectively and efficiently select services
for composition, reducing the number of candidate services to be considered. We
also discuss how a provider can improve its service to become more competitive
and increase its potential of being included in composite applications. We
evaluate our approach experimentally using both real and synthetically
generated datasets. Keywords: optimization, QoS, service composition, web services | |||
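As a minimal illustration of the skyline idea in the abstract above, the sketch below keeps only candidate services that are not dominated on every QoS attribute, under the usual convention that lower values are better. The attribute names, values, and service identifiers are invented; the paper's composition and constraint handling are not shown.

```python
def dominates(a, b):
    """True if service a is at least as good as b on every QoS attribute
    and strictly better on at least one (lower values are better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(services):
    """Keep only services not dominated by any other candidate."""
    return {name: qos for name, qos in services.items()
            if not any(dominates(other, qos)
                       for o, other in services.items() if o != name)}

# QoS vectors: (latency_ms, cost, failure_rate) -- illustrative values.
candidates = {
    "s1": (120, 0.05, 0.01),
    "s2": (300, 0.02, 0.02),
    "s3": (150, 0.06, 0.02),   # dominated by s1 on all three attributes
    "s4": (90,  0.09, 0.005),
}
print(skyline(candidates))     # s3 is pruned; only skyline services remain
```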
| Money, glory and cheap talk: analyzing strategic behavior of contestants in simultaneous crowdsourcing contests on TopCoder.com | | BIBAK | Full-Text | 21-30 | |
| Nikolay Archak | |||
| Crowdsourcing is a new Web phenomenon, in which a firm takes a function once
performed in-house and outsources it to a crowd, usually in the form of an open
contest.
Designing efficient crowdsourcing mechanisms is not possible without deep understanding of incentives and strategic choices of all participants. This paper presents an empirical analysis of determinants of individual performance in multiple simultaneous crowdsourcing contests using a unique dataset for the world's largest competitive software development portal: TopCoder.com. Special attention is given to studying the effects of the reputation system currently used by TopCoder.com on behavior of contestants. We find that individual specific traits together with the project payment and the number of project requirements are significant predictors of the final project quality. Furthermore, we find significant evidence of strategic behavior of contestants. High rated contestants face tougher competition from their opponents in the competition phase of the contest. In order to soften the competition, they move first in the registration phase of the contest, signing up early for particular projects. Although registration in TopCoder contests is non-binding, it deters entry of opponents in the same contest; our lower bound estimate shows that this strategy generates significant surplus gain to high rated contestants. We conjecture that the reputation + cheap talk mechanism employed by TopCoder has a positive effect on allocative efficiency of simultaneous all-pay contests and should be considered for adoption in other crowdsourcing platforms. Keywords: all-pay auction, cheap talk, contest, crowdsourcing, electronic markets,
entry deterrence, reputation | |||
| Mining advertiser-specific user behavior using adfactors | | BIBAK | Full-Text | 31-40 | |
| Nikolay Archak; Vahab S. Mirrokni; S. Muthukrishnan | |||
| Consider an online ad campaign run by an advertiser. The ad serving
companies that handle such campaigns record users' behavior that leads to
impressions of campaign ads, as well as users' responses to such impressions.
This is summarized and reported to the advertisers to help them evaluate the
performance of their campaigns and make better budget allocation decisions.
The most popular reporting statistics are the click-through rate and the conversion rate. While these are indicative of the effectiveness of an ad campaign, the advertisers often seek to understand more sophisticated long-term effects of their ads on the brand awareness and the user behavior that leads to the conversion, thus creating a need for the reporting measures that can capture both the duration and the frequency of the pathways to user conversions. In this paper, we propose an alternative data mining framework for analyzing user-level advertising data. In the aggregation step, we compress individual user histories into a graph structure, called the adgraph, representing local correlations between ad events. For the reporting step, we introduce several scoring rules, called the adfactors (AF), that can capture global role of ads and ad paths in the adgraph, in particular, the structural correlation between an ad impression and the user conversion. We present scalable local algorithms for computing the adfactors; all algorithms were implemented using the MapReduce programming model and the Pregel framework. Using an anonymous user-level dataset of sponsored search campaigns for eight different advertisers, we evaluate our framework with different adgraphs and adfactors in terms of their statistical fit to the data, and show its value for mining the long-term behavioral patterns in the advertising data. Keywords: ad auctions, clickthrough rate, conversion rate, online advertising,
pagerank, sponsored search, user behavior models | |||
| Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data | | BIBAK | Full-Text | 41-50 | |
| Medha Atre; Vineet Chaoji; Mohammed J. Zaki; James A. Hendler | |||
| The Semantic Web community, until now, has used traditional database systems
for the storage and querying of RDF data. The SPARQL query language also
closely follows SQL syntax. As a natural consequence, most of the SPARQL query
processing techniques are based on database query processing and optimization
techniques. For SPARQL join query optimization, previous works like RDF-3X and
Hexastore have proposed to use 6-way indexes on the RDF data. Although these
indexes speed up merge-joins by orders of magnitude, for complex join queries
generating large intermediate join results, the scalability of the query
processor still remains a challenge.
In this paper, we introduce (i) BitMat -- a compressed bit-matrix structure for storing huge RDF graphs, and (ii) a novel, light-weight SPARQL join query processing method that employs an initial pruning technique, followed by a variable-binding-matching algorithm on BitMats to produce the final results. Our query processing method does not build intermediate join tables and works directly on the compressed data. We have demonstrated our method against RDF graphs of up to 1.33 billion triples -- the largest among results published until now (single-node, non-parallel systems), and have compared our method with the state-of-the-art RDF stores -- RDF-3X and MonetDB. Our results show that the competing methods are most effective with highly selective queries. On the other hand, BitMat can deliver 2-3 orders of magnitude better performance on complex, low-selectivity queries over massive data. Keywords: data compression, query algorithm, RDF store | |||
| A comparison of visual and textual page previews in judging the helpfulness of web pages | | BIBAK | Full-Text | 51-60 | |
| Anne Aula; Rehan M. Khan; Zhiwei Guan; Paul Fontes; Peter Hong | |||
| We investigated the efficacy of visual and textual web page previews in
predicting the helpfulness of web pages related to a specific topic. We ran two
studies in the usability lab and collected data through an online survey.
Participants (total of 245) were asked to rate the expected helpfulness of a
web page based on a preview (four different thumbnail variations, a textual web
page summary, a thumbnail/title/URL combination, a title/URL combination). In
the lab studies, the same participants also rated the helpfulness of the actual
web pages themselves. In the online study, the web page ratings were collected
from a separate group of participants. Our results show that thumbnails add
information about the relevance of web pages that is not available in the
textual summaries of web pages (title, snippet & URL). However, showing
only thumbnails, with no textual information, results in poorer performance
than showing only textual summaries. The prediction inaccuracy caused by
textual vs. visual previews was different: textual previews tended to make
users overestimate the helpfulness of web pages, whereas thumbnails made users
underestimate the helpfulness of web pages in most cases. In our study, the
best performance was obtained by combining sufficiently large thumbnails (at
least 200x200 pixels) with page titles and URLs -- and it was better to make
users focus primarily on the thumbnail by placing the title and URL below the
thumbnail. Our studies highlighted four key aspects that affect the performance
of previews: the visual/textual mode of the previews, the zoom level and size
of the thumbnail, as well as the positioning of key information elements. Keywords: snippets, thumbnails, user study, web page previews | |||
| Find me if you can: improving geographical prediction with social and spatial proximity | | BIBAK | Full-Text | 61-70 | |
| Lars Backstrom; Eric Sun; Cameron Marlow | |||
| Geography and social relationships are inextricably intertwined; the people
we interact with on a daily basis almost always live near us. As people spend
more time online, data regarding these two dimensions -- geography and social
relationships -- are becoming increasingly precise, allowing us to build
reliable models to describe their interaction. These models have important
implications in the design of location-based services, security intrusion
detection, and social media supporting local communities.
Using user-supplied address data and the network of associations between members of the Facebook social network, we can directly observe and measure the relationship between geography and friendship. Using these measurements, we introduce an algorithm that predicts the location of an individual from a sparse set of located users with performance that exceeds IP-based geolocation. This algorithm is efficient and scalable, and could be run on a network containing hundreds of millions of users. Keywords: geolocation, propagation, social networks | |||
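The toy sketch below illustrates the general idea of placing an unlocated user where their located friends make that placement most likely, using an assumed distance-decay friendship probability. The decay constants, candidate locations, and coordinates are illustrative stand-ins, not the parameters or algorithm estimated in the paper.

```python
import math

def prob_friend(dist_km, a=0.02, b=0.3, c=1.2):
    """Toy distance-decay probability of friendship; constants are illustrative."""
    return a / (dist_km + b) ** c

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def predict_location(friend_locations, candidates):
    """Pick the candidate point that maximizes the log-likelihood of
    observing friendships with the given located friends."""
    def score(c):
        return sum(math.log(prob_friend(haversine(c, f))) for f in friend_locations)
    return max(candidates, key=score)

friends = [(37.77, -122.42), (37.80, -122.27), (40.71, -74.01)]    # two Bay Area, one NYC
candidates = [(37.77, -122.42), (40.71, -74.01), (41.88, -87.63)]  # SF, NYC, Chicago
print(predict_location(friends, candidates))                       # -> the Bay Area point
```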
| AdHeat: an influence-based diffusion model for propagating hints to match ads | | BIBAK | Full-Text | 71-80 | |
| Hongji Bao; Edward Y. Chang | |||
| In this paper, we present AdHeat, a social ad model considering user
influence in addition to relevance for matching ads. Traditionally, ad
placement employs the relevance model. Such a model matches ads with Web page
content, user interests, or both. We have observed, however, on social networks
that the relevance model suffers from two shortcomings. First, influential
users (users who contribute opinions) seldom click ads that are highly relevant
to their expertise. Second, because influential users' contents and activities
are attractive to other users, hint words summarizing their expertise and
activities may be widely preferred. Therefore, we propose AdHeat, which
diffuses hint words of influential users to others and then matches ads for
each user with aggregated hints. We performed experiments on a large online
Q&A community with half a million users. The experimental results show that
AdHeat outperforms the relevance model on CTR (click through rate) by
significant margins. Keywords: behavior targeting, contextual advertising, heat diffusion, influence
propagation, online advertising, social network analysis | |||
| Dynamic and graphical web page breakpoints | | BIBAK | Full-Text | 81-90 | |
| John J. Barton; Jan Odvarko | |||
| Breakpoints are perhaps the quintessential feature of a debugger: they
allow a developer to stop time and study the program state. Breakpoints are
typically specified by selecting a line of source code. For large, complex, web
pages with multiple developers, the relevant source line for a given user
interface problem may not be known to the developer. In this paper we describe
the implementation of breakpoints in dynamically created source, and on error
messages, network events, DOM mutation, DOM object property changes, and CSS
style rule updates. Adding these domain-specific breakpoints to a
general-purpose debugger for JavaScript allows the developer to initiate the
debugging process via Web page abstractions rather than lower level source code
views. The breakpoints are implemented in the open source Firebug project,
version 1.5, for the Firefox Web browser. Keywords: breakpoints, css, debugging, dynamic, firebug, html, javascript, web | |||
| Regular expressions considered harmful in client-side XSS filters | | BIBAK | Full-Text | 91-100 | |
| Daniel Bates; Adam Barth; Collin Jackson | |||
| Cross-site scripting flaws have now surpassed buffer overflows as the
world's most common publicly-reported security vulnerability. In recent years,
browser vendors and researchers have tried to develop client-side filters to
mitigate these attacks. We analyze the best existing filters and find them to
be either unacceptably slow or easily circumvented. Worse, some of these
filters could introduce vulnerabilities into sites that were previously
bug-free. We propose a new filter design that achieves both high performance
and high precision by blocking scripts after HTML parsing but before execution.
Compared to previous approaches, our approach is faster, protects against more
vulnerabilities, and is harder for attackers to abuse. We have contributed an
implementation of our filter design to the WebKit open source rendering engine,
and the filter is now enabled by default in the Google Chrome browser. Keywords: XSS, browser, cross-site scripting, filter, web | |||
| The anatomy of an ad: structured indexing and retrieval for sponsored search | | BIBAK | Full-Text | 101-110 | |
| Michael Bendersky; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler | |||
| The core task of sponsored search is to retrieve relevant ads for the user's
query. Ads can be retrieved either by exact match, when their bid term is
identical to the query, or by advanced match, which indexes ads as documents
and is similar to standard information retrieval (IR). Recently, there has been
a great deal of research into developing advanced match ranking algorithms.
However, no previous research has addressed the ad indexing problem. Unlike
most traditional search problems, the ad corpus is defined hierarchically in
terms of advertiser accounts, campaigns, and ad groups, which further consist
of creatives and bid terms. This hierarchical structure makes indexing highly
non-trivial, as naively indexing all possible "displayable" ads leads to a
prohibitively large and ineffective index. We show that ad retrieval using such
an index is not only slow, but its precision is suboptimal as well. We
investigate various strategies for compact, hierarchy-aware indexing of
sponsored search ads through adaptation of standard IR indexing techniques. We
also propose a new ad retrieval method that yields more relevant ads by
exploiting the structured nature of the ad corpus. Experiments carried out over
a large ad test collection from a commercial search engine show that our
proposed methods are highly effective and efficient compared to more standard
indexing and retrieval approaches. Keywords: sponsored search, structured retrieval | |||
| Classification-enhanced ranking | | BIBAK | Full-Text | 111-120 | |
| Paul N. Bennett; Krysta Svore; Susan T. Dumais | |||
| Many have speculated that classifying web pages can improve a search
engine's ranking of results. Intuitively results should be more relevant when
they match the class of a query. We present a simple framework for
classification-enhanced ranking that uses clicks in combination with the
classification of web pages to derive a class distribution for the query. We
then go on to define a variety of features that capture the match between the
class distributions of a web page and a query, the ambiguity of a query, and
the coverage of a retrieved result relative to a query's set of classes.
Experimental results demonstrate that a ranker learned with these features
significantly improves ranking over a competitive baseline. Furthermore, our
methodology is agnostic with respect to the classification space and can be
used to derive query classes for a variety of different taxonomies. Keywords: learning to rank, query classification | |||
| Sync kit: a persistent client-side database caching toolkit for data intensive websites | | BIBAK | Full-Text | 121-130 | |
| Edward Benson; Adam Marcus; David Karger; Samuel Madden | |||
| We introduce a client-server toolkit called Sync Kit that demonstrates how
client-side database storage can improve the performance of data intensive
websites. Sync Kit is designed to make use of the embedded relational database
defined in the upcoming HTML5 standard to offload some data storage and
processing from a web server onto the web browsers to which it serves content.
Our toolkit provides various strategies for synchronizing relational database
tables between the browser and the web server, along with a client-side
template library so that portions of web applications may be executed client-side.
Unlike prior work in this area, Sync Kit persists both templates and data in
the browser across web sessions, increasing the number of concurrent
connections a server can handle by up to a factor of four versus that of a
traditional server-only web stack and a factor of three versus a recent
template caching approach. Keywords: cache, client-side, data base, database, performance, web | |||
| Ranking specialization for web search: a divide-and-conquer approach by using topical RankSVM | | BIBAK | Full-Text | 131-140 | |
| Jiang Bian; Xin Li; Fan Li; Zhaohui Zheng; Hongyuan Zha | |||
| Many ranking algorithms applying machine learning techniques have been
proposed in information retrieval and Web search. However, most existing
approaches do not explicitly take into account the fact that queries vary
significantly in terms of ranking and entail different treatments regarding the
ranking models. In this paper, we apply a divide-and-conquer framework for
ranking specialization, i.e. learning multiple ranking models by addressing
query difference. We first generate query representation by aggregating ranking
features through pseudo feedbacks, and employ unsupervised clustering methods
to identify a set of ranking-sensitive query topics based on training queries.
To learn multiple ranking models for respective ranking-sensitive query topics,
we define a global loss function by combining the ranking risks of all query
topics, and we propose a unified SVM-based learning process to minimize the
global loss. Moreover, we employ an ensemble approach to generate the ranking
result for each test query by applying a set of ranking models of the most
appropriate query topics. We conduct experiments using a benchmark dataset for
learning ranking functions as well as a dataset from a commercial search
engine. Experimental results show that our proposed approach can significantly
improve the ranking performance over existing single-model approaches as well
as straightforward local ranking approaches, and the automatically identified
ranking-sensitive topics are more useful for enhancing ranking performance than
pre-defined query categorization. Keywords: ranking specialization for web search, ranking-sensitive query topic,
topical RankSVM | |||
| Automated performance assessment for service-oriented middleware: a case study on BPEL engines | | BIBAK | Full-Text | 141-150 | |
| Domenico Bianculli; Walter Binder; Mauro Luigi Drago | |||
| Middleware for Web service compositions, such as BPEL engines, provides the
execution environment for services as well as additional functionalities, such
as monitoring and self-tuning. Given its role in service provisioning, it is
very important to assess the performance of middleware in the context of a SOA.
This paper presents SOABench, a framework for the automatic generation and
execution of testbeds for benchmarking middleware for composite Web services
and for assessing the performance of existing SOA infrastructures. SOABench
defines a testbed model characterized by the composite services to execute, the
workload to generate, the deployment configuration to use, the performance
metrics to gather, the data analyses to perform on them, and the reports to
produce. We have validated SOABench by benchmarking the performance of
different BPEL engines. Keywords: experiment automation, middleware, performance assessment, testbed
generation, web service composition | |||
| Relational duality: unsupervised extraction of semantic relations between entities on the web | | BIBAK | Full-Text | 151-160 | |
| Danushka Tarupathi Bollegala; Yutaka Matsuo; Mitsuru Ishizuka | |||
| Extracting semantic relations among entities is an important first step in
various tasks in Web mining and natural language processing such as information
extraction, relation detection, and social network mining. A relation can be
expressed extensionally by stating all the instances of that relation or
intensionally by defining all the paraphrases of that relation. For example,
consider the ACQUISITION relation between two companies. An extensional
definition of ACQUISITION contains all pairs of companies in which one company
is acquired by another (e.g. (YouTube, Google) or (Powerset, Microsoft)). On
the other hand we can intensionally define ACQUISITION as the relation
described by lexical patterns such as X is acquired by Y, or Y purchased X,
where X and Y denote two companies. We use this dual representation of semantic
relations to propose a novel sequential co-clustering algorithm that can
extract numerous relations efficiently from unlabeled data. We provide an
efficient heuristic to find the parameters of the proposed co-clustering
algorithm. Using the clusters produced by the algorithm, we train an L1
regularized logistic regression model to identify the representative patterns
that describe the relation expressed by each cluster. We evaluate the proposed
method in three different tasks: measuring relational similarity between entity
pairs, open information extraction (Open IE), and classifying relations in a
social network system. Experiments conducted using a benchmark dataset show
that the proposed method improves existing relational similarity measures.
Moreover, the proposed method significantly outperforms the current
state-of-the-art Open IE systems in terms of both precision and recall. The
proposed method correctly classifies 53 relation types in an online social
network containing 470,671 nodes and 35,652,475 edges, thereby demonstrating
its efficacy in real-world relation detection tasks. Keywords: relation extraction, relational duality, relational similarity, web mining | |||
| Liquid query: multi-domain exploratory search on the web | | BIBAK | Full-Text | 161-170 | |
| Alessandro Bozzon; Marco Brambilla; Stefano Ceri; Piero Fraternali | |||
| In this paper we propose the Liquid Query paradigm, to support users in
finding responses to multi-domain queries through exploratory information
seeking across structured information sources (Web documents, deep Web data,
and personal data repositories), wrapped by means of a uniform notion of search
service. Liquid Query aims at filling the gap between general-purpose search
engines, which are unable to find information spanning multiple topics, and
domain-specific search systems, which cannot go beyond their domain limits. The
Liquid Query interface consists of interaction primitives that let users pose
questions and explore results spanning over multiple sources incrementally,
thus getting closer and closer to the sought information. We demonstrate our
approach with a prototype built upon the YQL (Yahoo! Query Language) framework. Keywords: exploratory search, federated search engine, multi-domain search, search
computing, search engine, search service, web | |||
| Graph-based concept identification and disambiguation for enterprise search | | BIBAK | Full-Text | 171-180 | |
| Falk Brauer; Michael Huber; Gregor Hackenbroich; Ulf Leser; Felix Naumann; Wojciech M. Barczynski | |||
| Enterprise Search (ES) differs from traditional IR for a number of reasons;
the most important are the high level of ambiguity of terms in queries and
documents, and the existence of graph-structured enterprise data (ontologies)
that describes the concepts of interest and their relationships to each other.
Our method identifies concepts from the enterprise ontology in the query and corpus. We propose a ranking scheme for ontology sub-graphs on top of approximately matched token q-grams. The ranking leverages the graph-structure of the ontology to incorporate concepts that are not explicitly mentioned. It improves previous solutions by using a fine-grained ranking function that is specifically designed to cope with high levels of ambiguity. This method is able to capture much more of the semantics of queries and documents than previous techniques. We support this claim with an evaluation of our method in three real-life scenarios from two different domains, in which it consistently proved superior in terms of both precision and recall. Keywords: enterprise search, graph-based disambiguation, information extraction | |||
| A refreshing perspective of search engine caching | | BIBAK | Full-Text | 181-190 | |
| Berkant Barla Cambazoglu; Flavio P. Junqueira; Vassilis Plachouras; Scott Banachowski; Baoqiu Cui; Swee Lim; Bill Bridge | |||
| Commercial Web search engines have to process user queries over huge Web
indexes under tight latency constraints. In practice, to achieve low latency,
large result caches are employed and a portion of the query traffic is served
using previously computed results. Moreover, search engines need to update
their indexes frequently to incorporate changes to the Web. After every index
update, however, the content of cache entries may become stale, thus decreasing
the freshness of served results. In this work, we first argue that the real
problem in today's caching for large-scale search engines is not eviction
policies, but the ability to cope with changes to the index, i.e., cache
freshness. We then introduce a novel algorithm that uses a time-to-live value
to set cache entries to expire and selectively refreshes cached results by
issuing refresh queries to back-end search clusters. The algorithm prioritizes
the entries to refresh according to a heuristic that combines the frequency of
access with the age of an entry in the cache. In addition, for setting the rate
at which refresh queries are issued, we present a mechanism that takes into
account idle cycles of back-end servers. Evaluation using a real workload shows
that our algorithm can achieve hit rate improvements as well as reduction in
average hit ages. An implementation of this algorithm is currently in
production use at Yahoo!. Keywords: caching, index updates, refresh, web search engines | |||
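A hedged sketch of the kind of refresh prioritization described above: once a cached entry's time-to-live has expired, it competes for refresh slots according to a score combining access frequency with age. The TTL value and the exact scoring formula below are illustrative assumptions, not the production heuristic.

```python
import heapq
import time

TTL_SECONDS = 3600          # illustrative time-to-live for cached results

class CacheEntry:
    def __init__(self, query, results, now):
        self.query, self.results = query, results
        self.cached_at, self.hits = now, 0

def refresh_candidates(cache, now, budget):
    """Return up to `budget` expired entries, most valuable to refresh first.
    Score = access frequency weighted by staleness (an illustrative choice)."""
    expired = [e for e in cache.values() if now - e.cached_at > TTL_SECONDS]
    scored = [((e.hits + 1) * (now - e.cached_at), e) for e in expired]
    return [e for _, e in heapq.nlargest(budget, scored, key=lambda x: x[0])]

# Usage sketch: issue refresh queries to back-end clusters only when idle.
# for entry in refresh_candidates(cache, time.time(), idle_capacity):
#     entry.results, entry.cached_at = backend_search(entry.query), time.time()
```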
| Automated object persistence for JavaScript | | BIBAK | Full-Text | 191-200 | |
| Brett Cannon; Eric Wohlstadter | |||
| Traditionally web applications have required an internet connection in order
to work with data. Browsers have lacked any mechanisms to allow web
applications to operate offline with a set of data to provide constant access
to applications. Recently, through browser plug-ins such as Google Gears,
browsers have gained the ability to persist data for offline use. However,
until now it's been difficult for a web developer using these plug-ins to
manage persisting data both locally for offline use and in the internet cloud
due to: synchronization requirements, managing throughput and latency to the
cloud, and making it work within the confines of a standards-compliant web
browser. Historically in non-browser environments, programming language
environments have offered automated object persistence to shield the developer
from these complexities. In our research we have created a framework which
introduces automated persistence of data objects for JavaScript utilizing the
internet. Unlike traditional object persistence solutions, ours relies only on
existing or forthcoming internet standards and does not rely upon specific
runtime mechanisms such as OS or interpreter/compiler support. A new design was
required in order to be suitable to the internet's unique characteristics of
varying connection quality and a browser's specific restrictions. We validate
our approach using benchmarks which show that our framework can handle
thousands of data objects automatically, reducing the amount of work needed by
developers to support offline Web applications. Keywords: JavaScript, html5, json, object persistence, web storage | |||
| A generalized framework of exploring category information for question retrieval in community question answer archives | | BIBAK | Full-Text | 201-210 | |
| Xin Cao; Gao Cong; Bin Cui; Christian S. Jensen | |||
| Community Question Answering (CQA) has emerged as a popular type of service
where users ask and answer questions and access historical question-answer
pairs. CQA archives contain very large volumes of questions organized into a
hierarchy of categories. As an essential function of CQA services, question
retrieval in a CQA archive aims to retrieve historical question-answer pairs
that are relevant to a query question. In this paper, we present a new approach
to exploiting category information of questions for improving the performance
of question retrieval, and we apply the approach to existing question retrieval
models, including a state-of-the-art question retrieval model. Experiments
conducted on real CQA data demonstrate that the proposed techniques are capable
of outperforming a variety of baseline methods significantly. Keywords: categorization, question retrieval, question-answering services | |||
| The paths more taken: matching DOM trees to search logs for accurate webpage clustering | | BIBAK | Full-Text | 211-220 | |
| Deepayan Chakrabarti; Rupesh Mehta | |||
| An unsupervised clustering of the webpages on a website is a primary
requirement for most wrapper induction and automated data extraction methods.
Since page content can vary drastically across pages of one cluster (e.g., all
product pages on amazon.com), traditional clustering methods typically use some
distance function between the DOM trees representing a pair of webpages.
However, without knowing which portions of the DOM tree are "important," such
distance functions might discriminate between similar pages based on trivial
features (e.g., differing number of reviews on two product pages), or club
together distinct types of pages based on superficial features present in the
DOM trees of both (e.g., matching footer/copyright), leading to poor clustering
performance.
We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of the entire website, whereby search data for popular pages can be used to infer good paths even for other pages that receive little or no search traffic. The webpages on the website are then clustered using these "key" paths. Our algorithm only requires information on search queries, and the webpages clicked in response to them; there is no need for human input, and it does not need to be told which portion of a webpage the user found interesting. The resulting clusterings achieve an adjusted RAND score of over 0.9 on half of the websites (a score of 1 indicating a perfect clustering), and 59% better scores on average than competing algorithms. Besides leading to refined clusterings, these key paths can be useful in the wrapper induction process itself, as shown by the high degree of match between the key paths and the manually identified paths used in existing wrappers for these sites (90% average precision). Keywords: clustering, search logs, wrapper induction | |||
| Actively predicting diverse search intent from user browsing behaviors | | BIBAK | Full-Text | 221-230 | |
| Zhicong Cheng; Bin Gao; Tie-Yan Liu | |||
| This paper is concerned with actively predicting search intent from user
browsing behavior data. In recent years, great attention has been paid to
predicting user search intent. However, the prediction was mostly passive
because it was performed only after users submitted their queries to search
engines. Little attention was paid to why users issued these queries, or to
what triggered their information needs. According to our study, many information
needs of users were actually triggered by what they have browsed. That is,
after reading a page, if a user found something interesting or unclear, he/she
might have the intent to obtain further information and accordingly formulate a
search query. Actively predicting such search intent can benefit both search
engines and their users. In this paper, we propose a series of technologies to
fulfill this task. First, we extract all the queries that users issued after
reading a given page from user browsing behavior data. Second, we learn a model
to effectively rank these queries according to their likelihoods of being
triggered by the page. Third, since search intents can be quite diverse even if
triggered by the same page, we propose an optimization algorithm to diversify
the ranked list of queries obtained in the second step, and then suggest the
list to users. We have tested our approach on large-scale user browsing
behavior data obtained from a commercial search engine. The experimental
results have shown that our approach can predict meaningful queries for a given
page, and the search performance for these queries can be significantly
improved by using the triggering page as contextual information. Keywords: contextual information, diversification, log mining, search intent,
search-trigger | |||
| Max-cover in map-reduce | | BIBAK | Full-Text | 231-240 | |
| Flavio Chierichetti; Ravi Kumar; Andrew Tomkins | |||
| The NP-hard Max-k-cover problem requires selecting k sets from a collection
so as to maximize the size of the union. This classic problem occurs commonly
in many settings in web search and advertising. For moderately-sized instances,
a greedy algorithm gives an approximation of (1-1/e). However, the greedy
algorithm requires updating scores of arbitrary elements after each step, and
hence becomes intractable for large datasets.
We give the first max cover algorithm designed for today's large-scale commodity clusters. Our algorithm has provably almost the same approximation as greedy, but runs much faster. Furthermore, it can be easily expressed in the MapReduce programming paradigm, and requires only polylogarithmically many passes over the data. Our experiments on five large problem instances show that our algorithm is practical and can achieve good speedups compared to the sequential greedy algorithm. Keywords: greedy algorithm, map-reduce, maximum cover | |||
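For reference, the sequential greedy baseline mentioned above fits in a few lines: repeatedly pick the set covering the most still-uncovered elements, which yields the (1-1/e) approximation. The sketch below is that classic greedy algorithm on a toy instance, not the paper's MapReduce formulation.

```python
def greedy_max_cover(sets, k):
    """Classic greedy Max-k-cover: pick k sets maximizing the covered union."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(sets[s] - covered), default=None)
        if best is None or not (sets[best] - covered):
            break                       # nothing new can be covered
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

collection = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 7}}
print(greedy_max_cover(collection, k=2))   # picks C then A, covering 7 elements
```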
| Stochastic models for tabbed browsing | | BIBAK | Full-Text | 241-250 | |
| Flavio Chierichetti; Ravi Kumar; Andrew Tomkins | |||
| We present a model of tabbed browsing that represents a hybrid between a
Markov process capturing the graph of hyperlinks, and a branching process
capturing the birth and death of tabs. We present a mathematical criterion to
characterize whether the process has a steady state independent of initial
conditions, and we show how to characterize the limiting behavior in both
cases. We perform a series of experiments to compare our tabbed browsing model
with pagerank, and show that tabbed browsing is able to explain 15-25% of the
deviation between actual measured browsing behavior and the behavior predicted
by the simple pagerank model. We find this to be a surprising result, as the
tabbed browsing model does not make use of any notion of site popularity, but
simply captures deviations in user likelihood to open and close tabs from a
particular node in the graph. Keywords: branching process, convergence, random walks, stationary distribution,
tabbed browsing | |||
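The hybrid model described above can be imitated with a toy simulation: a surfer follows a Markov chain over pages, and each step may open a new tab (branching) or close the current one (death). The link graph and the open/close probabilities below are invented values for illustration, not the model fitted in the paper.

```python
import random

# Toy hyperlink graph as a Markov chain: page -> list of linked pages.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
P_OPEN, P_CLOSE = 0.15, 0.10        # illustrative tab birth/death probabilities

def simulate(steps, seed=0):
    random.seed(seed)
    tabs, visits = ["a"], {p: 0 for p in links}
    for _ in range(steps):
        if not tabs:
            tabs = ["a"]                        # restart if every tab was closed
        i = random.randrange(len(tabs))         # pick an open tab
        page = random.choice(links[tabs[i]])    # follow a hyperlink
        visits[page] += 1
        tabs[i] = page
        if random.random() < P_OPEN:
            tabs.append(page)                   # open the link in a new tab
        if len(tabs) > 1 and random.random() < P_CLOSE:
            tabs.pop(i)                         # close the tab we were on
    total = sum(visits.values())
    return {p: v / total for p, v in visits.items()}

print(simulate(100_000))    # empirical visit distribution under tabbed browsing
```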
| Using landing pages for sponsored search ad selection | | BIBAK | Full-Text | 251-260 | |
| Yejin Choi; Marcus Fontoura; Evgeniy Gabrilovich; Vanja Josifovski; Mauricio Mediano; Bo Pang | |||
| We explore the use of the landing page content in sponsored search ad
selection. Specifically, we compare the use of the ad's intrinsic content to
augmenting the ad with the whole, or parts, of the landing page. We explore two
types of extractive summarization techniques to select useful regions from the
landing pages: out-of-context and in-context methods. Out-of-context methods
select salient regions from the landing page by analyzing the content alone,
without taking into account the ad associated with the landing page. In-context
methods use the ad context (including its title, creative, and bid phrases) to
help identify regions of the landing page that should be used by the ad
selection engine. In addition, we introduce a simple yet effective unsupervised
algorithm to enrich the ad context to further improve the ad selection.
Experimental evaluation confirms that the use of landing pages can
significantly improve the quality of ad selection. We also find that our
extractive summarization techniques reduce the size of landing pages
substantially, while retaining or even improving the performance of ad
retrieval over the method that utilizes the entire landing page. Keywords: compositional semantics, extractive summarization, landing pages, sponsored
search | |||
| The "Map Trap"?: an evaluation of map versus text-based interfaces for location-based mobile search services | | BIBAK | Full-Text | 261-270 | |
| Karen Church; Joachim Neumann; Mauro Cherubini; Nuria Oliver | |||
| As the mobile Internet continues to grow, there is an increasing need to
provide users with effective search and information access services. In order
to build more effective mobile search services, we must first understand the
impact that various interface choices have on mobile users. For example, the
majority of mobile location-based search services are built on top of a map
visualization, but is this intuitive design-decision the optimal interface
choice from a human centric perspective? In order to tackle this fundamental
design question, we have developed two proactive mobile search interfaces (one
map-based and the other text-based) that utilize key mobile contexts to improve
the search and information discovery experience of mobile users. In this paper,
we present the results of an exploratory field study of these two interfaces --
involving 34 users over a 1 month period -- where we focus in particular on the
impact that the type of user interface (e.g. map vs text) has on the search and
information discovery experience of mobile users. We highlight the main usage
results -- including that maps are not the interface of choice for certain
information access tasks -- and outline key implications for the design of next
generation mobile search services. Keywords: location-based services, map-based interfaces, mobile interfaces, mobile
internet, mobile search, mobile web, text-based interfaces, user study | |||
| Malicious interface design: exploiting the user | | BIBAK | Full-Text | 271-280 | |
| Gregory Conti; Edward Sobiesk | |||
| In an ideal world, interface design is the art and science of helping users
accomplish tasks in a timely, efficient, and pleasurable manner. This paper
studies the inverse situation, the vast emergence of deliberately constructed
malicious interfaces that violate design best practices in order to accomplish
goals counter to those of the user. This has become a commonplace occurrence
both on and off the desktop, particularly on the web. A primary objective of
this paper is to formally define this problem, including construction of a
taxonomy of malicious interface techniques and a preliminary analysis of their
impact on users. Findings are presented that gauge the self-reported tolerance
and expectation levels of users with regard to malicious interfaces as well as
the effectiveness and ease of use of existing countermeasures. A second
objective of this paper is to increase awareness, dialogue, and research in a
domain that we consider largely unexplored but critical to future usability of
the WWW. Our results were accomplished through significant compilation of
malicious interface techniques based on review of thousands of web sites and by
conducting three surveys. Ultimately, this paper concludes that malicious
interfaces are a ubiquitous problem that demands intervention by the security
and human computer interaction communities in order to reduce the negative
impact on the global user population. Keywords: adversarial interface design, design principles, evil interfaces, malicious
interfaces, web usability guidelines | |||
| Detection and analysis of drive-by-download attacks and malicious JavaScript code | | BIBAK | Full-Text | 281-290 | |
| Marco Cova; Christopher Kruegel; Giovanni Vigna | |||
| JavaScript is a browser scripting language that allows developers to create
sophisticated client-side interfaces for web applications. However, JavaScript
code is also used to carry out attacks against the user's browser and its
extensions. These attacks usually result in the download of additional malware
that takes complete control of the victim's platform, and are, therefore,
called "drive-by downloads." Unfortunately, the dynamic nature of the
JavaScript language and its tight integration with the browser make it
difficult to detect and block malicious JavaScript code.
This paper presents a novel approach to the detection and analysis of malicious JavaScript code. Our approach combines anomaly detection with emulation to automatically identify malicious JavaScript code and to support its analysis. We developed a system that uses a number of features and machine-learning techniques to establish the characteristics of normal JavaScript code. Then, during detection, the system is able to identify anomalous JavaScript code by emulating its behavior and comparing it to the established profiles. In addition to identifying malicious code, the system is able to support the analysis of obfuscated code and to generate detection signatures for signature-based systems. The system has been made publicly available and has been used by thousands of analysts. Keywords: anomaly detection, drive-by-download attacks, web client exploits | |||
| Competing for users' attention: on the interplay between organic and sponsored search results | | BIBAK | Full-Text | 291-300 | |
| Cristian Danescu-Niculescu-Mizil; Andrei Z. Broder; Evgeniy Gabrilovich; Vanja Josifovski; Bo Pang | |||
| Queries on major Web search engines produce complex result pages, primarily
composed of two types of information: organic results, that is, short
descriptions and links to relevant Web pages, and sponsored search results, the
small textual advertisements often displayed above or to the right of the
organic results. Strategies for optimizing each type of result in isolation and
the consequent user reaction have been extensively studied; however, the
interplay between these two complementary sources of information has been
ignored, a situation we aim to change. Our findings indicate that their
perceived relative usefulness (as evidenced by user clicks) depends on the
nature of the query. Specifically, we found that, when both sources focus on
the same intent, for navigational queries there is a clear competition between
ads and organic results, while for non-navigational queries this competition
turns into synergy.
We also investigate the relationship between the perceived usefulness of the ads and their textual similarity to the organic results, and propose a model that formalizes this relationship. To this end, we introduce the notion of responsive ads, which directly address the user's information need, and incidental ads, which are only tangentially related to that need. Our findings support the hypothesis that in the case of navigational queries, which are usually fully satisfied by the top organic result, incidental ads are perceived as more valuable than responsive ads, which are likely to be duplicative. On the other hand, in the case of non-navigational queries, incidental ads are perceived as less beneficial, possibly because they diverge too far from the actual user need. We hope that our findings and further research in this area will allow search engines to tune ad selection for an increased synergy between organic and sponsored results, leading to both higher user satisfaction and better monetization. Keywords: ad selection, incidental ads, navigational queries, responsive ads,
sponsored search, textual similarity, usefulness of ads | |||
| Inferring relevant social networks from interpersonal communication | | BIBAK | Full-Text | 301-310 | |
| Munmun De Choudhury; Winter A. Mason; Jake M. Hofman; Duncan J. Watts | |||
| Researchers increasingly use electronic communication data to construct and
study large social networks, effectively inferring unobserved ties (e.g. i is
connected to j) from observed communication events (e.g. i emails j). Often
overlooked, however, is the impact of tie definition on the corresponding
network, and in turn the relevance of the inferred network to the research
question of interest. Here we study the problem of network inference and
relevance for two email data sets of different size and origin. In each case,
we generate a family of networks parameterized by a threshold condition on the
frequency of emails exchanged between pairs of individuals. After demonstrating
that different choices of the threshold correspond to dramatically different
network structures, we then formulate the relevance of these networks in terms
of a series of prediction tasks that depend on various network features. In
general, we find: a) that prediction accuracy is maximized over a non-trivial
range of thresholds corresponding to 5-10 reciprocated emails per year; b) that
for any prediction task, choosing the optimal value of the threshold yields a
sizable (~30%) boost in accuracy over naive choices; and c) that the optimal
threshold value appears to be (somewhat surprisingly) consistent across data
sets and prediction tasks. We emphasize the practical utility in defining ties
via their relevance to the prediction task(s) at hand and discuss implications
of our empirical results. Keywords: communication networks, email, learning, network structure, network
thresholds, social network analysis, social networks, ties | |||
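The thresholded family of networks described above can be built with a simple rule: keep a tie only if each direction of the email exchange meets a minimum count. The sketch below shows that construction on toy data; the prediction tasks used in the paper to choose the optimal threshold are not modeled.

```python
from collections import Counter
from itertools import combinations

# Toy log of (sender, recipient) email events over one year.
events = [("i", "j")] * 7 + [("j", "i")] * 6 + [("i", "k")] * 12 + [("k", "i")] * 1

def thresholded_network(events, tau):
    """Keep an undirected tie (u, v) only if each direction has >= tau emails."""
    counts = Counter(events)
    people = {p for e in events for p in e}
    return {frozenset((u, v)) for u, v in combinations(sorted(people), 2)
            if counts[(u, v)] >= tau and counts[(v, u)] >= tau}

for tau in (1, 5, 10):
    ties = thresholded_network(events, tau)
    print(tau, [tuple(sorted(t)) for t in ties])
# tau=1 keeps both ties, tau=5 keeps only (i, j), tau=10 keeps none.
```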
| Scalable techniques for document identifier assignment in inverted indexes | | BIBAK | Full-Text | 311-320 | |
| Shuai Ding; Josh Attenberg; Torsten Suel | |||
| Web search engines depend on the full-text inverted index data structure.
Because the query processing performance is so dependent on the size of the
inverted index, a plethora of research has focused on fast and effective
techniques for compressing this structure. Recently, several authors have
proposed techniques for improving index compression by optimizing the
assignment of document identifiers to the documents in the collection, leading
to significant reduction in overall index size.
In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size. Keywords: documentID reassignment, index compression | |||
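Why document identifier assignment affects index size can be seen from the d-gaps of a posting list: clustering similar documents under nearby identifiers yields small gaps that compress into fewer bytes. The sketch below only demonstrates that effect with variable-byte coding on invented identifiers; it is not the paper's LSH/TSP assignment algorithm.

```python
def vbyte_len(n):
    """Number of bytes needed to variable-byte encode a positive integer."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size

def posting_list_bytes(doc_ids):
    """Encode a sorted posting list as d-gaps and return its vbyte size."""
    ids = sorted(doc_ids)
    gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return sum(vbyte_len(g) for g in gaps)

# The same 4 postings under two docID assignments:
clustered = [1000, 1001, 1003, 1006]      # similar docs got consecutive IDs
scattered = [1000, 250000, 510000, 990000]
print(posting_list_bytes(clustered), posting_list_bytes(scattered))
# -> 5 vs 11 bytes: smaller gaps after reordering mean a smaller index.
```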
| Do you want to take notes?: identifying research missions in Yahoo! search pad | | BIBAK | Full-Text | 321-330 | |
| Debora Donato; Francesco Bonchi; Tom Chi; Yoelle Maarek | |||
| Addressing user's information needs has been one of the main goals of Web
search engines since their early days. In some cases, users cannot see their
needs immediately answered by search results, simply because these needs are
too complex and involve multiple aspects that are not covered by a single Web
or search results page. This typically happens when users investigate a certain
topic in domains such as education, travel or health, which often require
collecting facts and information from many pages. We refer to this type of
activities as "research missions". These research missions account for 10% of
users' sessions and more than 25% of all query volume, as verified by a manual
analysis that was conducted by Yahoo! editors.
We demonstrate in this paper that such missions can be automatically identified on-the-fly, as the user interacts with the search engine, through careful runtime analysis of query flows and query sessions. The on-the-fly automatic identification of research missions has been implemented in Search Pad, a novel Yahoo! application that was launched in 2009, and that we present in this paper. Search Pad helps users keep track of results they have consulted. Its novelty, however, is that unlike previous note-taking products, it is automatically triggered only when the system decides, with a fair level of confidence, that the user is undertaking a research mission and thus is in the right context for gathering notes. Beyond the Search Pad specific application, we believe that changing the level of granularity of query modeling, from an isolated query to a list of queries pertaining to the same research mission, so as to better reflect a certain type of information needs, can be beneficial in a number of other Web search applications. Session-awareness is growing and it is likely to play, in the near future, a fundamental role in many on-line tasks: this paper presents a first step on this path. Keywords: persistent search, query logs, sessions, task-oriented search | |||
| Time is of the essence: improving recency ranking using Twitter data | | BIBAK | Full-Text | 331-340 | |
| Anlei Dong; Ruiqiang Zhang; Pranam Kolari; Jing Bai; Fernando Diaz; Yi Chang; Zhaohui Zheng; Hongyuan Zha | |||
| Realtime web search refers to the retrieval of very fresh content which is
in high demand. An effective portal web search engine must support a variety of
search needs, including realtime web search. However, supporting realtime web
search introduces two challenges not encountered in non-realtime web search:
quickly crawling relevant content and ranking documents with impoverished link
and click information. In this paper, we advocate the use of realtime
micro-blogging data for addressing both of these problems. We propose a method
to use the micro-blogging data stream to detect fresh URLs. We also use
micro-blogging data to compute novel and effective features for ranking fresh
URLs. We demonstrate that these methods improve the effectiveness of the portal web search
engine for realtime web search. Keywords: Twitter, recency modeling, recency ranking | |||
| Highlighting disputed claims on the web | | BIBAK | Full-Text | 341-350 | |
| Rob Ennals; Beth Trushkowsky; John Mark Agosta | |||
| We describe Dispute Finder, a browser extension that alerts a user when
information they read online is disputed by a source that they might trust.
Dispute Finder examines the text on the page that the user is browsing and
highlights any phrases that resemble known disputed claims. If a user clicks on
a highlighted phrase then Dispute Finder shows them a list of articles that
support other points of view.
Dispute Finder builds a database of known disputed claims by crawling web sites that already maintain lists of disputed claims, and by allowing users to enter claims that they believe are disputed. Dispute Finder identifies snippets that make known disputed claims by running a simple textual entailment algorithm inside the browser extension, referring to a cached local copy of the claim database. In this paper, we explain the design of Dispute Finder, and the trade-offs between the various design decisions that we explored. Keywords: annotation, argumentation, cscw, sensemaking, web | |||
| Privacy wizards for social networking sites | | BIBAK | Full-Text | 351-360 | |
| Lujun Fang; Kristen LeFevre | |||
| Privacy is an enormous problem in online social networking sites. While
sites such as Facebook allow users fine-grained control over who can see their
profiles, it is difficult for average users to specify this kind of detailed
policy.
In this paper, we propose a template for the design of a social networking privacy wizard. The intuition for the design comes from the observation that real users conceive their privacy preferences (which friends should be able to see which information) based on an implicit set of rules. Thus, with a limited amount of user input, it is usually possible to build a machine learning model that concisely describes a particular user's preferences, and then use this model to configure the user's privacy settings automatically. As an instance of this general framework, we have built a wizard based on an active learning paradigm called uncertainty sampling. The wizard iteratively asks the user to assign privacy "labels" to selected ("informative") friends, and it uses this input to construct a classifier, which can in turn be used to automatically assign privileges to the rest of the user's (unlabeled) friends. To evaluate our approach, we collected detailed privacy preference data from 45 real Facebook users. Our study revealed two important things. First, real users tend to conceive their privacy preferences in terms of communities, which can easily be extracted from a social network graph using existing techniques. Second, our active learning wizard, using communities as features, is able to recommend high-accuracy privacy settings using less user input than existing policy-specification tools. Keywords: active learning, social network privacy, usability | |||
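Uncertainty sampling, the active-learning core of the wizard described above, is a short loop: ask for the label the current classifier is least sure about, retrain, and repeat. The sketch below runs that loop with scikit-learn logistic regression over synthetic binary features standing in for community memberships; the data, features, and simulated oracle are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)   # community-membership features
true_w = np.array([2.0, -1.5, 0.5, 0.0, 1.0])
y = (X @ true_w + 0.2 * rng.normal(size=200) > 0.5).astype(int)  # hidden preferences

labeled = [int(np.argmax(y == 0)), int(np.argmax(y == 1))]  # one seed label per class
for _ in range(20):                                          # 20 questions to the "user"
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X)[:, 1]
    unlabeled = [i for i in range(200) if i not in labeled]
    ask = min(unlabeled, key=lambda i: abs(probs[i] - 0.5))  # most uncertain friend
    labeled.append(ask)                                      # oracle = stored label

rest = [i for i in range(200) if i not in labeled]
print("accuracy on remaining friends:", clf.score(X[rest], y[rest]))
```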
| A novel traffic analysis for identifying search fields in the long tail of web sites | | BIBAK | Full-Text | 361-370 | |
| George Forman; Evan Kirshenbaum; Shyamsundar Rajaram | |||
| Using a clickstream sample of 2 billion URLs from many thousand volunteer
Web users, we wish to analyze typical usage of keyword searches across the Web.
In order to do this, we need to be able to determine whether a given URL
represents a keyword search and, if so, which field contains the query.
Although it is easy to recognize 'q' as the query field in
'http://www.google.com/search?hl=en&q=music', we must do this automatically
for the long tail of diverse websites. This problem is the focus of this paper.
Since the names, types and number of fields differ across sites, this does not
conform to traditional text classification or to multi-class problem
formulations. The problem also exhibits highly non-uniform importance across
websites, since traffic follows a Zipf distribution.
We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint -- which we believe has broader applicability -- can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives. Keywords: active learning, clickstream analysis, machine learning classification, web
data mining | |||
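For the head of the traffic distribution, the query field can often be read directly off the URL's query string; it is the long tail that motivates the learning approach above. A minimal parsing sketch follows; the list of "likely" field names is an illustrative heuristic, not taken from the paper.

```python
from urllib.parse import urlparse, parse_qs

# Candidate field names often used for keyword search (illustrative list only).
LIKELY_QUERY_FIELDS = ("q", "query", "search", "p", "kw", "keywords")

def candidate_search_fields(url):
    """Return the query-string fields of a URL, flagging likely keyword-search fields."""
    fields = parse_qs(urlparse(url).query)
    return {name: (name.lower() in LIKELY_QUERY_FIELDS, values)
            for name, values in fields.items()}

print(candidate_search_fields("http://www.google.com/search?hl=en&q=music"))
# {'hl': (False, ['en']), 'q': (True, ['music'])}
```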
| Expressive auctions for externalities in online advertising | | BIBAK | Full-Text | 371-380 | |
| Arpita Ghosh; Amin Sayedi | |||
| When online ads are shown together, they compete for user attention and
conversions, imposing negative externalities on each other. While the
competition for user attention in sponsored search can be captured via models
of clickthrough rates, the post-click competition for conversions cannot: since
the value-per-click of an advertiser is proportional to the conversion
probability conditional on a click, which depends on the other ads displayed,
the private value of an advertiser is no longer one-dimensional, and the GSP
mechanism is not adequately expressive. We study the design of expressive
GSP-like mechanisms for the simplest form that an advertiser's private value
can have in the presence of such externalities -- an advertiser's value depends
on exclusivity, i.e., whether her ad is shown exclusively, or along with other
ads.
Our auctions take as input two-dimensional (per-click) bids for exclusive and nonexclusive display, and have two types of outcomes: either a single ad is displayed exclusively, or multiple ads are simultaneously shown. We design two expressive auctions that are both extensions of GSP -- the first auction, GSP2D, is designed with the property that the allocation and pricing are identical to GSP when multiple ads are shown; the second auction, NP2D, is designed to be a next price auction. We show that both auctions have high efficiency and revenue in all reasonable equilibria; further, the NP2D auction is guaranteed to always have an equilibrium with revenue at least as much as the current GSP mechanism. However, we find that unlike with one-dimensional valuations, the GSP-like auctions for these richer valuations do not always preserve efficiency and revenue with respect to the VCG mechanism. Keywords: auction design, expressive auctions, externalities, online advertising | |||
| Tracking the random surfer: empirically measured teleportation parameters in PageRank | | BIBAK | Full-Text | 381-390 | |
| David F. Gleich; Paul G. Constantine; Abraham D. Flaxman; Asela Gunawardana | |||
| PageRank computes the importance of each node in a directed graph under a
random surfer model governed by a teleportation parameter. Commonly denoted
alpha, this parameter models the probability of following an edge inside the
graph or, when the graph comes from a network of web pages and links, clicking
a link on a web page. We empirically measure the teleportation parameter based
on browser toolbar logs and a click trail analysis. For a particular user or
machine, such analysis produces a value of alpha. We find that these values
nicely fit a Beta distribution with mean edge-following probability between 0.3
and 0.7, depending on the site. Using these distributions, we compute PageRank
scores where the teleportation parameter is drawn from a distribution rather
than fixed to a single constant. These
new metrics are evaluated on the graph of pages in Wikipedia. Keywords: PageRank, Wikipedia, click trail analysis, empirical click probability,
teleportation parameter, toolbar data | |||
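The idea of replacing a constant teleportation parameter with a fitted distribution can be illustrated by averaging PageRank vectors over values of alpha sampled from a Beta distribution. The toy graph, Beta parameters, and Monte Carlo averaging below are assumptions for illustration; the paper's own computation over Wikipedia is not reproduced.

```python
import numpy as np

def pagerank(P, alpha, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with a column-stochastic transition matrix P."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_new = alpha * P @ x + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def expected_pagerank(P, a, b, n_samples=200, seed=0):
    """Average PageRank over teleportation parameters drawn from Beta(a, b)."""
    rng = np.random.default_rng(seed)
    alphas = rng.beta(a, b, size=n_samples)
    return np.mean([pagerank(P, alpha) for alpha in alphas], axis=0)

# Tiny 3-node example: column j holds the out-link distribution of node j.
P = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(expected_pagerank(P, a=2.0, b=2.0))  # Beta(2, 2) has mean alpha = 0.5
```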
| Document recommendation in social tagging services | | BIBAK | Full-Text | 391-400 | |
| Ziyu Guan; Can Wang; Jiajun Bu; Chun Chen; Kun Yang; Deng Cai; Xiaofei He | |||
| Social tagging services allow users to annotate various online resources
with freely chosen keywords (tags). They not only facilitate the users in
finding and organizing online resources, but also provide meaningful
collaborative semantic data which can potentially be exploited by recommender
systems. Traditional studies on recommender systems focused on user rating
data, while recently social tagging data is becoming more and more prevalent.
How to perform resource recommendation based on tagging data is an emerging
research topic. In this paper we consider the problem of document (e.g. Web
pages, research papers) recommendation using purely tagging data. That is, we
only have data containing users, tags, documents and the relationships among
them. We propose a novel graph-based representation learning algorithm for this
purpose. The users, tags and documents are represented in the same semantic
space in which two related objects are close to each other. For a given user,
we recommend those documents that are sufficiently close to him/her.
Experimental results on two data sets crawled from Del.icio.us and CiteULike
show that our algorithm can generate promising recommendations and outperforms
traditional recommendation algorithms. Keywords: recommender systems, social tagging | |||
| Equip tourists with knowledge mined from travelogues | | BIBAK | Full-Text | 401-410 | |
| Qiang Hao; Rui Cai; Changhu Wang; Rong Xiao; Jiang-Ming Yang; Yanwei Pang; Lei Zhang | |||
| With the prosperity of tourism and Web 2.0 technologies, more and more
people have willingness to share their travel experiences on the Web (e.g.,
weblogs, forums, or Web 2.0 communities). These so-called travelogues contain
rich information, particularly including location-representative knowledge such
as attractions (e.g., Golden Gate Bridge), styles (e.g., beach, history), and
activities (e.g., diving, surfing). The location-representative information in
travelogues can greatly facilitate other tourists' trip planning, if it can be
correctly extracted and summarized. However, since most travelogues are
unstructured and contain much noise, it is difficult for common users to
utilize such knowledge effectively. In this paper, to mine
location-representative knowledge from a large collection of travelogues, we
propose a probabilistic topic model, named the Location-Topic model. This model
has the advantages of (1) differentiability between two kinds of topics, i.e.,
local topics which characterize locations and global topics which represent
other common themes shared by various locations, and (2) representation of
locations in the local topic space to encode both location-representative
knowledge and similarities between locations. Some novel applications are
developed based on the proposed model, including (1) destination recommendation
for flexible queries, (2) characteristic summarization for a given
destination with representative tags and snippets, and (3) identification of
informative parts of a travelogue and enriching such highlights with related
images. Based on a large collection of travelogues, the proposed framework is
evaluated using both objective and subjective evaluation methods and shows
promising results. Keywords: probabilistic topic model, recommendation, travelogue mining | |||
| Data summaries for on-demand queries over linked data | | BIBAK | Full-Text | 411-420 | |
| Andreas Harth; Katja Hose; Marcel Karnstedt; Axel Polleres; Kai-Uwe Sattler; Jürgen Umbrich | |||
| Typical approaches for querying structured Web Data collect (crawl) and
pre-process (index) large amounts of data in a central data repository before
allowing for query answering. However, this time-consuming pre-processing phase
leverages the benefits of Linked Data -- where structured data is
accessible live and up-to-date at distributed Web resources that may change
constantly -- only to a limited degree, as query results can never be current.
An ideal query answering system for Linked Data should return current answers
in a reasonable amount of time, even on corpora as large as the Web. Query
processors evaluating queries directly on the live sources require knowledge of
the contents of data sources. In this paper, we develop and evaluate an
approximate index structure summarising graph-structured content of sources
adhering to Linked Data principles, provide an algorithm for answering
conjunctive queries over Linked Data on the Web exploiting the source summary,
and evaluate the system using synthetically generated queries. The experimental
results show that our lightweight index structure enables complete and
up-to-date query results over Linked Data, while keeping the overhead for
querying low and providing a satisfying source ranking at no additional cost. Keywords: index structures, linked data, rdf querying | |||
| Context-aware citation recommendation | | BIBAK | Full-Text | 421-430 | |
| Qi He; Jian Pei; Daniel Kifer; Prasenjit Mitra; Lee Giles | |||
| When you write papers, how many times do you want to make some citations at
a place but are not sure which papers to cite? Do you wish to have a
recommendation system which can recommend a small number of good candidates for
every place that you want to make some citations? In this paper, we present our
initiative of building a context-aware citation recommendation system. High
quality citation recommendation is challenging: not only should the citations
recommended be relevant to the paper under composition, but also should match
the local contexts of the places citations are made. Moreover, it is far from
trivial to model how the topic of the whole paper and the contexts of the
citation places should affect the selection and ranking of citations. To tackle
the problem, we develop a context-aware approach. The core idea is to design a
novel non-parametric probabilistic model which can measure the context-based
relevance between a citation context and a document. Our approach can recommend
citations for a context effectively. Moreover, it can recommend a set of
citations for a paper with high quality. We implement a prototype system in
CiteSeerX. An extensive empirical evaluation in the CiteSeerX digital library
against many baselines demonstrates the effectiveness and the scalability of
our approach. Keywords: bibliometrics, context, Gleason's theorem, recommender systems | |||
| The anatomy of a large-scale social search engine | | BIBAK | Full-Text | 431-440 | |
| Damon Horowitz; Sepandar D. Kamvar | |||
| We present Aardvark, a social search engine. With Aardvark, users ask a
question, either by instant message, email, web input, text message, or voice.
Aardvark then routes the question to the person in the user's extended social
network most likely to be able to answer that question. As compared to a
traditional web search engine, where the challenge lies in finding the right
document to satisfy a user's information need, the challenge in a social search
engine like Aardvark lies in finding the right person to satisfy a user's
information need. Further, while trust in a traditional search engine is based
on authority, in a social search engine like Aardvark, trust is based on
intimacy. We describe how these considerations inform the architecture,
algorithms, and user interface of Aardvark, and how they are reflected in the
behavior of Aardvark users. Keywords: social search | |||
| Multi-modality in one-class classification | | BIBAK | Full-Text | 441-450 | |
| Matthijs Hovelynck; Boris Chidlovskii | |||
| We propose a method for improving classification performance in a one-class
setting by combining classifiers of different modalities. We apply the method
to the problem of distinguishing responsive documents in a corpus of e-mails,
such as the Enron Corpus. We extract the social network of actors that is implicit in
a large body of electronic communication and turn it into valuable features for
classifying the exchanged documents. Working in a one-class setting we follow a
semi-supervised approach based on the Mapping Convergence framework. We propose
an alternative interpretation, that allows for broader applicability when
positive and negative items are not naturally separable. We propose an
extension to the one-class evaluation framework for truly one-class cases where
only some positive training examples are available. We extend the one-class
setting to the co-training principle that enables us to take advantage of
multiple views on the data. We report evaluation results of this extension on
three different corpora, including the Enron Corpus. Keywords: co-training, Enron, one-class classification, responsiveness | |||
| Exploring web scale language models for search query processing | | BIBAK | Full-Text | 451-460 | |
| Jian Huang; Jianfeng Gao; Jiangbo Miao; Xiaolong Li; Kuansan Wang; Fritz Behr; C. Lee Giles | |||
| It has been widely observed that search queries are composed in a very
different style from that of the body or the title of a document. Many
techniques explicitly accounting for this language style discrepancy have shown
promising results for information retrieval, yet a large scale analysis on the
extent of the language differences has been lacking. In this paper, we present
an extensive study on this issue by examining the language model properties of
search queries and the three text streams associated with each web document:
the body, the title, and the anchor text. Our information theoretical analysis
shows that queries seem to be composed in a way most similar to how authors
summarize documents in anchor texts or titles, offering a quantitative
explanation to the observations in past work.
We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation. By controlling the size and the order of different language models, we find the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains, for instance for query bracketing, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have a marked accuracy advantage over smaller ones. Keywords: language models, search query processing, very large-scale experiments, web
n-gram | |||
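Perplexity, the indicator the authors find predictive of downstream accuracy, is straightforward to compute for any n-gram model. The following minimal bigram example uses add-one smoothing on a toy corpus; the smoothing choice and data are assumptions and bear no relation to the paper's web-scale models.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model (illustrative sketch)."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab_size = len(unigrams)

    def prob(w_prev, w):
        # Add-one smoothing keeps every bigram probability strictly positive.
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

    log_prob = sum(math.log(prob(a, b)) for a, b in zip(test_tokens, test_tokens[1:]))
    n_bigrams = len(test_tokens) - 1
    return math.exp(-log_prob / n_bigrams)

train = "britney spears lyrics free music download music lyrics".split()
test = "free music lyrics".split()
print(bigram_perplexity(train, test))  # lower = the model fits the test text better
```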
| A scalable machine-learning approach for semi-structured named entity recognition | | BIBAK | Full-Text | 461-470 | |
| Utku Irmak; Reiner Kraft | |||
| Named entity recognition studies the problem of locating and classifying
parts of free text into a set of predefined categories. Although extensive
research has focused on the detection of person, location and organization
entities, there are many other entities of interest, including phone numbers,
dates, times and currencies (to name a few examples). We refer to these types
of entities as "semi-structured named entities", since they usually follow
certain syntactic formats according to some conventions, although their
structure is typically not well-defined. Regular expression solutions require
a significant amount of manual effort, and supervised machine learning approaches
rely on large sets of labeled training data. Therefore, these approaches do not
scale when we need to support many semi-structured entity types in many
languages and regions.
In this paper, we study this problem and propose a novel three-level bootstrapping framework for the detection of semi-structured entities. We describe the proposed techniques for phone, date and time entities, and perform extensive evaluations on English, German, Polish, Swedish and Turkish documents. Despite the minimal input from the user, our approach can achieve 95% precision and 84% recall for phone entities, and 94% precision and 81% recall for date and time entities, on average. We also discuss implementation details and report run time performance results, which show significant improvements over regular expression based solutions. Keywords: bootstrapping algorithm, ner, weakly-supervised learning | |||
| Autonomous resource provisioning for multi-service web applications | | BIBAK | Full-Text | 471-480 | |
| Dejun Jiang; Guillaume Pierre; Chi-Hung Chi | |||
| Dynamic resource provisioning aims at maintaining the end-to-end response
time of a web application within a pre-defined SLA. Although the topic has been
well studied for monolithic applications, provisioning resources for
applications composed of multiple services remains a challenge. When the SLA is
violated, one must decide which service(s) should be reprovisioned for optimal
effect. We propose to assign an SLA only to the front-end service. Other
services are not given any particular response time objectives. Services are
autonomously responsible for their own provisioning operations and
collaboratively negotiate performance objectives with each other to decide the
provisioning service(s). We demonstrate through extensive experiments that our
system can add/remove/shift both servers and caches within an entire
multi-service application under varying workloads to meet the SLA target and
improve resource utilization. Keywords: multi-service application, resource provisioning | |||
| Topic initiator detection on the world wide web | | BIBAK | Full-Text | 481-490 | |
| Xin Jin; Scott Spangler; Rui Ma; Jiawei Han | |||
| In this paper we introduce a new Web mining and search technique -- Topic
Initiator Detection (TID) on the Web. Given a topic query on the Internet and
the resulting collection of time-stamped web documents which contain the query
keywords, the task of TID is to automatically return which web document (or its
author) initiated the topic or was the first to discuss the topic.
To deal with the TID problem, we design a system framework and propose the InitRank (Initiator Ranking) algorithm to rank web documents by their likelihood of being the topic initiator. We first extract features from the web documents and design several topic-initiator indicators. Then, we propose a TCL graph, which integrates the Time, Content and Link information, and design an optimization framework over the graph to compute InitRank. Experiments show that, compared with baseline methods such as direct time sorting and the well-known link-based ranking algorithms PageRank and HITS, InitRank achieves the best overall performance with high effectiveness and robustness. In case studies, we successfully detected (1) the first web document related to a famous rumor about an Australian product banned in the USA and (2) the pre-release of the IBM and Google cloud computing collaboration before the official announcement. Keywords: information retrieval, ranking, topic initiator, web mining | |||
| Smart caching for web browsers | | BIBAK | Full-Text | 491-500 | |
| Kaimin Zhang; Lu Wang; Aimin Pan; Bin Benjamin Zhu | |||
| This paper presents smart caching schemes for Web browsers. For modern Web
applications, the style formatting and layout calculation often account for
substantial amounts of the local computation in order to render a Web page. In
this paper, we propose two caching schemes to reduce the computation of style
formatting and layout calculation, named smart style caching and layout
caching, respectively. The stable style data and layout data for DOM (Document
Object Model) elements are recorded to construct the caches when a Web page is
browsed. The cached data is checked in the granularity of DOM elements and
applied directly if the identified DOM element is not changed in subsequent
visits to the same page.
We have implemented a prototype of the proposed caching schemes based on the Webkit layout engine. The experimental results with Web pages from the Top 25 Web sites show that, with the smart style caching scheme enabled, the time consumed for style formatting is reduced by 64% on average; with both the smart style caching scheme and layout caching scheme enabled, the time consumed for layout calculation is reduced by 61% on average, and the overall performance improvement is about 46%. Keywords: browser, caching, cascade style sheet, css, javascript, web | |||
| Large-scale bot detection for search engines | | BIBAK | Full-Text | 501-510 | |
| Hongwen Kang; Kuansan Wang; David Soukal; Fritz Behr; Zijian Zheng | |||
| In this paper, we propose a semi-supervised learning approach for
classifying program (bot) generated web search traffic from that of genuine
human users. The work is motivated by the challenge that the enormous amount of
search data poses to traditional approaches that rely on fully annotated
training samples. We propose a semi-supervised framework that addresses the
problem in multiple fronts. First, we use the CAPTCHA technique and simple
heuristics to extract from the data logs a large set of training samples with
initial labels, though directly using these training data is problematic
because the data thus sampled are biased. To tackle this problem, we further
develop a semi-supervised learning algorithm to take advantage of the unlabeled
data to improve the classification performance. These two proposed algorithms
can be seamlessly combined and are very cost-efficient, allowing the training
process to scale. In our experiments, the proposed approach showed a significant (i.e. 2:1)
improvement compared to the traditional supervised approach. Keywords: bot detection, captcha, click logs, query logs, search engine,
semi-supervised learning | |||
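The overall recipe (seed labels from CAPTCHA-style heuristics, then semi-supervised learning over the remaining unlabeled logs) can be caricatured with a generic self-training loop. This is not the paper's algorithm; the classifier, confidence threshold, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_seed, y_seed, X_unlabeled, threshold=0.95, rounds=5):
    """Generic self-training sketch: repeatedly add confident pseudo-labels."""
    X, y = X_seed.copy(), y_seed.copy()
    pool = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X, y)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        conf = probs.max(axis=1)
        keep = conf >= threshold            # pseudo-label only the confident examples
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, clf.classes_[probs[keep].argmax(axis=1)]])
        pool = pool[~keep]
    return clf
```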
| LCA-based selection for XML document collections | | BIBAK | Full-Text | 511-520 | |
| Georgia Koloniari; Evaggelia Pitoura | |||
| In this paper, we address the problem of database selection for XML document
collections, that is, given a set of collections and a user query, how to rank
the collections based on their goodness to the query. Goodness is determined by
the relevance of the documents in the collection to the query. We consider
keyword queries and support Lowest Common Ancestor (LCA) semantics for defining
query results, where the relevance of each document to a query is determined by
properties of the LCA of those nodes in the XML document that contain the query
keywords. To avoid evaluating queries against each document in a collection, we
propose maintaining, in a preprocessing phase, information about the LCAs of all
pairs of keywords in a document and using it to approximate the properties of the
LCA-based results of a query. To improve storage and processing efficiency, we
use appropriate summaries of the LCA information based on Bloom filters. We
address both a boolean and a weighted version of the database selection
problem. Our experimental results show that our approach incurs low errors in
the estimation of the goodness of a collection and provides rankings that are
very close to the actual ones. Keywords: database selection, lowest common ancestor, xml | |||
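The LCA computations summarized in the preprocessing phase are cheap when XML nodes carry Dewey-style labels, since the LCA of two nodes is simply the longest common prefix of their labels. A small sketch, with a toy inverted index assumed for illustration:

```python
def lca(dewey_a, dewey_b):
    """LCA of two XML nodes given Dewey labels, e.g. (1, 2, 4) and (1, 2, 7) -> (1, 2)."""
    prefix = []
    for a, b in zip(dewey_a, dewey_b):
        if a != b:
            break
        prefix.append(a)
    return tuple(prefix)

# Toy inverted index: keyword -> Dewey labels of the nodes containing it.
index = {
    "xml":      [(1, 1, 1), (1, 2, 3)],
    "database": [(1, 2, 4)],
}

# Pre-compute, for a keyword pair, the depths of the LCAs of all node pairs;
# deeper LCAs usually indicate more tightly related matches.
pair_lca_depths = [len(lca(a, b)) for a in index["xml"] for b in index["database"]]
print(pair_lca_depths)   # [1, 2]
```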
| Stop thinking, start tagging: tag semantics emerge from collaborative verbosity | | BIBAK | Full-Text | 521-530 | |
| Christian Körner; Dominik Benz; Andreas Hotho; Markus Strohmaier; Gerd Stumme | |||
| Recent research provides evidence for the presence of emergent semantics in
collaborative tagging systems. While several methods have been proposed, little
is known about the factors that influence the evolution of semantic structures
in these systems. A natural hypothesis is that the quality of the emergent
semantics depends on the pragmatics of tagging: Users with certain usage
patterns might contribute more to the resulting semantics than others. In this
work, we propose several measures which enable a pragmatic differentiation of
taggers by their degree of contribution to emerging semantic structures. We
distinguish between categorizers, who typically use a small set of tags as a
replacement for hierarchical classification schemes, and describers, who
annotate resources with a wealth of freely associated, descriptive keywords.
To study our hypothesis, we apply semantic similarity measures to 64 different
partitions of a real-world and large-scale folksonomy containing different
ratios of categorizers and describers. Our results not only show that "verbose"
taggers are most useful for the emergence of tag semantics, but also that a
subset containing only 40% of the most 'verbose' taggers can produce results
that match and even outperform the semantic precision obtained from the whole
dataset. Moreover, the results suggest that there exists a causal link between
the pragmatics of tagging and resulting emergent semantics. This work is
relevant for designers and analysts of tagging systems interested (i) in
fostering the semantic development of their platforms, (ii) in identifying
users introducing "semantic noise", and (iii) in learning ontologies. Keywords: folksonomies, pragmatics, semantics, tagging, user characteristics | |||
| Mind the data skew: distributed inferencing by speeddating in elastic regions | | BIBAK | Full-Text | 531-540 | |
| Spyros Kotoulas; Eyal Oren; Frank van Harmelen | |||
| Semantic Web data exhibits very skewed frequency distributions among terms.
Efficient large-scale distributed reasoning methods should maintain
load-balance in the face of such highly skewed distribution of input data. We
show that term-based partitioning, used by most distributed reasoning
approaches, has limited scalability due to load-balancing problems.
We address this problem with a method for data distribution based on clustering in elastic regions. Instead of assigning data to fixed peers, data flows semi-randomly in the network. Data items "speed-date" while being temporarily collocated in the same peer. We introduce a bias in the routing to allow semantically clustered neighborhoods to emerge. Our approach is self-organising, efficient and does not require any central coordination. We have implemented this method on the MaRVIN platform and have performed experiments on large real-world datasets, using a cluster of up to 64 nodes. We compute the RDFS closure over different datasets and show that our clustering algorithm drastically reduces computation time, calculating the RDFS closure of 200 million triples in 7.2 minutes. Keywords: distributed, peer-to-peer, reasoning, self-organisation | |||
| Towards natural question guided search | | BIBAK | Full-Text | 541-550 | |
| Alexander Kotov; ChengXiang Zhai | |||
| Web search is generally motivated by an information need. Since asking
well-formulated questions is the fastest and the most natural way to obtain
information for human beings, almost all queries posed to search engines
correspond to some underlying questions, which reflect the user's information
need. Accurate determination of these questions may substantially improve the
quality of search results and usability of search interfaces. In this paper, we
propose a new framework for question-guided search, in which a retrieval system
would automatically generate potentially interesting questions to users based
on the search results of a query. Since the answers to such questions are known
to exist in the search results, these questions can potentially guide users
directly to the answers that they are looking for, eliminating the need to scan
the documents in the result list. Moreover, in case of imprecise or ambiguous
queries, automatically generated questions can naturally engage users into a
feedback cycle to refine their information need and guide them towards their
search goals. Implementation of the proposed strategy raises new challenges in
content indexing, question generation, ranking and feedback. We propose new
methods to address these challenges and evaluate them with a prototype system
on a subset of Wikipedia. Evaluation results show the promise of this new
question-guided search strategy. Keywords: interactive search, query processing, search interface | |||
| Fine-grained privilege separation for web applications | | BIBAK | Full-Text | 551-560 | |
| Akshay Krishnamurthy; Adrian Mettler; David Wagner | |||
| We present a programming model for building web applications with security
properties that can be confidently verified during a security review. In our
model, applications are divided into isolated, privilege-separated components,
enabling rich security policies to be enforced in a way that can be checked by
reviewers. In our model, the web framework enforces privilege separation and
isolation of web applications by requiring the use of an object-capability
language and providing interfaces that expose limited, explicitly-specified
privileges to application components. This approach restricts what each
component of the application can do and quarantines buggy or compromised code.
It also provides a way to more safely integrate third-party, less-trusted code
into a web application. We have implemented a prototype of this model based
upon the Java Servlet framework and used it to build a webmail application. Our
experience with this example suggests that the approach is viable and helpful
at establishing reviewable application-specific security properties. Keywords: object-capabilities, privilege separation, web applications | |||
| A characterization of online browsing behavior | | BIBAK | Full-Text | 561-570 | |
| Ravi Kumar; Andrew Tomkins | |||
| In this paper, we undertake a large-scale study of online user behavior
based on search and toolbar logs. We propose a new CCS taxonomy of pageviews
consisting of Content (news, portals, games, verticals, multimedia),
Communication (email, social networking, forums, blogs, chat), and Search (Web
search, item search, multimedia search). We show that roughly half of all
pageviews online are content, one-third are communications, and the remaining
one-sixth are search. We then give further breakdowns to characterize the
pageviews within each high-level category.
We then study the extent to which pages of certain types are revisited by the same user over time, and the mechanisms by which users move from page to page, within and across hosts, and within and across page types. We consider robust schemes for assigning responsibility for a pageview to ancestors along the chain of referrals. We show that mail, news, and social networking pageviews are insular in nature, appearing primarily in homogeneous sessions of one type. Search pageviews, on the other hand, appear on the path to a disproportionate number of pageviews, but cannot be viewed as the principal mechanism by which those pageviews were reached. Finally, we study the burstiness of pageviews associated with a URL, and show that by and large, online browsing behavior is not significantly affected by "breaking" material with non-uniform visit frequency. Keywords: browsing, pageviews, toolbar analysis | |||
| Generalized distances between rankings | | BIBAK | Full-Text | 571-580 | |
| Ravi Kumar; Sergei Vassilvitskii | |||
| Spearman's footrule and Kendall's tau are two well established distances
between rankings. They, however, fail to take into account concepts crucial to
evaluating a result set in information retrieval: element relevance and
positional information. That is, changing the rank of a highly-relevant
document should result in a higher penalty than changing the rank of an
irrelevant document; a similar logic holds for the top versus the bottom of the
result ordering. In this work, we extend both of these metrics to those with
position and element weights, and show that a variant of the Diaconis-Graham
inequality still holds -- the two generalized measures remain within a constant
factor of each other for all permutations.
We continue by extending the element weights into a distance metric between elements. For example, in search evaluation, swapping the order of two nearly duplicate results should result in little penalty, even if these two are highly relevant and appear at the top of the list. We extend the distance measures to this more general case and show that they remain within a constant factor of each other. We conclude by conducting simple experiments on web search data with the proposed measures. Our experiments show that the weighted generalizations are more robust and consistent with each other than their unweighted counterparts. Keywords: kendall's tau, metrics, permutation distances, spearman footrule | |||
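One simple instance of an element-weighted Kendall distance, shown below, weights each discordant pair by the product of the two elements' relevance weights; position weights can be layered on in the same style. This is an illustrative variant, not necessarily the paper's exact definition.

```python
from itertools import combinations

def weighted_kendall(rank_a, rank_b, weight):
    """Element-weighted Kendall distance between two rankings.

    rank_a, rank_b : dicts mapping item -> position (1 = top).
    weight         : dict mapping item -> relevance weight.
    A discordant pair (i, j) contributes weight[i] * weight[j].
    """
    dist = 0.0
    for i, j in combinations(list(rank_a), 2):
        discordant = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0
        if discordant:
            dist += weight[i] * weight[j]
    return dist

rank_a = {"d1": 1, "d2": 2, "d3": 3}
rank_b = {"d1": 2, "d2": 1, "d3": 3}          # d1 and d2 swapped
weights = {"d1": 3.0, "d2": 1.0, "d3": 1.0}   # d1 is highly relevant
print(weighted_kendall(rank_a, rank_b, weights))  # 3.0: the swap involves d1
```

With all weights equal to one, the function reduces to the classical (unweighted) Kendall tau distance, i.e. the count of discordant pairs.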
| Redundancy detection in service-oriented systems | | BIBAK | Full-Text | 581-590 | |
| Peep Küngas; Marlon Dumas | |||
| This paper addresses the problem of identifying redundant data in
large-scale service-oriented information systems. Specifically, the paper puts
forward an automated method to pinpoint potentially redundant data attributes
from a given collection of semantically-annotated Web service interfaces. The
key idea is to construct a service network to represent all input and output
dependencies between data attributes and operations captured in the service
interfaces, and to apply centrality measures from network theory in order to
quantify the degree to which an attribute belongs to a given subsystem. The
proposed method was tested on a federated governmental information system
consisting of 58 independently-maintained information systems providing
altogether about 1000 service operations described in WSDL. The accuracy of the
method is evaluated in terms of precision and recall. Keywords: data model redundancy, federated information systems, metrics, semantic web
services, web services | |||
| What is Twitter, a social network or a news media? | | BIBAK | Full-Text | 591-600 | |
| Haewoon Kwak; Changhyun Lee; Hosung Park; Sue Moon | |||
| Twitter, a microblogging service less than three years old, commands more
than 41 million users as of July 2009 and is growing fast. Twitter users tweet
about any topic within the 140-character limit and follow others to receive
their tweets. The goal of this paper is to study the topological
characteristics of Twitter and its power as a new medium of information
sharing.
We have crawled the entire Twitter site and obtained 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks [28]. In order to identify influentials on Twitter, we have ranked users by the number of followers and by PageRank and found two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap in influence inferred from the number of followers and that from the popularity of one's tweets. We have analyzed the tweets of top trending topics and reported on their temporal behavior and user participation. We have classified the trending topics based on the active period and the tweets and show that the majority (over 85%) of topics are headline news or persistent news in nature. A closer look at retweets reveals that any retweeted tweet is to reach an average of 1,000 users no matter what the number of followers is of the original tweet. Once retweeted, a tweet gets retweeted almost instantly on next hops, signifying fast diffusion of information after the 1st retweet. To the best of our knowledge this work is the first quantitative study on the entire Twittersphere and information diffusion on it. Keywords: Twitter, degree of separation, homophily, influential, information
diffusion, online social network, pagerank, reciprocity, retweet | |||
| Randomization tests for distinguishing social influence and homophily effects | | BIBAK | Full-Text | 601-610 | |
| Timothy La Fond; Jennifer Neville | |||
| Relational autocorrelation is ubiquitous in relational domains. This
observed correlation between class labels of linked instances in a network
(e.g., two friends are more likely to share political beliefs than two randomly
selected people) can be due to the effects of two different social processes.
If social influence effects are present, instances are likely to change their
attributes to conform to their neighbor values. If homophily effects are
present, instances are likely to link to other individuals with similar
attribute values. Both these effects will result in autocorrelated attribute
values. When analyzing static relational networks it is impossible to determine
how much of the observed correlation is due to each of these factors. However, the
recent surge of interest in social networks has increased the availability of
dynamic network data. In this paper, we present a randomization technique for
temporal network data where the attributes and links change over time. Given
data from two time steps, we measure the gain in correlation and assess whether
a significant portion of this gain is due to influence and/or homophily. We
demonstrate the efficacy of our method on semi-synthetic data and then apply
the method to a real-world social networks dataset, showing the impact of both
influence and homophily effects. Keywords: homophily, randomization, social influence, social networks | |||
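The randomization idea can be sketched generically: compute the observed gain in autocorrelation between two time steps, then compare it with a null distribution obtained by reassigning the same number of attribute changes to random nodes. The autocorrelation measure, binary attributes, and data layout below are assumptions, not the paper's exact test.

```python
import random

def autocorrelation(edges, labels):
    """Fraction of edges whose endpoints share the same attribute value."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

def influence_randomization_test(edges, labels_t1, labels_t2, n_perm=1000, seed=0):
    """Permutation test: is the t1 -> t2 gain in autocorrelation significant?"""
    rng = random.Random(seed)
    observed_gain = autocorrelation(edges, labels_t2) - autocorrelation(edges, labels_t1)

    changed = [v for v in labels_t1 if labels_t1[v] != labels_t2[v]]
    null_gains = []
    for _ in range(n_perm):
        # Null model: the same number of attribute changes, assigned to random nodes.
        fake = dict(labels_t1)
        for v in rng.sample(list(labels_t1), len(changed)):
            fake[v] = 1 - fake[v]           # flip a binary (0/1) attribute
        null_gains.append(autocorrelation(edges, fake) - autocorrelation(edges, labels_t1))

    p_value = sum(g >= observed_gain for g in null_gains) / n_perm
    return observed_gain, p_value
```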
| A pattern tree-based approach to learning URL normalization rules | | BIBAK | Full-Text | 611-620 | |
| Tao Lei; Rui Cai; Jiang-Ming Yang; Yan Ke; Xiaodong Fan; Lei Zhang | |||
| Duplicate URLs have brought serious troubles to the whole pipeline of a
search engine, from crawling, indexing, to result serving. URL normalization is
to transform duplicate URLs to a canonical form using a set of rewrite rules.
Nowadays URL normalization has attracted significant attention as it is
lightweight and can be flexibly integrated into both the online (e.g. crawling)
and the offline (e.g. index compression) parts of a search engine. To deal with
a large scale of websites, automatic approaches are highly desired to learn
rewrite rules for various kinds of duplicate URLs. In this paper, we rethink
the problem of URL normalization from a global perspective and propose a
pattern tree-based approach, which is remarkably different from existing
approaches. Most current approaches learn rewrite rules by iteratively inducing
local duplicate pairs to more general forms, and inevitably suffer from noisy
training data and are practically inefficient. Given a training set of URLs
partitioned into duplicate clusters for a targeted website, we develop a simple
yet efficient algorithm to automatically construct a URL pattern tree. With the
pattern tree, the statistical information from all the training samples is
leveraged to make the learning process more robust and reliable. The learning
process is also accelerated as rules are directly summarized based on pattern
tree nodes. In addition, from an engineering perspective, the pattern tree
helps select deployable rules by removing conflicts and redundancies. An
evaluation on more than 70 million duplicate URLs from 200 websites showed that
the proposed approach achieves very promising performance, in terms of both
de-duping effectiveness and computational efficiency. Keywords: url deduplication, url normalization, url pattern | |||
| Using a model of social dynamics to predict popularity of news | | BIBAK | Full-Text | 621-630 | |
| Kristina Lerman; Tad Hogg | |||
| Popularity of content in social media is unequally distributed, with some
items receiving a disproportionate share of attention from users. Predicting
which newly-submitted items will become popular is critically important for
both companies that host social media sites and their users. Accurate and
timely prediction would enable the companies to maximize revenue through
differential pricing for access to content or ad placement. Prediction would
also give consumers an important tool for filtering the ever-growing amount of
content. Predicting popularity of content in social media, however, is
challenging due to the complex interactions among content quality, how the
social media site chooses to highlight content, and influence among users.
While these factors make it difficult to predict popularity a priori, we show
that stochastic models of user behavior on these sites allow predicting
popularity based on early user reactions to new content. By incorporating
aspects of the web site design, such models improve on predictions based on
simply extrapolating from the early votes. We validate this claim on the social
news portal Digg using a previously-developed model of social voting based on
the Digg user interface. Keywords: popularity, prediction, social dynamics, social media, social voting | |||
| Empirical comparison of algorithms for network community detection | | BIBAK | Full-Text | 631-640 | |
| Jure Leskovec; Kevin J. Lang; Michael Mahoney | |||
| Detecting clusters or communities in large real-world graphs such as large
social or information networks is a problem of considerable interest. In
practice, one typically chooses an objective function that captures the
intuition of a network cluster as set of nodes with better internal
connectivity than external connectivity, and then one applies approximation
algorithms or heuristics to extract sets of nodes that are related to the
objective function and that "look like" good communities for the application of
interest.
In this paper, we explore a range of network community detection methods in order to compare them and to understand their relative performance and the systematic biases in the clusters they identify. We evaluate several common objective functions that are used to formalize the notion of a network community, and we examine several different classes of approximation algorithms that aim to optimize such objective functions. In addition, rather than simply fixing an objective and asking for an approximation to the best cluster of any size, we consider a size-resolved version of the optimization problem. Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior. Keywords: community structure, conductance, flow-based methods, graph partitioning,
spectral methods | |||
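Conductance is among the most common objective functions compared in such studies: the fraction of a node set's edge volume that crosses the cut. A small sketch using the standard definition on a toy graph (networkx's karate club network is used purely for illustration):

```python
import networkx as nx

def conductance(G, S):
    """Conductance of node set S: cut(S, V\\S) / min(vol(S), vol(V\\S))."""
    S = set(S)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    vol_S = sum(d for n, d in G.degree() if n in S)
    vol_rest = sum(d for n, d in G.degree() if n not in S)
    return cut / min(vol_S, vol_rest)

G = nx.karate_club_graph()
community = [n for n, data in G.nodes(data=True) if data["club"] == "Mr. Hi"]
print(round(conductance(G, community), 3))   # low value = well-separated community
```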
| Predicting positive and negative links in online social networks | | BIBAK | Full-Text | 641-650 | |
| Jure Leskovec; Daniel Huttenlocher; Jon Kleinberg | |||
| We study online social networks in which relationships can be either
positive (indicating relations such as friendship) or negative (indicating
relations such as opposition or antagonism). Such a mix of positive and
negative links arises in a variety of online settings; we study datasets from
Epinions, Slashdot and Wikipedia. We find that the signs of links in the
underlying social networks can be predicted with high accuracy, using models
that generalize across this diverse range of sites. These models provide
insight into some of the fundamental principles that drive the formation of
signed links in networks, shedding light on theories of balance and status from
social psychology; they also suggest social computing applications by which the
attitude of one user toward another can be estimated from evidence provided by
their relationships with other members of the surrounding social network. Keywords: distrust, negative edges, positive edges, signed networks, status theory,
structural balance, trust | |||
| Facetedpedia: dynamic generation of query-dependent faceted interfaces for wikipedia | | BIBAK | Full-Text | 651-660 | |
| Chengkai Li; Ning Yan; Senjuti B. Roy; Lekhendro Lisham; Gautam Das | |||
| This paper proposes Facetedpedia, a faceted retrieval system for information
discovery and exploration in Wikipedia. Given the set of Wikipedia articles
resulting from a keyword query, Facetedpedia generates a faceted interface for
navigating the result articles. Compared with other faceted retrieval systems,
Facetedpedia is fully automatic and dynamic in both facet generation and
hierarchy construction, and the facets are based on the rich semantic
information from Wikipedia. The essence of our approach is to build upon the
collaborative vocabulary in Wikipedia, more specifically the intensive internal
structures (hyperlinks) and folksonomy (category system). Given the sheer size
and complexity of this corpus, the space of possible choices of faceted
interfaces is prohibitively large. We propose metrics for ranking individual
facet hierarchies by user's navigational cost, and metrics for ranking
interfaces (each with k facets) by both their average pairwise similarities and
average navigational costs. We thus develop faceted interface discovery
algorithms that optimize the ranking metrics. Our experimental evaluation and
user study verify the effectiveness of the system. Keywords: data exploration, faceted search, wikipedia | |||
| A contextual-bandit approach to personalized news article recommendation | | BIBAK | Full-Text | 661-670 | |
| Lihong Li; Wei Chu; John Langford; Robert E. Schapire | |||
| Personalized web services strive to adapt their services (advertisements,
news articles, etc.) to individual users by making use of both content and user
information. Despite a few recent advances, this problem remains challenging
for at least two reasons. First, web services feature dynamically
changing pools of content, rendering traditional collaborative filtering
methods inapplicable. Second, the scale of most web services of practical
interest calls for solutions that are fast in both learning and computation.
In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are three-fold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce. Keywords: contextual bandit, exploration/exploitation dilemma, personalization,
recommender systems, web service | |||
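A minimal sketch of a linear-payoff contextual bandit with upper-confidence-bound exploration, in the spirit of the approach described: a ridge-regression model per arm plus a UCB bonus. The feature dimensionality, exploration constant, and reward are assumptions; this is not claimed to be the authors' exact algorithm or evaluation protocol.

```python
import numpy as np

class LinUCBArm:
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)          # ridge-regression Gram matrix
        self.b = np.zeros(d)        # reward-weighted feature sum
        self.alpha = alpha          # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        # Predicted payoff plus an upper-confidence bonus.
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose_article(arms, x):
    """Pick the article (arm) with the highest upper confidence bound for context x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))

# Usage sketch: 3 candidate articles, 5-dimensional user/article context features.
arms = [LinUCBArm(d=5) for _ in range(3)]
x = np.random.rand(5)
a = choose_article(arms, x)
reward = 1.0                      # e.g. the user clicked the shown article
arms[a].update(x, reward)
```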
| b-Bit minwise hashing | | BIBAK | Full-Text | 671-680 | |
| Ping Li; Christian König | |||
| This paper establishes the theoretical framework of b-bit minwise hashing.
The original minwise hashing method has become a standard technique for
estimating set similarity (e.g., resemblance) with applications in information
retrieval, data management, computational advertising, etc.
By only storing b bits of each hashed value (e.g., b=1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance >0.5. Keywords: hashing, similarity estimation | |||
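The storage trick can be illustrated for b=1: keep only the lowest bit of each minhash value and correct the raw collision rate with the simplified estimator R ≈ 2P − 1 (P being the fraction of matching bits), which is a good approximation when both sets are tiny relative to the hash universe. The hash construction below is an illustrative stand-in for proper random permutations, not the paper's scheme.

```python
import random

def minhash_bits(items, n_perm=256, b=1, seed=42):
    """Keep only the lowest b bits of each of n_perm minwise hash values."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(n_perm)]
    mask = (1 << b) - 1
    return [min(hash((s, x)) & 0xFFFFFFFF for x in items) & mask for s in seeds]

def estimate_resemblance_b1(sig_a, sig_b):
    """b=1 estimator: R ~= 2 * P_hat - 1, where P_hat = fraction of matching bits."""
    p_hat = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    return 2 * p_hat - 1

A = set(range(0, 1000))
B = set(range(200, 1200))          # true resemblance = 800 / 1200 ~ 0.667
print(estimate_resemblance_b1(minhash_bits(A, b=1), minhash_bits(B, b=1)))
```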
| Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce | | BIBAK | Full-Text | 681-690 | |
| Chao Liu; Hung-chih Yang; Jinliang Fan; Li-Wei He; Yi-Min Wang | |||
| The Web abounds with dyadic data that keeps increasing every single
second. Previous work has repeatedly shown the usefulness of extracting the
interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in
extracting the underlying structure is the matrix factorization, whose fame was
further boosted in the Netflix challenge [26]. When we were trying to replicate
the same success on real-world Web dyadic data, we were seriously challenged by
the scalability of available tools. We therefore in this paper report our
efforts on scaling up the nonnegative matrix factorization (NMF) technique. We
show that by carefully partitioning the data and arranging the computations to
maximize data locality and parallelism, factorizing a matrix of tens of millions
of rows by hundreds of millions of columns with billions of nonzero cells can be accomplished
within tens of hours. This result effectively assures practitioners of the
scalability of NMF on Web-scale dyadic data. Keywords: distributed computing, dyadic data, mapreduce, nonnegative matrix
factorization | |||
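The per-iteration computation that the paper distributes over MapReduce is, at its core, the standard multiplicative-update step for NMF. A single-machine numpy sketch of that step follows; the distributed partitioning, which is the paper's actual contribution, is not reproduced here.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative matrix factorization V ~ W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Frobenius-norm objective.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((20, 30))   # toy nonnegative dyadic matrix
W, H = nmf(V, k=5)
print(np.linalg.norm(V - W @ H))                # reconstruction error
```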
| Exploiting social context for review quality prediction | | BIBAK | Full-Text | 691-700 | |
| Yue Lu; Panayiotis Tsaparas; Alexandros Ntoulas; Livia Polanyi | |||
| Online reviews in which users publish detailed commentary about their
experiences and opinions with products, services, or events are extremely
valuable to users who rely on them to make informed decisions. However, reviews
vary greatly in quality and are constantly increasing in number; therefore,
automatic assessment of review helpfulness is of growing importance. Previous
work has addressed the problem by treating a review as a stand-alone document,
extracting features from the review text, and learning a function based on
these features for predicting the review quality. In this work, we exploit
contextual information about authors' identities and social networks for
improving review quality prediction. We propose a generic framework for
incorporating social context information by adding regularization constraints
to the text-based predictor. Our approach can effectively use the social
context information available for large quantities of unlabeled reviews. It
also has the advantage that the resulting predictor is usable even when social
context is unavailable. We validate our framework within a real commerce portal
and experimentally demonstrate that using social context information can help
improve the accuracy of review quality prediction especially when the available
training data is sparse. Keywords: graph regularization, review helpfulness, review quality, social network | |||
| Sampling community structure | | BIBAK | Full-Text | 701-710 | |
| Arun S. Maiya; Tanya Y. Berger-Wolf | |||
| We propose a novel method, based on concepts from expander graphs, to sample
communities in networks. We show that our sampling method, unlike previous
techniques, produces subgraphs representative of community structure in the
original network. These generated subgraphs may be viewed as stratified samples
in that they consist of members from most or all communities in the network.
Using samples produced by our method, we show that the problem of community
detection may be recast into a case of statistical relational learning. We
empirically evaluate our approach against several real-world datasets and
demonstrate that our sampling method can effectively be used to infer and
approximate community affiliation in the larger network. Keywords: clustering, community detection, complex networks, graphs, sampling, social
networks | |||
| Fast and parallel webpage layout | | BIBAK | Full-Text | 711-720 | |
| Leo A. Meyerovich; Rastislav Bodik | |||
| The web browser is a CPU-intensive program. Especially on mobile devices,
webpages load too slowly, expending significant time in processing a document's
appearance. Due to power constraints, most hardware-driven speedups will come
in the form of parallel architectures. This is also true of mobile devices such
as phones and e-books. In this paper, we introduce new algorithms for CSS
selector matching, layout solving, and font rendering, which represent key
components for a fast layout engine. Evaluation on popular sites shows speedups
as high as 80x. We also formulate the layout problem with attribute grammars,
enabling us not only to parallelize our algorithm but also to prove that it computes in
O(log) time and without reflow. Keywords: attribute grammar, box model, css, font, html, layout, mobile, multicore,
selector | |||
| Object views: fine-grained sharing in browsers | | BIBAK | Full-Text | 721-730 | |
| Leo A. Meyerovich; Adrienne Porter Felt; Mark S. Miller | |||
| Browsers do not currently support the secure sharing of JavaScript objects
between principals. We present this problem as the need for object views, which
are consistent and controllable versions of objects. Multiple views can be made
for the same object and customized for the recipients. We implement object
views with a JavaScript library that wraps shared objects and interposes on all
access attempts. The security challenge is to fully mediate access to objects
shared through a view and prevent privilege escalation. We discuss how object
views can be deployed in two settings: same-origin sharing with rewriting-based
JavaScript isolation systems like Google Caja, and inter-origin sharing between
browser frames over a message-passing channel.
To facilitate simple document sharing, we build a policy system for declaratively defining policies for document object views. Notably, our document policy system makes it possible to hide elements without breaking document structure invariants. Developers can control the fine-grained behavior of object views with an aspect system that accepts programmatic policies. Keywords: aspect, browser, ecmascript, javascript, membrane, reference monitor, remote
object, same origin policy, web security, wrapper | |||
| Protocol-aware matching of web service interfaces for adapter development | | BIBAK | Full-Text | 731-740 | |
| Hamid Reza Motahari Nezhad; Guang Yuan Xu; Boualem Benatallah | |||
| With the rapid growth in the number of online Web services, the problem of
service adaptation has received significant attention. In matching and
adaptation, the functional description of services including interface and data
as well as behavioral descriptions are important. Existing work on matching and
adaptation focuses only on one aspect.
In this paper, we present a semi-automated matching approach that considers both service descriptions. We introduce two protocol-aware service interface matching algorithms, i.e. depth-based interface matching and iterative reference-based interface matching. These algorithms refine the results of interface matching by incorporating the ordering constraints imposed by business protocol definitions on service operations. We have implemented a prototype and performed experiments using the specification of synthetic and real-world Web services. Experiments show that the proposed approaches lead to a significant improvement in the quality of matching between services. Keywords: adaptation, business process, web services | |||
| Volunteer computing: a model of the factors determining contribution to community-based scientific research | | BIBAK | Full-Text | 741-750 | |
| Oded Nov; David Anderson; Ofer Arazy | |||
| Volunteer computing is a powerful way to harness distributed resources to
perform large-scale tasks, similarly to other types of community-based
initiatives. Volunteer computing is based on two pillars: the first is
computational -- allocating and managing large computing tasks; the second is
participative -- making large numbers of individuals volunteer their computer
resources to a project. While the computational aspects of volunteer computing
received much research attention, the participative aspect remains largely
unexplored. In this study we aim to address this gap: by drawing on social
psychology and online communities research, we develop and test a
three-dimensional model of the factors determining volunteer computing users'
contribution. We investigate one of the largest volunteer computing projects --
SETI@home -- by linking survey data about contributors' motivations to their
activity logs. Our findings highlight the differences between volunteer
computing and other forms of community-based projects, and reveal the intricate
relationship between individual motivations, social affiliation, tenure in the
project, and resource contribution. Implications for research and practice are
discussed. Keywords: boinc, citizen science, crowdsourcing, motivations, online communities,
seti@home, volunteer computing | |||
| Cross-domain sentiment classification via spectral feature alignment | | BIBAK | Full-Text | 751-760 | |
| Sinno Jialin Pan; Xiaochuan Ni; Jian-Tao Sun; Qiang Yang; Zheng Chen | |||
| Sentiment classification aims to automatically predict sentiment polarity
(e.g., positive or negative) of users publishing sentiment data (e.g., reviews,
blogs). Although traditional classification algorithms can be used to train
sentiment classifiers from manually labeled text data, the labeling work can be
time-consuming and expensive. Meanwhile, users often use some different words
when they express sentiment in different domains. If we directly apply a
classifier trained in one domain to other domains, the performance will be very
low due to the differences between these domains. In this work, we develop a
general solution to sentiment classification when we do not have any labels in
a target domain but have some labeled data in a different domain, regarded as
source domain. In this cross-domain sentiment classification setting, to bridge
the gap between the domains, we propose a spectral feature alignment (SFA)
algorithm to align domain-specific words from different domains into unified
clusters, with the help of domain-independent words as a bridge. In this way,
the clusters can be used to reduce the gap between domain-specific words of the
two domains, which can be used to train sentiment classifiers in the target
domain accurately. Compared to previous approaches, SFA can discover a robust
representation for cross-domain data by fully exploiting the relationship
between the domain-specific and domain-independent words via simultaneously
co-clustering them in a common latent space. We perform extensive experiments
on two real world datasets, and demonstrate that SFA significantly outperforms
previous approaches to cross-domain sentiment classification. Keywords: domain adaptation, feature alignment, opinion mining, sentiment
classification, transfer learning | |||
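For readers who want a concrete picture of the alignment step described above, the following is a minimal sketch of a spectral co-clustering of domain-specific words against domain-independent pivot words. The matrix construction, variable names, and the choice of SVD are our assumptions for illustration, not the authors' released code.

```python
import numpy as np

def spectral_feature_alignment(cooccur, num_dims):
    """Sketch of a spectral alignment step (assumed SFA-style variant).

    cooccur: (num_pivots x num_specific) co-occurrence counts between
             domain-independent pivot words and domain-specific words,
             pooled over source and target domains.
    Returns an embedding in which domain-specific words that co-occur with
    the same pivots fall close together.
    """
    d1 = cooccur.sum(axis=1)                      # pivot degrees
    d2 = cooccur.sum(axis=0)                      # domain-specific word degrees
    D1 = np.diag(1.0 / np.sqrt(np.maximum(d1, 1e-12)))
    D2 = np.diag(1.0 / np.sqrt(np.maximum(d2, 1e-12)))
    A = D1 @ cooccur @ D2                         # normalized bipartite adjacency
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[:num_dims].T                        # one row per domain-specific word
```

The resulting rows could then be clustered (e.g. with k-means) and appended to the original bag-of-words features when training the target-domain sentiment classifier.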
| DSNotify: handling broken links in the web of data | | BIBAK | Full-Text | 761-770 | |
| Niko P. Popitsch; Bernhard Haslhofer | |||
| The Web of Data has emerged as a way of exposing structured linked data on
the Web. It builds on the central building blocks of the Web (URIs, HTTP) and
benefits from its simplicity and wide-spread adoption. It does, however, also
inherit the unresolved issues such as the broken link problem. Broken links
constitute a major challenge for actors consuming Linked Data as they require
them to deal with reduced accessibility of data. We believe that the broken
link problem is a major threat to the whole Web of Data idea and that both
Linked Data consumers and providers will require solutions that deal with this
problem. Since no general solutions for fixing such links in the Web of Data
have emerged, we make three contributions into this direction: first, we
provide a concise definition of the broken link problem and a comprehensive
analysis of existing approaches. Second, we present DSNotify, a generic
framework able to assist human and machine actors in fixing broken links. It
uses heuristic feature comparison and employs a time-interval-based blocking
technique for the underlying instance matching problem. Third, we derived
benchmark datasets from knowledge bases such as DBpedia and evaluated the
effectiveness of our approach with respect to the broken link problem. Our
results show the feasibility of a time-interval-based blocking approach for
systems that aim at detecting and fixing broken links in the Web of Data. Keywords: blocking, broken links, instance matching, link integrity, linked data | |||
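The time-interval-based blocking idea in the abstract above can be illustrated with a small sketch: a removed resource is only compared against resources that appeared within a bounded time window of its disappearance. The Event structure, similarity function, and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    uri: str
    timestamp: float   # seconds since epoch when the resource (dis)appeared
    features: dict     # heuristic features, e.g. {"label": ..., "type": ...}

def similarity(a, b):
    # Toy feature comparison: fraction of feature values that agree (hypothetical).
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / max(len(keys), 1)

def propose_fixes(removed, created, window=3600.0, threshold=0.8):
    """Blocking step: compare a removed resource only against resources created
    within `window` seconds of its disappearance, then keep the best match
    above a similarity threshold as a proposed link repair."""
    fixes = {}
    for r in removed:
        block = [c for c in created if abs(c.timestamp - r.timestamp) <= window]
        if not block:
            continue
        best = max(block, key=lambda c: similarity(r.features, c.features))
        if similarity(r.features, best.features) >= threshold:
            fixes[r.uri] = best.uri
    return fixes
```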
| Ad-hoc object retrieval in the web of data | | BIBAK | Full-Text | 771-780 | |
| Jeffrey Pound; Peter Mika; Hugo Zaragoza | |||
| Semantic Search refers to a loose set of concepts, challenges and techniques
having to do with harnessing the information of the growing Web of Data (WoD)
for Web search. Here we propose a formal model of one specific semantic search
task: ad-hoc object retrieval. We show that this task provides a solid
framework to study some of the semantic search problems currently tackled by
commercial Web search engines. We connect this task to the traditional ad-hoc
document retrieval and discuss appropriate evaluation metrics. Finally, we
carry out a realistic evaluation of this task in the context of a Web search
application. Keywords: evaluation, object retrieval, semantic search | |||
| Diversifying web search results | | BIBAK | Full-Text | 781-790 | |
| Davood Rafiei; Krishna Bharat; Anand Shukla | |||
| Result diversity is a topic of great importance as more facets of queries
are discovered and users expect to find their desired facets in the first page
of the results. However, the underlying questions of how 'diversity' interplays
with 'quality' and when preference should be given to one or both are not
well-understood. In this work, we model the problem as expectation maximization
and study the challenges of estimating the model parameters and reaching an
equilibrium. One model parameter, for example, is correlations between pages
which we estimate using textual contents of pages and click data (when
available). We conduct experiments on diversifying randomly selected queries
from a query log and the queries chosen from the disambiguation topics of
Wikipedia. Our algorithm improves upon Google in terms of the diversity of
random queries, retrieving 14% to 38% more aspects of queries in top 5, while
maintaining a precision very close to Google. On a more selective set of
queries that are expected to benefit from diversification, our algorithm
improves upon Google in terms of precision and diversity of the results, and
significantly outperforms another baseline system for result diversification. Keywords: query ambiguity, result diversity, search diversity | |||
| A large-scale active learning system for topical categorization on the web | | BIBAK | Full-Text | 791-800 | |
| Suju Rajan; Dragomir Yankov; Scott J. Gaffney; Adwait Ratnaparkhi | |||
| Many web applications such as ad matching systems, vertical search engines,
and page categorization systems require the identification of a particular type
or class of pages on the Web. The sheer number and diversity of the pages on
the Web, however, makes the problem of obtaining a good sample of the class of
interest hard. In this paper, we describe a successfully deployed end-to-end
system that starts from a biased training sample and makes use of several
state-of-the-art machine learning algorithms working in tandem, including a
powerful active learning component, in order to achieve a good classification
system. The system is evaluated on traffic from a real-world ad-matching
platform and is shown to achieve high categorization effectiveness with a
significant reduction in editorial effort and labeling time. Keywords: active learning, classification with imbalanced datasets, svms, web scale
performance evaluation | |||
| Distributing private data in challenged network environments | | BIBAK | Full-Text | 801-810 | |
| Azarias Reda; Brian Noble; Yidnekachew Haile | |||
| Developing countries face significant challenges in network access, making
even simple network tasks unpleasant. Many standard techniques -- caching and
predictive prefetching -- help somewhat, but provide little or no assistance
for personal data that is needed only by a single user. Sulula addresses this
problem by leveraging the near-ubiquity of cellular phones able to send and
receive simple SMS messages. Rather than visit a kiosk and fetch data on demand
-- a tiresome process at best -- users request a future visit. If capacity
exists, the kiosk can schedule secure retrieval of that user's data, saving
time and more efficiently utilizing the kiosk's limited connectivity. When the
user arrives at a provisioned kiosk, she need only obtain the session key
on-demand, and thereafter has instant access. In addition, Sulula allows users
to schedule data uploads. Experimental results show significant gains for the
end user, saving tens of minutes of time for a typical email/news reading
session. We also describe a small, ongoing deployment in-country for
proof-of-concept, lessons learned from that experience, and provide a
discussion on pricing and marketplace issues that remain to be addressed to
make the system viable for developing-world access. Keywords: bandwidth, caching, developing regions, Ethiopia, latency, limited
connectivity, personal data, prefetching, sms, www access | |||
| Factorizing personalized Markov chains for next-basket recommendation | | BIBAK | Full-Text | 811-820 | |
| Steffen Rendle; Christoph Freudenthaler; Lars Schmidt-Thieme | |||
| Recommender systems are an important component of many websites. Two of the
most popular approaches are based on matrix factorization (MF) and Markov
chains (MC). MF methods learn the general taste of a user by factorizing the
matrix over observed user-item preferences. On the other hand, MC methods model
sequential behavior by learning a transition graph over items that is used to
predict the next action based on the recent actions of a user. In this paper,
we present a method bringing both approaches together. Our method is based on
personalized transition graphs over underlying Markov chains. That means that a
separate transition matrix is learned for each user -- thus in total the method uses
a transition cube. As the observations for estimating the transitions are
usually very limited, our method factorizes the transition cube with a pairwise
interaction model which is a special case of the Tucker Decomposition. We show
that our factorized personalized MC (FPMC) model subsumes both a common Markov
chain and the normal matrix factorization model. For learning the model
parameters, we introduce an adaptation of the Bayesian Personalized Ranking (BPR)
framework for sequential basket data. Empirically, we show that our FPMC model
outperforms both the common matrix factorization and the unpersonalized MC
model both learned with and without factorization. Keywords: basket recommendation, Markov chain, matrix factorization | |||
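The pairwise-interaction scoring described in the abstract above (a user-item matrix-factorization term plus a factorized item-to-item Markov term averaged over the last basket) can be sketched as follows; the factor names and shapes are our own notation, not the authors' code.

```python
import numpy as np

def fpmc_score(u, i, last_basket, VUI, VIU, VIL, VLI):
    """Pairwise-interaction score for item i given user u and the last basket.

    VUI[u] . VIU[i] is the matrix-factorization (general taste) term;
    VIL[i] . VLI[l], averaged over items l in the last basket, is the
    factorized Markov-chain term.
    """
    mf_term = VUI[u] @ VIU[i]
    mc_term = np.mean([VIL[i] @ VLI[l] for l in last_basket]) if last_basket else 0.0
    return mf_term + mc_term

# Usage sketch: rank all items as candidates for user 0's next basket.
# rng = np.random.default_rng(0); k, n_users, n_items = 8, 100, 500
# VUI, VIU = rng.normal(size=(n_users, k)), rng.normal(size=(n_items, k))
# VIL, VLI = rng.normal(size=(n_items, k)), rng.normal(size=(n_items, k))
# ranking = sorted(range(n_items),
#                  key=lambda i: -fpmc_score(0, i, [3, 17], VUI, VIU, VIL, VLI))
```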
| Sketcha: a captcha based on line drawings of 3D models | | BIBAK | Full-Text | 821-830 | |
| Steven A. Ross; J. Alex Halderman; Adam Finkelstein | |||
| This paper introduces a captcha based on upright orientation of line
drawings rendered from 3D models. The models are selected from a large
database, and images are rendered from random viewpoints, affording many
different drawings from a single 3D model. The captcha presents the user with a
set of images, and the user must choose an upright orientation for each image.
This task generally requires understanding of the semantic content of the
image, which is believed to be difficult for automatic algorithms. We describe
a process called covert filtering whereby the image database can be continually
refreshed with drawings that are known to have a high success rate for humans,
by inserting randomly into the captcha new images to be evaluated. Our analysis
shows that covert filtering can ensure that captchas are likely to be solvable
by humans while deterring attackers who wish to learn a portion of the
database. We performed several user studies that evaluate how effectively
people can solve the captcha. Comparing these results to an attack based on
machine learning, we find that humans possess a substantial performance
advantage over computers. Keywords: 3D models, captcha, drawings, security | |||
| Unlocking the semantics of multimedia presentations in the web with the multimedia metadata ontology | | BIBAK | Full-Text | 831-840 | |
| Carsten Saathoff; Ansgar Scherp | |||
| The semantics of rich multimedia presentations on the web, such as SMIL, SVG,
and Flash, can be understood by today's search engines only to a very limited
extent, if at all. This hampers the retrieval of such presentations and makes their
archival and management a difficult task. Existing metadata models and metadata
standards are either conceptually too narrow, focus on a specific media type
only, cannot be used and combined together, or are not practically applicable
for the semantic description of rich multimedia presentations.
| In this paper, we propose the Multimedia Metadata Ontology (M3O) for annotating rich, structured multimedia presentations. The M3O provides a generic modeling framework for representing sophisticated multimedia metadata. It allows for integrating the features provided by the existing metadata models and metadata standards. Our approach is based on Semantic Web technologies and can be easily integrated with multimedia formats such as the W3C standards SMIL and SVG. With the M3O, we unlock the semantics of rich multimedia presentations in the web by making the semantics machine-readable and machine-understandable. The M3O is used with our SemanticMM4U framework for the multi-channel generation of semantically-rich multimedia presentations. Keywords: multimedia metadata, rich multimedia presentations, semantic annotation | |||
| Clustering query refinements by user intent | | BIBAK | Full-Text | 841-850 | |
| Eldar Sadikov; Jayant Madhavan; Lu Wang; Alon Halevy | |||
| We address the problem of clustering the refinements of a user search query.
The clusters computed by our proposed algorithm can be used to improve the
selection and placement of the query suggestions proposed by a search engine,
and can also serve to summarize the different aspects of information relevant
to the original user query. Our algorithm clusters refinements based on their
likely underlying user intents by combining document click and session
co-occurrence information. At its core, our algorithm operates by performing
multiple random walks on a Markov graph that approximates user search behavior.
A user study performed on top search engine queries shows that our clusters are
rated better than corresponding clusters computed using approaches that use
only document click or only sessions co-occurrence information. Keywords: clustering, query refinements, random walks | |||
| Earthquake shakes Twitter users: real-time event detection by social sensors | | BIBAK | Full-Text | 851-860 | |
| Takeshi Sakaki; Makoto Okazaki; Yutaka Matsuo | |||
| Twitter, a popular microblogging service, has received much attention
recently. An important characteristic of Twitter is its real-time nature. For
example, when an earthquake occurs, people make many Twitter posts (tweets)
related to the earthquake, which enables detection of earthquake occurrence
promptly, simply by observing the tweets. As described in this paper, we
investigate the real-time interaction of events such as earthquakes in Twitter
and propose an algorithm to monitor tweets and to detect a target event. To
detect a target event, we devise a classifier of tweets based on features such
as the keywords in a tweet, the number of words, and their context.
Subsequently, we produce a probabilistic spatiotemporal model for the target
event that can find the center and the trajectory of the event location. We
consider each Twitter user as a sensor and apply Kalman filtering and particle
filtering, which are widely used for location estimation in
ubiquitous/pervasive computing. The particle filter works better than other
comparable methods for estimating the centers of earthquakes and the
trajectories of typhoons. As an application, we construct an earthquake
reporting system in Japan. Because of the numerous earthquakes and the large
number of Twitter users throughout the country, we can detect an earthquake
with high probability (96% of earthquakes of Japan Meteorological Agency (JMA)
seismic intensity scale 3 or more are detected) merely by monitoring tweets.
Our system detects earthquakes promptly and sends e-mails to registered users.
Notification is delivered much faster than the announcements that are broadcast
by the JMA. Keywords: Twitter, earthquake, event detection, location estimation, social sensor | |||
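As a rough illustration of the social-sensor view described above, the sketch below runs a toy particle filter that treats each geo-tagged, event-positive tweet as a noisy reading of the event center. The prior range, noise levels, and random-drift motion model are illustrative assumptions, not the paper's calibrated model.

```python
import numpy as np

def estimate_event_center(tweet_locations, n_particles=2000, obs_sigma=1.0, drift=0.05):
    """Toy particle filter over a 2-D event location, updated per tweet."""
    rng = np.random.default_rng(0)
    particles = rng.uniform(-10.0, 10.0, size=(n_particles, 2))    # prior over the map
    for obs in np.asarray(tweet_locations, dtype=float):
        particles += rng.normal(0.0, drift, size=particles.shape)  # small drift step
        d2 = np.sum((particles - obs) ** 2, axis=1)
        weights = np.exp(-d2 / (2.0 * obs_sigma ** 2)) + 1e-12     # likelihood weights
        weights /= weights.sum()
        resampled = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[resampled]
    return particles.mean(axis=0)   # estimated event center

# estimate_event_center([(2.1, 3.0), (1.8, 2.7), (2.3, 3.2)])
```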
| Measurement-calibrated graph models for social network experiments | | BIBAK | Full-Text | 861-870 | |
| Alessandra Sala; Lili Cao; Christo Wilson; Robert Zablit; Haitao Zheng; Ben Y. Zhao | |||
| Access to realistic, complex graph datasets is critical to research on
social networking systems and applications. Simulations on graph data provide
critical evaluation of new systems and applications ranging from community
detection to spam filtering and social web search. Due to the high time and
resource costs of gathering real graph datasets through direct measurements,
researchers are anonymizing and sharing a small number of valuable datasets
with the community. However, performing experiments using shared real datasets
faces three key disadvantages: concerns that graphs can be de-anonymized to
reveal private information, increasing costs of distributing large datasets,
and that a small number of available social graphs limits the statistical
confidence in the results.
The use of measurement-calibrated graph models is an attractive alternative to sharing datasets. Researchers can "fit" a graph model to a real social graph, extract a set of model parameters, and use them to generate multiple synthetic graphs statistically similar to the original graph. While numerous graph models have been proposed, it is unclear if they can produce synthetic graphs that accurately match the properties of the original graphs. In this paper, we explore the feasibility of measurement-calibrated synthetic graphs using six popular graph models and a variety of real social graphs gathered from the Facebook social network ranging from 30,000 to 3 million edges. We find that two models consistently produce synthetic graphs with common graph metric values similar to those of the original graphs. However, only one produces high fidelity results in our application-level benchmarks. While this shows that graph models can produce realistic synthetic graphs, it also highlights the fact that current graph metrics remain incomplete, and some applications expose graph properties that do not map to existing metrics. Keywords: graph models, social networking | |||
| Monitoring algorithms for negative feedback systems | | BIBAK | Full-Text | 871-880 | |
| Mark Sandler; S. Muthukrishnan | |||
| There are many online systems where millions of users post original content
such as videos, reviews of items such as products, services and businesses,
etc. While there are general rules for good behavior or even formal Terms of
Service, there are still users who post content that is not suitable.
Increasingly, online systems rely on other users who view the posted content to
provide feedback.
We study online systems where users report negative feedback, i.e., report abuse; these systems are quite distinct from much-studied traditional reputation systems that focus on eliciting popularity of content by various voting methods. The central problem that we study here is how to monitor the quality of negative feedback, that is, detect negative feedback which is incorrect, or perhaps even malicious. Systems address this problem by testing flags manually, which is an expensive operation. As a result, there is a tradeoff between the number of manual tests and the number of errors, defined as the number of incorrect flags the monitoring system misses. Our contributions are as follows:
* We initiate a systematic study of negative feedback systems. Our framework is general enough to be applicable to a variety of systems. In this framework, the number of errors the system admits is bounded over the worst case of adversarial users, while simultaneously the system performs only a small amount of manual testing for the multitude of standard users who might still err while reporting.
* Our main contribution is a randomized monitoring algorithm that we call Adaptive Probabilistic Testing (APT), which is simple to implement and has guarantees on the expected number of errors. Even for adversarial users, the total expected error is bounded by εN over N flags for a given ε > 0. Simultaneously, the number of tests performed by the algorithm is within a constant factor of the best possible algorithm for standard users.
* Finally, we present an empirical study of our algorithm that shows its performance on both synthetic data and real data accumulated from a variety of negative feedback systems at Google. Our study indicates that the algorithm performs better than the analysis above shows. Keywords: negative feedback systems, probabilistic analysis, user reputation | |||
| Exploiting query reformulations for web search result diversification | | BIBAK | Full-Text | 881-890 | |
| Rodrygo L. T. Santos; Craig Macdonald; Iadh Ounis | |||
| When a Web user's underlying information need is not clearly specified from
the initial query, an effective approach is to diversify the results retrieved
for this query. In this paper, we introduce a novel probabilistic framework for
Web search result diversification, which explicitly accounts for the various
aspects associated to an underspecified query. In particular, we diversify a
document ranking by estimating how well a given document satisfies each
uncovered aspect and the extent to which different aspects are satisfied by the
ranking as a whole. We thoroughly evaluate our framework in the context of the
diversity task of the TREC 2009 Web track. Moreover, we exploit query
reformulations provided by three major Web search engines (WSEs) as a means to
uncover different query aspects. The results attest the effectiveness of our
framework when compared to state-of-the-art diversification approaches in the
literature. Additionally, by simulating an upper-bound query reformulation
mechanism from official TREC data, we draw useful insights regarding the
effectiveness of the query reformulations generated by the different WSEs in
promoting diversity. Keywords: diversity, relevance, web search | |||
| How useful are your comments?: analyzing and predicting youtube comments and comment ratings | | BIBAK | Full-Text | 891-900 | |
| Stefan Siersdorfer; Sergiu Chelaru; Wolfgang Nejdl; Jose San Pedro | |||
| An analysis of the social video sharing platform YouTube reveals a high
amount of community feedback through comments for published videos as well as
through meta ratings for these comments. In this paper, we present an in-depth
study of commenting and comment rating behavior on a sample of more than 6
million comments on 67,000 YouTube videos for which we analyzed dependencies
between comments, views, comment ratings and topic categories. In addition, we
studied the influence of sentiment expressed in comments on the ratings for
these comments using the SentiWordNet thesaurus, a lexical WordNet-based
resource containing sentiment annotations. Finally, to predict community
acceptance for comments not yet rated, we built different classifiers for the
estimation of ratings for these comments. The results of our large-scale
evaluations are promising and indicate that community feedback on already rated
comments can help to filter new unrated comments or suggest particularly useful
but still unrated comments. Keywords: YouTube, comment ratings, community feedback | |||
| Optimal rare query suggestion with implicit user feedback | | BIBAK | Full-Text | 901-910 | |
| Yang Song; Li-wei He | |||
| Query suggestion has been an effective approach to help users narrow down to
the information they need. However, most existing studies have focused only on
popular/head queries. Since rare queries possess much less information (e.g.,
clicks) than popular queries in the query logs, it is much more difficult to
efficiently suggest relevant queries to a rare query. In this paper, we propose
an optimal rare query suggestion framework by leveraging implicit feedback
from users in the query logs. Our model resembles the principle of
pseudo-relevance feedback which assumes that top-returned results by search
engines are relevant. However, we argue that the clicked URLs and skipped URLs
contain different levels of information and thus should be treated differently.
Hence, our framework optimally combines both the click and skip information
from users and uses a random walk model to optimize the query correlation. Our
model specifically optimizes two parameters: (1) the restarting (jumping) rate
of random walk, and (2) the combination ratio of click and skip information.
Unlike the Rocchio algorithm, our learning process does not involve the content
of the URLs but simply leverages the click and skip counts in the query-URL
bipartite graphs. Consequently, our model is capable of scaling up to the need
of commercial search engines. Experimental results on one-month query logs from
a large commercial search engine with over 40 million rare queries demonstrate
the superiority of our framework, with statistical significance, over the
traditional random walk models and pseudo-relevance feedback models. Keywords: pseudo-relevance feedback, query suggestion, random walk | |||
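A plain random-walk-with-restart on the query-URL bipartite graph, with the two parameters the abstract above highlights (restart rate and click/skip combination ratio), might look like the sketch below; the exact update rule in the paper may differ.

```python
import numpy as np

def suggest_for_rare_query(C, S, source, alpha=0.15, beta=0.7, iters=100):
    """Random walk with restart on the query-URL bipartite graph.

    C, S: (num_queries x num_urls) click and skip counts from the log.
    alpha: restart (jumping) rate; beta: click/skip combination ratio --
    the two parameters the paper tunes, though this plain RWR update is ours.
    """
    W = beta * C + (1.0 - beta) * S
    P_qu = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)      # query -> url
    P_uq = W.T / np.maximum(W.T.sum(axis=1, keepdims=True), 1e-12)  # url -> query
    restart = np.zeros(W.shape[0]); restart[source] = 1.0
    p = restart.copy()
    for _ in range(iters):
        p = alpha * restart + (1.0 - alpha) * (P_uq.T @ (P_qu.T @ p))
    p[source] = 0.0                       # do not suggest the query itself
    return np.argsort(-p)                 # queries ranked as suggestions
```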
| What are the most eye-catching and ear-catching features in the video?: implications for video summarization | | BIBAK | Full-Text | 911-920 | |
| Yaxiao Song; Gary Marchionini; Chi Young Oh | |||
| Video summarization is a mechanism for generating short summaries of the
video to help people quickly make sense of the content of the video before
downloading or seeking more detailed information. To produce reliable automatic
video summarization algorithms, it is essential to first understand how human
beings create video summaries with manual efforts. This paper focuses on a
corpus of instructional documentary video, and seeks to improve automatic video
summaries by understanding what features in the video catch the eyes and ears
of human assessors, and using these findings to inform automatic summarization
algorithms. The paper contributes a thorough and valuable methodology for
performing automatic video summarization, and the methodology can be extended
to inform the summarization of other video corpora. Keywords: audio salience, video summarization, visual salience | |||
| Reining in the web with content security policy | | BIBAK | Full-Text | 921-930 | |
| Sid Stamm; Brandon Sterne; Gervase Markham | |||
| The last three years have seen a dramatic increase in both awareness and
exploitation of Web Application Vulnerabilities. 2008 and 2009 saw dozens of
high-profile attacks against websites using Cross Site Scripting (XSS) and
Cross Site Request Forgery (CSRF) for the purposes of information stealing,
website defacement, malware planting, clickjacking, etc. While an ideal
solution may be to develop web applications free from any exploitable
vulnerabilities, real world security is usually provided in layers.
We present content restrictions, and a content restriction enforcement scheme called Content Security Policy (CSP), which is intended to be one such layer. Content restrictions allow site designers or server administrators to specify how content interacts on their web sites -- a security mechanism desperately needed by the untamed Web. These content restriction rules are activated and enforced by supporting web browsers when a policy is provided for a site via HTTP, and we show how a system such as CSP can be effective in locking down sites and providing an early-alert system for vulnerabilities on a web site. Our scheme is also easily deployed, which is made evident by our prototype implementation in Firefox and on the Mozilla Add-Ons web site. Keywords: content restrictions, http, security policy, web security | |||
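To make the policy idea above concrete, here is an illustrative policy built with directive names from the later standardized Content-Security-Policy header; the prototype described in the paper used an earlier, experimental syntax, and the hosts and report endpoint below are hypothetical.

```python
def content_restriction_headers(report_only=False):
    """Illustrative content-restriction policy in the spirit of CSP."""
    policy = "; ".join([
        "default-src 'self'",                              # resources only from our origin
        "script-src 'self' https://static.example.org",    # one trusted script host (hypothetical)
        "img-src *",                                        # images may come from anywhere
        "report-uri /csp-violation-report",                 # early-alert endpoint (hypothetical)
    ])
    name = ("Content-Security-Policy-Report-Only" if report_only
            else "Content-Security-Policy")
    return {name: policy}

# A server would attach content_restriction_headers() to each HTTP response;
# supporting browsers then block or report content that violates the policy.
```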
| Visualizing differences in web search algorithms using the expected weighted Hoeffding distance | | BIBAK | Full-Text | 931-940 | |
| Mingxuan Sun; Guy Lebanon; Kevyn Collins-Thompson | |||
| We introduce a new dissimilarity function for ranked lists, the expected
weighted Hoeffding distance, that has several advantages over current
dissimilarity measures for ranked search results. First, it is easily
customized for users who pay varying degrees of attention to websites at
different ranks. Second, unlike existing measures such as generalized Kendall's
tau, it is based on a true metric, preserving meaningful embeddings when
visualization techniques like multi-dimensional scaling are applied. Third, our
measure can effectively handle partial or missing rank information while
retaining a probabilistic interpretation. Finally, the measure can be made
computationally tractable and we give a highly efficient algorithm for
computing it. We then apply our new metric with multi-dimensional scaling to
visualize and explore relationships between the result sets from different
search engines, showing how the weighted Hoeffding distance can distinguish
important differences in search engine behavior that are not apparent with
other rank-distance metrics. Such visualizations are highly effective at
summarizing and analyzing insights on which search engines to use, what search
strategies users can employ, and how search results evolve over time. We
demonstrate our techniques using a collection of popular search engines, a
representative set of queries, and frequently used query manipulation methods. Keywords: expected weighted Hoeffding distance, ranking, search algorithm
dissimilarity | |||
| Alhambra: a system for creating, enforcing, and testing browser security policies | | BIBAK | Full-Text | 941-950 | |
| Shuo Tang; Chris Grier; Onur Aciicmez; Samuel T. King | |||
| Alhambra is a browser-based system designed to enforce and test web browser
security policies. At the core of Alhambra is a policy-enhanced browser
supporting fine-grain security policies that restrict web page contents and
execution. Alhambra requires no server-side modifications or additions to the
web application. Policies can restrict the construction of the document as well
as the execution of JavaScript using access control rules and a taint-tracking
engine. Using the Alhambra browser, we present two security policies that we
have built using our architecture, both designed to prevent cross-site
scripting. The first policy uses a taint-tracking engine to prevent cross-site
scripting attacks that exploit bugs in the client-side of the web applications.
The second one uses browsing history to create policies that restrict the
contents of documents and prevent the inclusion of malicious content.
Using Alhambra we analyze the impact of policies on the compatibility of web pages. To test compatibility, Alhambra supports revisiting user-generated browsing sessions and comparing multiple security policies in parallel to quickly and automatically evaluate security policies. To compare security policies for identical pages we have also developed useful comparison metrics that quantify differences between identical pages executed with different security policies. Not only do we show that our policies are effective with minimal compatibility cost, we also demonstrate that Alhambra can enforce strong security policies and provide quantitative evaluation of the differences introduced by security policies. Keywords: cross-site scripting, web browser, web security | |||
| Atomate it! end-user context-sensitive automation using heterogeneous information sources on the web | | BIBAK | Full-Text | 951-960 | |
| Max Van Kleek; Brennan Moore; David R. Karger; Paul André; m.c. schraefel | |||
| The transition of personal information management (PIM) tools off the
desktop to the Web presents an opportunity to augment these tools with
capabilities provided by the wealth of real-time information readily available.
In this paper, we describe a next-generation personal information assistance
engine that lets end-users delegate to it various simple context- and
activity-reactive tasks and reminders. Our system, Atomate, treats RSS/ATOM
feeds from social networking and life-tracking sites as sensor streams,
integrating information from such feeds into a simple unified RDF world model
representing people, places, and things and their time-varying states and
activities. Combined with other information sources on the web, including the
user's online calendar, web-based e-mail client, news feeds and messaging
services, Atomate can be made to automatically carry out a variety of simple
tasks for the user, ranging from context-aware filtering and messaging, to
sharing and social coordination actions. Atomate's open architecture and world
model easily accommodate new information sources and actions via the addition
of feeds and web services. To make routine use of the system easy for
non-programmers, Atomate provides a constrained-input natural language
interface (CNLI) for behavior specification, and a direct-manipulation
interface for inspecting and updating its world model. Keywords: context-aware computing, end-user programming, reactive automation, web
mash-ups | |||
| Faceted exploration of image search results | | BIBAK | Full-Text | 961-970 | |
| Roelof van Zwol; Börkur Sigurbjornsson; Ramu Adapala; Lluis Garcia Pueyo; Abhinav Katiyar; Kaushal Kurapati; Mridul Muralidharan; Sudar Muthu; Vanessa Murdock; Polly Ng; Anand Ramani; Anuj Sahai; Sriram Thiru Sathish; Hari Vasudev; Upendra Vuyyuru | |||
| This paper describes MediaFaces, a system that enables faceted exploration
of media collections. The system processes semi-structured information sources
to extract objects and facets, e.g. the relationships between two objects.
Next, we rank the facets based on a statistical analysis of image search query
logs, and the tagging behaviour of users annotating photos in Flickr. For a
given object of interest, we can then retrieve the top-k most relevant facets
and present them to the user. The system is currently deployed in production by
Yahoo!'s image search engine. We present the system architecture, its
main components, and the application of the system as part of the image search
experience. Keywords: faceted browsing, image search, object extraction, ranking | |||
| CETR: content extraction via tag ratios | | BIBAK | Full-Text | 971-980 | |
| Tim Weninger; William H. Hsu; Jiawei Han | |||
| We present Content Extraction via Tag Ratios (CETR) -- a method to extract
content text from diverse webpages by using the HTML document's tag ratios. We
describe how to compute tag ratios on a line-by-line basis and then cluster the
resulting histogram into content and non-content areas. Initially, we find that
the tag ratio histogram is not easily clustered because of its
one-dimensionality; therefore we extend the original approach in order to model
the data in two dimensions. Next, we present a tailored clustering technique
which operates on the two-dimensional model, and then evaluate our approach
against a large set of alternative methods using standard accuracy, precision
and recall metrics on a large and varied Web corpus. Finally, we show that, in
most cases, CETR achieves better content extraction performance than existing
methods, especially across varying web domains, languages and styles. Keywords: content extraction, tag ratio, world wide web | |||
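The per-line tag-ratio signal described above is simple to compute; a minimal sketch follows. The smoothing radius and the use of a plain moving average are our choices (the paper pairs the smoothed ratio with its derivative to form the two-dimensional model before clustering).

```python
import re

def tag_ratios(html):
    """Per-line tag ratio: length of non-tag text divided by the number of tags."""
    ratios = []
    for line in html.splitlines():
        tags = re.findall(r"<[^>]*>", line)
        text = re.sub(r"<[^>]*>", "", line)
        ratios.append(len(text.strip()) / max(len(tags), 1))
    return ratios

def smooth(values, radius=2):
    """Moving-average smoothing before clustering into content/non-content lines."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out
```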
| Modeling relationship strength in online social networks | | BIBAK | Full-Text | 981-990 | |
| Rongjing Xiang; Jennifer Neville; Monica Rogati | |||
| Previous work analyzing social networks has mainly focused on binary
friendship relations. However, in online social networks the low cost of link
formation can lead to networks with heterogeneous relationship strengths (e.g.,
acquaintances and best friends mixed together). In this case, the binary
friendship indicator provides only a coarse representation of relationship
information. In this work, we develop an unsupervised model to estimate
relationship strength from interaction activity (e.g., communication, tagging)
and user similarity. More specifically, we formulate a link-based latent
variable model, along with a coordinate ascent optimization procedure for the
inference. We evaluate our approach on real-world data from Facebook and
LinkedIn, showing that the estimated link weights result in higher
autocorrelation and lead to improved classification accuracy. Keywords: homophily, randomization, social influence, social networks | |||
| Automatic extraction of clickable structured web contents for name entity queries | | BIBAK | Full-Text | 991-1000 | |
| Xiaoxin Yin; Wenzhao Tan; Xiao Li; Yi-Chin Tu | |||
| Today the major web search engines answer queries by showing ten result
snippets, which need to be inspected by users for identifying relevant results.
In this paper we investigate how to extract structured information from the
web, in order to directly answer queries by showing the contents being searched
for. We treat users' search trails (i.e., post-search browsing behaviors) as
implicit labels on the relevance between web contents and user queries. Based
on such labels we use information extraction approach to build wrappers and
extract structured information. An important observation is that many web sites
contain pages for name entities of certain categories (e.g., AOL Music contains
a page for each musician), and these pages have the same format. This makes it
possible to build wrappers from a small amount of implicit labels, and use them
to extract structured information from many web pages for different name
entities. We propose STRUCLICK, a fully automated system for extracting
structured information for queries containing name entities of certain
categories. It can identify important web sites from web search logs, build
wrappers from users' search trails, filter out bad wrappers built from random
user clicks, and combine structured information from different web sites for
each query. Comparing with existing approaches on information extraction,
STRUCLICK can assign semantics to extracted data without any human labeling or
supervision. We perform comprehensive experiments, which show STRUCLICK
achieves high accuracy and good scalability. Keywords: information extraction, web search | |||
| Building taxonomy of web search intents for name entity queries | | BIBAK | Full-Text | 1001-1010 | |
| Xiaoxin Yin; Sarthak Shah | |||
| A significant portion of web search queries are name entity queries. The
major search engines have been exploring various ways to provide better user
experiences for name entity queries, such as showing "search tasks" (Bing
search) and showing direct answers (Yahoo!, Kosmix). In order to provide the
search tasks or direct answers that can satisfy most popular user intents, we
need to capture these intents, together with relationships between them.
In this paper we propose an approach for building a hierarchical taxonomy of the generic search intents for a class of name entities (e.g., musicians or cities). The proposed approach can find phrases representing generic intents from user queries, and organize these phrases into a tree, so that phrases indicating equivalent or similar meanings are on the same node, and the parent-child relationships of tree nodes represent the relationships between search intents and their sub-intents. Three different methods are proposed for tree building, which are based on directed maximum spanning tree, hierarchical agglomerative clustering, and pachinko allocation model. Our approaches are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia. With the evaluation by human judges (via Mechanical Turk), it is shown that our approaches can build trees of phrases that capture the relationships between important search intents. Keywords: query clustering, web search intent | |||
| Beyond position bias: examining result attractiveness as a source of presentation bias in clickthrough data | | BIBAK | Full-Text | 1011-1018 | |
| Yisong Yue; Rajan Patel; Hein Roehrig | |||
| Leveraging clickthrough data has become a popular approach for evaluating
and optimizing information retrieval systems. Although data is plentiful, one
must take care when interpreting clicks, since user behavior can be affected by
various sources of presentation bias. While the issue of position bias in
clickthrough data has been the topic of much study, other presentation bias
effects have received comparatively little attention. For instance, since users
must decide whether to click on a result based on its summary (e.g., the title,
URL and abstract), one might expect clicks to favor "more attractive" results.
In this paper, we examine result summary attractiveness as a potential source
of presentation bias. This study distinguishes itself from prior work by aiming
to detect systematic biases in click behavior due to attractive summaries
inflating perceived relevance. Our experiments conducted on the Google web
search engine show substantial evidence of presentation bias in clicks towards
results with more attractive titles. Keywords: click logs, implicit feedback, presentation bias | |||
| Statistical models of music-listening sessions in social media | | BIBAK | Full-Text | 1019-1028 | |
| Elena Zheleva; John Guiver; Eduarda Mendes Rodrigues; Natasa Milić-Frayling | |||
| User experience in social media involves rich interactions with the media
content and other participants in the community. In order to support such
communities, it is important to understand the factors that drive the users'
engagement. In this paper we show how to define statistical models of different
complexity to describe patterns of song listening in an online music community.
First, we adapt the LDA model to capture music taste from listening activities
across users and identify both the groups of songs associated with the specific
taste and the groups of listeners who share the same taste. Second, we define a
graphical model that takes into account listening sessions and captures the
listening moods of users in the community. Our session model leads to groups of
songs and groups of listeners with similar behavior across listening sessions
and enables faster inference when compared to the LDA model. Our experiments
with the data from an online media site demonstrate that the session model is
better in terms of the perplexity compared to two other models: the LDA-based
taste model that does not incorporate cross-session information and a baseline
model that does not use latent groupings of songs. Keywords: collaborative filtering, graphical models, mood, music, recommendations,
sessions, social media, taste | |||
| Collaborative location and activity recommendations with GPS history data | | BIBAK | Full-Text | 1029-1038 | |
| Vincent W. Zheng; Yu Zheng; Xing Xie; Qiang Yang | |||
| With the increasing popularity of location-based services, such as tour
guide and location-based social network, we now have accumulated many location
data on the Web. In this paper, we show that, by using the location data based
on GPS and users' comments at various locations, we can discover interesting
locations and possible activities that can be performed there for
recommendations. Our research is highlighted in the following location-related
queries in our daily life: 1) if we want to do something such as sightseeing or
food-hunting in a large city such as Beijing, where should we go? 2) If we have
already visited some places such as the Bird's Nest building in Beijing's
Olympic park, what else can we do there? By using our system, for the first
question, we can recommend her to visit a list of interesting locations such as
Tiananmen Square, Bird's Nest, etc. For the second question, if the user visits
Bird's Nest, we can recommend her to not only do sightseeing but also to
experience its outdoor exercise facilities or try some nice food nearby. To
achieve this goal, we first model the users' location and activity histories
that we take as input. We then mine knowledge, such as the location features
and activity-activity correlations from the geographical databases and the Web,
to gather additional inputs. Finally, we apply a collective matrix
factorization method to mine interesting locations and activities, and use them
to recommend to the users where they can visit if they want to perform some
specific activities and what they can do if they visit some specific places. We
empirically evaluated our system using a large GPS dataset collected by 162
users over a period of 2.5 years in the real-world. We extensively evaluated
our system and showed that our system can outperform several state-of-the-art
baselines. Keywords: collaborative filtering, location and activity recommendations | |||
| Measurement and analysis of an online content voting network: a case study of Digg | | BIBAK | Full-Text | 1039-1048 | |
| Yingwu Zhu | |||
| In online content voting networks, aggregate user activities (e.g.,
submitting and rating content) make high-quality content thrive through the
unprecedented scale, high dynamics and divergent quality of user generated
content (UGC). To better understand the nature and impact of online content
voting networks, we have analyzed Digg, a popular online social news aggregator
and rating website. Based on a large amount of data collected, we provide an
in-depth study of Digg. We study structural properties of the Digg social network,
revealing some strikingly distinct properties such as low link symmetry and the
power-law distribution of node outdegree with truncated tails. We explore the
impact of the social network on user digging activities, and investigate the
issues of content promotion, content filtering, vote spam and content
censorship, which are inherent to content rating networks. We also provide
insight into design of content promotion algorithms and recommendation-assisted
content discovery. Overall, we believe that the results presented in this paper
are crucial in understanding online content rating networks. Keywords: content filtering, content promotion, social networks | |||
| What you see is what you search: adaptive visual search framework for the web | | BIBAK | Full-Text | 1049-1050 | |
| Jae-wook Ahn; Peter Brusilovsky | |||
| Information retrieval is one of the most popular information access methods
for overcoming the information overload problem of the Web. However, its
interaction model is still utilizing the old text-based ranked lists and static
interaction algorithm. In this paper, we introduce our adaptive visualization
approach for searching the Web, which we call Adaptive VIBE. It is an extended
version of a reference point-based spatial visualization algorithm, and is
designed to serve as a user interaction module for a personalized search
system. Personalized search can incorporate dynamic user interests and
different contexts, improving search results. When it is combined with adaptive
visualization, it can encourage users to become involved in the search process
more actively by exploring the information space and learning new facts for
effective searching. In this paper, we introduce the rationale and functions of
our adaptive visualization approach and discuss the approaches' potential to
create a better search environment for the Web. Keywords: adaptive visualization, exploratory search, personalized search, user
modeling, vibe | |||
| RESTler: crawling RESTful services | | BIBAK | Full-Text | 1051-1052 | |
| Rosa Alarcón; Erik Wilde | |||
| Service descriptions allow designers to document, understand, and use
services, creating new useful and complex services with aggregated business
value. Unlike RPC-based services, REST characteristics require a different
approach to service description. We present the Resource Linking Language
(ReLL) that introduces the concepts of media types, resource types, and link
types as first class citizens for a service description. A proof of concept, a
crawler called RESTler that crawls RESTful services based on ReLL descriptions,
is also presented. Keywords: crawling, rest, soa, web services | |||
| Intelligent ad resizing | | BIBK | Full-Text | 1053-1054 | |
| Anthony P. Badali; Parham Aarabi; Ron Appel | |||
Keywords: image processing, image resizing, internet monetization | |||
| SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement | | BIBAK | Full-Text | 1055-1056 | |
| Raju Balakrishnan; Subbarao Kambhampati | |||
| We consider the problem of deep web source selection and argue that existing
source selection methods are inadequate as they are based on local similarity
assessment. Specifically, they fail to account for the fact that sources can vary
in trustworthiness and individual results can vary in importance. In response,
we formulate a global measure to calculate relevance and trustworthiness of a
source based on agreement between the answers provided by different sources.
Agreement is modeled as a graph with sources at the vertices. On this agreement
graph, source quality scores -- namely SourceRank -- are calculated as the
stationary visit probability of a weighted random walk. Our experiments on
online databases and 675 book sources from Google Base show that SourceRank
improves relevance of the results by 25-40% over existing methods and Google
Base ranking. SourceRank also decreases linearly with the corruption level of
the sources. Keywords: deep web, source selection, source trust, web databases | |||
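The stationary visit probability mentioned above can be approximated by power iteration over the agreement graph, as in the sketch below; the damping factor and the fixed iteration count are our assumptions.

```python
import numpy as np

def source_rank(agreement, damping=0.85, iters=100):
    """Stationary visit probability of a weighted random walk on the agreement
    graph. agreement[i, j]: how strongly source j's answers agree with source i's."""
    n = agreement.shape[0]
    P = agreement / np.maximum(agreement.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1.0 - damping) / n + damping * (P.T @ r)   # PageRank-style update
    return r / r.sum()                                   # SourceRank scores
```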
| Talking about data: sharing richly structured information through blogs and wikis | | BIBAK | Full-Text | 1057-1058 | |
| Edward Benson; Adam Marcus; Fabian Howahl; David Karger | |||
| The web has dramatically enhanced people's ability to communicate ideas,
knowledge, and opinions. But the authoring tools that most people understand,
blogs and wikis, primarily guide users toward authoring text. In this work, we
show that substantial gains in expressivity and communication would accrue if
people could easily share richly structured information in meaningful
visualizations. We then describe several extensions we have created for blogs
and wikis that enable users to publish, share, and aggregate such structured
information using the same workflows they apply to text. In particular, we aim
to preserve those attributes that make blogs and wikis so effective: one-click
access to the information, one-click publishing of content, natural authoring
interfaces, and the ability to easily copy-and-paste information and
visualizations from other sources. Keywords: blogs, linked data, visualization, wikis | |||
| Privacy in dynamic social networks | | BIBAK | Full-Text | 1059-1060 | |
| Smriti Bhagat; Graham Cormode; Balachander Krishnamurthy; Divesh Srivastava | |||
| Anonymization of social networks before they are published or shared has
become an important research question. Recent work on anonymizing social
networks has looked at privacy preserving techniques for publishing a single
instance of the network. However, social networks evolve and a single instance
is inadequate for analyzing the evolution of the social network or for
performing any longitudinal data analysis. We study the problem of repeatedly
publishing social network data as the network evolves, while preserving privacy
of users. Publishing multiple instances of the same network independently has
privacy risks, since stitching the information together may allow an adversary
to identify users in the networks.
We propose methods to anonymize a dynamic network such that the privacy of users is preserved when new nodes and edges are added to the published network. These methods make use of link prediction algorithms to model the evolution of the social network. Using this predicted graph to perform group-based anonymization, the loss in privacy caused by new edges can be reduced. We evaluate the privacy loss on publishing multiple social network instances using our methods. Keywords: dynamic networks, privacy, re-publication | |||
| Finding algorithms in scientific articles | | BIBAK | Full-Text | 1061-1062 | |
| Sumit Bhatia; Prasenjit Mitra; C. Lee Giles | |||
| Algorithms are an integral part of computer science literature. However,
none of the current search engines offers a specialized algorithm search facility.
We describe a vertical search engine that identifies the algorithms present in
documents and extracts and indexes the related metadata and textual description
of the identified algorithms. This algorithm specific information is then
utilized for algorithm ranking in response to user queries. Experimental
results show the superiority of our system over other popular search engines. Keywords: algorithms, pseudocodes, vertical search engines | |||
| Exploiting information redundancy to wring out structured data from the web | | BIBAK | Full-Text | 1063-1064 | |
| Lorenzo Blanco; Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti | |||
| A large number of web sites publish pages containing structured information
about recognizable concepts, but these data are only partially used by current
applications. Although such information is spread across a myriad of sources,
the web scale implies a relevant redundancy. We present a domain independent
system that exploits the redundancy of information to automatically extract and
integrate data from the Web. Our solution concentrates on sources that provide
structured data about multiple instances from the same conceptual domain, e.g.
financial data, product information. Our proposal is based on an original
approach that exploits the mutual dependency between the data extraction and
the data integration tasks. Experiments confirmed the quality and the
feasibility of the approach. Keywords: data extraction, data integration, wrapper generation | |||
| Caching search engine results over incremental indices | | BIBAK | Full-Text | 1065-1066 | |
| Roi Blanco; Edward Bortnikov; Flavio Junqueira; Ronny Lempel; Luca Telloli; Hugo Zaragoza | |||
| A Web search engine must update its index periodically to incorporate
changes to the Web, and we argue in this work that index updates fundamentally
impact the design of search engine result caches. Index updates lead to the
problem of cache invalidation: invalidating cached entries of queries whose
results have changed. To enable efficient invalidation of cached results, we
propose a framework for developing invalidation predictors and some concrete
predictors. Evaluation using Wikipedia documents and a query log from Yahoo!
shows that selective invalidation of cached search results can lower the number
of query re-evaluations by as much as 30% compared to a baseline time-to-live
scheme, while returning results of similar freshness. Keywords: cache, real-time indexing, search engine, search results | |||
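The contrast drawn above between a time-to-live baseline and predictor-driven invalidation can be sketched as a small cache wrapper; the predictor here is a stub, since the paper's concrete predictors depend on index-update statistics not shown.

```python
import time

class ResultCache:
    """Result cache with a TTL baseline and a hook for an invalidation predictor."""

    def __init__(self, ttl=300.0, predictor=None):
        self.ttl = ttl
        # Predictor decides whether an index update invalidates a cached query.
        self.predictor = predictor or (lambda query, updated_docs: False)
        self.store = {}   # query -> (results, cached_at)

    def get(self, query):
        entry = self.store.get(query)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None       # miss: the caller re-evaluates the query

    def put(self, query, results):
        self.store[query] = (results, time.time())

    def on_index_update(self, updated_docs):
        # Selective invalidation: drop only entries the predictor flags as stale.
        for query in list(self.store):
            if self.predictor(query, updated_docs):
                del self.store[query]
```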
| Visual structure-based web page clustering and retrieval | | BIBAK | Full-Text | 1067-1068 | |
| Paul Bohunsky; Wolfgang Gatterbauer | |||
| Clustering and retrieval of web pages dominantly relies on analyzing either
the content of individual web pages or the link structure between them. Some
literature also suggests to use the structure of web pages, notably the
structure of its DOM tree. However, little work considers the visual structure
of web pages for clustering. In this paper (i) we motivate visual
structure-based web page clustering and retrieval for a number of applications,
(ii) we formalize a visual box model-based representation of web pages that
supports new metrics of visual similarity, and (iii) we report on our current
work on evaluating human perception of visual similarity of web pages and
applying the learned visual similarity features to web page clustering and
retrieval. Keywords: similarity measures, web page clustering, web page representation, web page
retrieval | |||
| Efficient resource allocation and power saving in multi-tiered systems | | BIBAK | Full-Text | 1069-1070 | |
| Andrew Caniff; Lei Lu; Ningfang Mi; Ludmila Cherkasova; Evgenia Smirni | |||
| In this paper, we present Fastrack, a parameter-free algorithm for dynamic
resource provisioning that uses simple statistics to promptly distill
information about changes in workload burstiness. This information, coupled
with the application's end-to-end response times and system bottleneck
characteristics, guide resource allocation that shows to be very effective
under a broad variety of burstiness profiles and bottleneck scenarios. Keywords: burstiness, multi-tiered systems, resource allocation | |||
| RankCompete: simultaneous ranking and clustering of web photos | | BIBAK | Full-Text | 1071-1072 | |
| Liangliang Cao; Andrey Del Pozo; Xin Jin; Jiebo Luo; Jiawei Han; Thomas S. Huang | |||
| With the explosive growth of digital cameras and online media, it has become
crucial to design efficient methods that help users browse and search large
image collections. The recent VisualRank algorithm [4] employs visual
similarity to represent the link structure in a graph so that the classic
PageRank algorithm can be applied to select the most relevant images. However,
measuring visual similarity is difficult when the image collection contains
diverse semantics, and the results from VisualRank cannot provide a good,
diverse visual summary. This paper proposes to rank the images in
a structural fashion, which aims to discover the diverse structure embedded in
photo collections, and rank the images according to their similarity among
local neighborhoods instead of across the entire photo collection. We design a
novel algorithm named RankCompete, which generalizes the PageRank algorithm for
the task of simultaneous ranking and clustering. The experimental results show
that RankCompete outperforms VisualRank and provides an efficient yet effective
tool for organizing web photos. Keywords: image ranking, image summarization, pagerank | |||
| Study language models with specific user goals | | BIBAK | Full-Text | 1073-1074 | |
| Rongwei Cen; Yiqun Liu; Min Zhang; Liyun Ru; Shaoping Ma | |||
| Under different language contexts, people choose different terms or phrases
to express their feelings and opinions. When a user is writing a paper or
chatting with a friend, he/she applies a specific language model corresponding
to the underlying goal. This paper presents a log-based study of language
models with specific goals. We present statistical information on terms and
software programs, propose methods to estimate the divergence of language
models with specific user goals, and measure the discrimination of these
models. Experimental results show that language models with different user
goals have large divergence and different discrimination. These findings can
be applied to understanding user needs and improving Human-Computer
Interaction (HCI). Keywords: language model, log analysis, user goal | |||
| Navigational complexity in web interactions | | BIBAK | Full-Text | 1075-1076 | |
| Praphul Chandra; Geetha Manjunath | |||
| As the web grows in size, interfaces and interactions across websites
diverge -- for differentiation and arguably for a better user experience.
However, this size and diversity also impose a cognitive load on the user, who
has to learn a new user interface for every new website she visits. Several
studies have confirmed the importance of well-designed websites. In this paper,
we propose a method for the quantitative evaluation of the navigational
complexity of user interactions on the web. Our approach to quantifying
interaction complexity models the web as a graph and uses the
information-theoretic definition of complexity. It enables us to measure the
navigational complexity of web interaction in bits. Our approach is structural
in nature and can be applied both to the traditional paradigm of web
interaction (browsing) and to emerging paradigms of web interaction like web widgets. Keywords: complexity, graph theory, hypertext, user interaction, widgets | |||
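As an illustration of expressing navigational complexity in bits, here is a minimal Python sketch, not the authors' exact measure: it assumes the site is modeled as a directed graph and that the user chooses uniformly among the out-links of each page, so the bits needed to specify a browsing path are the sum of log2 of the number of choices at each step. The `site_graph` and the path are hypothetical.

```python
import math

site_graph = {               # hypothetical site structure: page -> out-links
    "home":     ["products", "about", "contact"],
    "products": ["laptop", "phone", "tablet", "home"],
    "laptop":   ["buy", "specs", "home"],
}

def path_complexity_bits(graph, path):
    """Bits needed to specify the user's choice at every step of the path."""
    bits = 0.0
    for src, dst in zip(path, path[1:]):
        choices = graph[src]
        assert dst in choices, f"{dst} is not reachable from {src}"
        bits += math.log2(len(choices))   # uniform choice among out-links
    return bits

print(path_complexity_bits(site_graph, ["home", "products", "laptop", "buy"]))
# log2(3) + log2(4) + log2(3) ≈ 5.17 bits
```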
| Transfer learning for behavioral targeting | | BIBAK | Full-Text | 1077-1078 | |
| Tianqi Chen; Jun Yan; Guirong Xue; Zheng Chen | |||
| Recently, Behavioral Targeting (BT) has attracted much attention from both
industry and academia due to its rapid growth in the online advertising market.
Though a basic assumption of BT, namely that users who share similar Web
browsing behaviors have similar preferences over ads, has been empirically
verified, we argue that users' ad click preferences and Web browsing behavior
do not reflect the same user intent, though they are correlated. In this
paper, we propose to formulate BT as a transfer learning problem. We treat
users' preferences over ads and Web browsing behaviors as two different user
behavioral domains and propose to use a transfer learning strategy across
these two domains to segment users for BT ad delivery. We show that some
classical BT solutions can be formulated in a transfer learning view. As an
example, we propose to leverage translated learning, a recently proposed
transfer learning algorithm, to benefit BT ad delivery. Experimental results
on real ad click data show that BT user segmentation by transfer learning can
outperform classical user segmentation strategies by more than 20% in terms
of smoothed ad Click Through Rate (CTR). Keywords: behavioral targeting, transfer learning, user segmentation | |||
| Context-oriented web video tag recommendation | | BIBAK | Full-Text | 1079-1080 | |
| Zhineng Chen; Juan Cao; YiCheng Song; Junbo Guo; Yongdong Zhang; Jintao Li | |||
| Tag recommendation is a common way to enrich the textual annotation of
multimedia content. However, state-of-the-art recommendation methods are built
upon pair-wise tag relevance, which hardly captures the context of a web
video, i.e., who is doing what, when, and where. In this paper we propose the
context-oriented tag recommendation (CtextR) approach, which expands tags for
web videos under a context-consistent constraint. Given a web video, CtextR
first collects the multi-form WWW resources describing the same event as the
video, which produces an informative and consistent context; tag
recommendation is then conducted based on the obtained context. Experiments on
a collection of 80,031 web videos show that CtextR recommends various relevant
tags to web videos. Moreover, the enriched tags improve the performance of web
video categorization. Keywords: context-oriented, social tagging, tag recommendation, web video | |||
| Web-scale knowledge extraction from semi-structured tables | | BIBAK | Full-Text | 1081-1082 | |
| Eric Crestan; Patrick Pantel | |||
| A wealth of knowledge is encoded in the form of tables on the World Wide
Web. We propose a classification algorithm and a rich feature set for
automatically recognizing layout tables and attribute/value tables. We report
the frequencies of these table types over a large analysis of the Web and
propose open challenges for extracting semantic triples (knowledge) from
attribute/value tables. We then describe a solution to a key problem in
extracting semantic triples: protagonist detection, i.e., finding the subject
of the table, which is often not present in the table itself. In 79% of our Web tables, our
method finds the correct protagonist in its top three returned candidates. Keywords: classification, information extraction, structured data, web tables | |||
| Constructing travel itineraries from tagged geo-temporal breadcrumbs | | BIBAK | Full-Text | 1083-1084 | |
| Munmun De Choudhury; Moran Feldman; Sihem Amer-Yahia; Nadav Golbandi; Ronny Lempel; Cong Yu | |||
| Vacation planning is a frequent, laborious task that requires skilled
interaction with a multitude of resources. This paper develops an end-to-end
approach for constructing intra-city travel itineraries automatically by
tapping a latent source reflecting geo-temporal breadcrumbs left by millions of
tourists. In particular, the popular rich media sharing site Flickr allows
photos to be stamped with the date and time at which they were taken, and to be
mapped to Points Of Interest (POIs) by latitude-longitude information as well
as semantic metadata (e.g., tags) that describe them.
Our extensive user study on a "crowd-sourcing" marketplace (Amazon Mechanical Turk), indicates that high quality itineraries can be automatically constructed from Flickr data, when compared against popular professionally generated bus tours. Keywords: flickr, geo-tags, mechanical turk, media applications, orienteering problem,
rich media, travel itinerary | |||
| How much is your personal recommendation worth? | | BIBAK | Full-Text | 1085-1086 | |
| Paul Dütting; Monika Henzinger; Ingmar Weber | |||
| Suppose you buy a new laptop and, simply because you like it so much, you
recommend it to friends, encouraging them to purchase it as well. What would be
an adequate price for the vendor of the laptop to pay for your recommendation?
Personal recommendations like this are of considerable commercial interest, but unlike in sponsored search auctions there can be no truthful prices. Despite this "lack of truthfulness", the vendor of the product might still decide to pay you for the recommendation, e.g. because she wants to (i) provide you with an additional incentive to actually recommend her product or (ii) increase your satisfaction and/or brand loyalty. This leads us to investigate a pricing scheme based on the Shapley value [5] that satisfies certain "axioms of fairness". We find that it is vulnerable to manipulation and show how to overcome these difficulties using the anonymity-proof Shapley value of [4]. Keywords: pricing mechanisms, recommendations, shapley value | |||
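For readers unfamiliar with the Shapley value that this pricing scheme builds on, here is a minimal Python sketch of the classic permutation-based computation: each player's price is their average marginal contribution over all join orders. The coalition values below are hypothetical, and the anonymity-proof variant of [4] is not reproduced.

```python
from itertools import permutations

players = ["you", "friend_a", "friend_b"]

# Hypothetical coalition values: vendor profit for each set of participants.
v = {
    frozenset(): 0,
    frozenset({"you"}): 10,
    frozenset({"friend_a"}): 0,
    frozenset({"friend_b"}): 0,
    frozenset({"you", "friend_a"}): 16,
    frozenset({"you", "friend_b"}): 16,
    frozenset({"friend_a", "friend_b"}): 0,
    frozenset({"you", "friend_a", "friend_b"}): 22,
}

def shapley(players, v):
    """Average marginal contribution of each player over all join orders."""
    shares = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            marginal = v[coalition | {p}] - v[coalition]
            shares[p] += marginal / len(orders)
            coalition = coalition | {p}
    return shares

print(shapley(players, v))   # {'you': 16.0, 'friend_a': 3.0, 'friend_b': 3.0}
```

The shares sum to the value of the grand coalition (22), which is one of the fairness axioms the abstract alludes to.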
| A unified ontology-based web page model for improving accessibility | | BIBAK | Full-Text | 1087-1088 | |
| Ruslan R. Fayzrahmanov; Max C. Göbel; Wolfgang Holzinger; Bernhard Krüpl; Robert Baumgartner | |||
| Fast technological advancements and little compliance with accessibility
standards by Web page authors pose serious obstacles to the Web experience of
the blind user. We propose a unified Web document model that enables us to
create a richer browsing experience and improved navigability for blind users.
The model provides an integrated view on all aspects of a Web page and is
leveraged to create a multi-axial user interface. Keywords: web accessibility | |||
| Query parsing in mobile voice search | | BIBAK | Full-Text | 1089-1090 | |
| Junlan Feng | |||
| Mobile voice search is a fast-growing business. It provides users with an
easier way to search for information by voice from mobile devices. In this
paper, we describe a statistical approach to query parsing to ensure search
effectiveness. The task is to segment speech recognition (ASR) output,
including ASR 1-Best and ASR word lattices, into segments and associate each
segment with the concepts needed in the application. We train models including
concept prior probability, query segment generation probability, and query
subject probability from application data such as query logs and the source
database. We apply the learned models to a mobile business search application
and demonstrate the robustness of query parsing to ASR errors. Keywords: mobile voice search, query parsing | |||
| RDF compression: basic approaches | | BIBAK | Full-Text | 1091-1092 | |
| Javier D. Fernández; Claudio Gutierrez; Miguel A. Martínez-Prieto | |||
| This paper studies the compressibility of RDF data sets. We show that big
RDF data sets are highly compressible due to the power-law structure of RDF
graphs, the organization of URIs, and the verbosity of RDF syntax. We present
basic approaches to compress RDF data and test them with three well-known, real-world
RDF data sets. Keywords: compression, rdf, semantic web | |||
| How google analytics and conventional cookie tracking techniques overestimate unique visitors | | BIBAK | Full-Text | 1093-1094 | |
| Max I. Fomitchev | |||
| We report the results of an analysis of website traffic logs and argue that
both unique IP addresses and cookies vastly overestimate unique visitors, e.g. by
a factor of 8 in our studies. Google Analytics' 'absolute unique visitors' measure
is shown to produce a similar 6x overestimation. To address the problem we
present a new model for relating unique visitors to IP addresses or cookies. Keywords: cookies, traffic logs, unique visitors, web, web analytics | |||
| Shout out: integrating news and reader comments | | BIBAK | Full-Text | 1095-1096 | |
| Lisa M. Gandy; Nathan D. Nichols; Kristian J. Hammond | |||
| A useful approach for enabling computers to automatically create new content
is utilizing the text, media, and information already present on the World Wide
Web. The newly created content is known as "machine-generated content". For
example, a machine-generated content system may create a multimedia news show
with two animated anchors presenting a news story; one anchor reads the news
story with text taken from an existing news article, and the other anchor
regularly interrupts with his or her own opinion about the story. In this
paper, we present such a system, and describe in detail its strategy for
autonomously extracting and selecting the opinions given by the second anchor. Keywords: cosine similarity, emotional valence detection, machine-generated content | |||
| A framework for trust establishment and assessment on the web of data | | BIBAK | Full-Text | 1097-1098 | |
| Qi Gao; Geert-Jan Houben | |||
| With the enormous and still growing amount of data and user interaction on
the Web, it becomes more and more necessary for data consumers to be able to
assess the trustworthiness of data on the Web. In this paper, we present a
framework for trust establishment and assessment on the Web of Data. Unlike
many approaches that build trust metrics within networks of people, we
propose a model that represents the trust in concrete pieces of web data for a
specific consumer, and in which the context is also considered. Furthermore, we
provide three strategies for trust assessment based on the principles of Linked
Data, to overcome the shortcomings of the conventional web of documents, namely
the lack of semantics and interlinking. Keywords: linked data, semantic web, trust, web of data | |||
| On the high density of leadership nuclei in endorsement social networks | | BIBAK | Full-Text | 1099-1100 | |
| Guillermo Garrido; Francesco Bonchi; Aristides Gionis | |||
| In this paper we study the community structure of endorsement networks,
i.e., social networks in which a directed edge u ⇒ v asserts an
action of support from user u to user v. Examples include scenarios in which a
user u is favoring a photo, liking a post, or following the microblog of user
v.
Starting from the hypothesis that the footprint of a community in an endorsement network is a bipartite directed clique from a set of followers to a set of leaders, we apply frequent itemset mining techniques to discover such bicliques. Our analysis of real networks reveals an interesting phenomenon: the leaders of a community endorse each other, forming a very dense nucleus. Keywords: communities, endorsement social networks | |||
| Measuring the web crawler ethics | | BIBAK | Full-Text | 1101-1102 | |
| C. Lee Giles; Yang Sun; Isaac G. Councill | |||
| Web crawlers are highly automated and seldom regulated manually. The
diversity of crawler activities often leads to ethical problems such as spam
and service attacks. In this research, quantitative models are proposed to
measure web crawler ethics based on crawler behavior on web servers. We
investigate and define rules to measure crawler ethics, referring to the extent
to which web crawlers respect the regulations set forth in robots.txt
configuration files. We propose a vector space model to represent crawler
behavior and measure the ethics of web crawlers based on the behavior vectors.
The results show that ethicality scores vary significantly among crawlers. Most
commercial web crawlers' behaviors are ethical. However, many commercial
crawlers still consistently violate or misinterpret certain robots.txt rules.
We also measure the ethics of big search engine crawlers in terms of return on
investment. The results show that Google has a higher score than other search
engines for a US website but has a lower score than Baidu for Chinese websites. Keywords: ethicality, privacy, robots.txt, web crawler ethics | |||
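A minimal Python sketch of one concrete ethicality check in this spirit, not the authors' vector space model: test whether logged crawler requests respect a site's robots.txt, using only the standard library. The robots.txt content and the log entries are hypothetical.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Hypothetical access-log entries: (user-agent, requested URL)
requests = [
    ("GoodBot",   "http://example.org/index.html"),
    ("GreedyBot", "http://example.org/private/data.html"),
]

for agent, url in requests:
    allowed = rp.can_fetch(agent, url)
    print(f"{agent:9s} -> {url}: {'ok' if allowed else 'robots.txt violation'}")
```

A behavior vector for a crawler could then count such violations per rule type, which is roughly the kind of signal the ethicality scores aggregate.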
| SNDocRank: document ranking based on social networks | | BIBAK | Full-Text | 1103-1104 | |
| Liang Gou; Hung-Hsuan Chen; Jung-Hyun Kim; Xiaolong (Luke) Zhang; C. Lee Giles | |||
| To improve search results for socially-connected users, we propose a
ranking framework, Social Network Document Rank (SNDocRank). This framework
considers both document contents and the similarity between a searcher and
document owners in a social network and uses a Multi-level Actor Similarity
(MAS) algorithm to efficiently calculate user similarity in a social network.
Our experiment results based on YouTube data show that compared with the tf-idf
algorithm, the SNDocRank method returns more relevant documents of interest.
Our findings suggest that in this framework, a searcher can improve search by
joining larger social networks, having more friends, and connecting to larger
local communities in a social network. Keywords: information retrieval, multilevel actor similarity, ranking, social networks | |||
| Exploiting content redundancy for web information extraction | | BIBAK | Full-Text | 1105-1106 | |
| Pankaj Gulhane; Rajeev Rastogi; Srinivasan H. Sengamedu; Ashwin Tengli | |||
| We propose a novel extraction approach that exploits content redundancy on
the web to extract structured data from template-based web sites. We start by
populating a seed database with records extracted from a few initial sites. We
then identify values within the pages of each new site that match attribute
values contained in the seed set of records. To filter out noisy attribute
value matches, we exploit the fact that attribute values occur at fixed
positions within template-based sites. We develop an efficient Apriori-style
algorithm to systematically enumerate attribute position configurations with
sufficient matching values across pages. Finally, we conduct an extensive
experimental study with real-life web data to demonstrate the effectiveness of
our extraction approach. Keywords: content redundancy, information extraction | |||
| Exploring searcher interactions for distinguishing types of commercial intent | | BIBAK | Full-Text | 1107-1108 | |
| Qi Guo; Eugene Agichtein | |||
| An improved understanding of the relationship between search intent, result
quality, and searcher behavior is crucial for improving the effectiveness of
web search. While recent progress in user behavior mining has been largely
focused on aggregate server-side click logs, we present a new search behavior
model that incorporates fine-grained user interactions with the search results.
We show that mining these interactions, such as mouse movements and scrolling,
can enable more effective detection of the user's search intent. Potential
applications include automatic search evaluation, improving search ranking,
result presentation, and search advertising. As a case study, we report results
on distinguishing between "research" and "purchase" variants of commercial
intent, which show our method to be more effective than the current
state-of-the-art. Keywords: search behavior modeling, user intent inference | |||
| Deep mashup: a description-based framework for lightweight integration of web contents | | BIBAK | Full-Text | 1109-1110 | |
| Hao Han; Junxia Guo; Takehiro Tokuda | |||
| In this paper, we present a description-based mashup framework for
lightweight integration of Web content. Our implementation shows that we can
easily integrate not only Web services but also Web applications, and even
Web content dynamically generated by client-side scripts. Keywords: mashup, web application, web feed, web service | |||
| Estimation of user interest in visited web page | | BIBAK | Full-Text | 1111-1112 | |
| Michal Holub; Maria Bielikova | |||
| Nowadays web portals contain a large amount of information that is intended for
various visitors or groups of visitors. To let users navigate the content
effectively, the website needs to "know" its users in order to provide them
with personalized content. We propose a method for automatically estimating a
user's interest in a web page he or she visits. This estimation is used for the
recommendation of web portal pages (through presenting adaptive links) that the
user might like. We conducted a series of experiments in the domain of our
faculty web portal to evaluate the proposed approach. Keywords: adaptive navigation support, link recommendation, user behaviour, user
interest estimation | |||
| Hierarchical feature selection for ranking | | BIBAK | Full-Text | 1113-1114 | |
| Guichun Hua; Min Zhang; Yiqun Liu; Shaoping Ma; Liyun Ru | |||
| Ranking is an essential part of information retrieval (IR) tasks such as Web
search. Nowadays there are hundreds of features for ranking, so learning to
rank (LTR), an interdisciplinary field of IR and machine learning (ML), has
attracted increasing attention. The features used in IR are not always
independent of each other; hence feature selection, an important issue in
ML, deserves attention in LTR. However, state-of-the-art LTR
approaches rarely analyze the connections among features from the perspective of
feature selection. In this paper, we propose a two-phase hierarchical feature
selection strategy for ranking and learn ranking functions. The
experimental results show that ranking functions based on the selected feature
subset significantly outperform the ones based on all features. Keywords: feature selection, learning to rank | |||
| Selectivity estimation for SPARQL graph pattern | | BIBAK | Full-Text | 1115-1116 | |
| Hai Huang; Chengfei Liu | |||
| This paper focuses on selectivity estimation for SPARQL graph patterns,
which is crucial to RDF query optimization. Previous work makes the join
uniformity assumption, which leads to highly inaccurate estimates in
cases where properties in SPARQL graph patterns are correlated. We take into
account the dependencies among properties in SPARQL graph patterns and propose
a more accurate estimation model. We first focus on two common SPARQL graph
patterns (star and chain patterns) and propose a Bayesian network and a
chain histogram for estimating their selectivity. Then, for an arbitrary
composite SPARQL graph pattern, we maximally combine the precomputed results
of the star and chain patterns. The experiments show that our method
outperforms existing approaches in accuracy. Keywords: rdf query processing, selectivity estimation | |||
| Yet another paper ranking algorithm advocating recent publications | | BIBAK | Full-Text | 1117-1118 | |
| Won-Seok Hwang; Soo-Min Chae; Sang-Wook Kim; Gyun Woo | |||
| In this paper, we propose a new paper ranking algorithm that gives a high
rank to papers that are cited by other authoritative papers or published in
premier conferences or journals. The proposed algorithm also addresses the
problem that recent papers are rated poorly because they have few citations. Keywords: citation analysis, impact factor, pagerank, ranking | |||
| Enabling entity-based aggregators for web 2.0 data | | BIBAK | Full-Text | 1119-1120 | |
| Ekaterini Ioannou; Claudia Niederée; Yannis Velegrakis | |||
| Selecting and presenting content culled from multiple heterogeneous and
physically distributed sources is a challenging task. The exponential growth of
web data in modern times has brought new requirements to such integration
systems. Data is no longer produced by content providers alone, but also
by regular users through the highly popular Web 2.0 social and semantic web
applications. The plethora of available web content increased demand from
regular users, who could no longer wait for the development of advanced
integration tools. They wanted to be able to build their own specialized
integration applications in a short time. Aggregators came to the rescue of
these users. They allowed them not only to combine distributed content, but
also to process it in ways that generate new services available for further
consumption.
To cope with the heterogeneous data, the Linked Data initiative aims at the creation and exploitation of correspondences across data values. In this work, although we share the Linked Data community vision, we advocate that for the modern web, linking at the data value level is not enough. Aggregators should base their integration tasks on the concept of an entity, i.e., identifying whether different pieces of information correspond to the same real world entity, such as an event or a person. We describe our theory, system, and experimental results that illustrate the approach's effectiveness. Keywords: entity matching, linked data, semantic web | |||
| Antourage: mining distance-constrained trips from flickr | | BIBAK | Full-Text | 1121-1122 | |
| Saral Jain; Stephan Seufert; Srikanta Bedathur | |||
| We study how to automatically extract tourist trips from large volumes of
geo-tagged photographs. Working with more than 8 million of these photographs
that are publicly available via photo-sharing communities such as Flickr and
Panoramio, our goal is to satisfy the needs of a tourist who specifies a
starting location (typically a hotel) together with a bounded travel distance
and demands a tour that visits the popular sites along the way. Our system,
named ANTOURAGE, solves this intractable problem using a novel adaptation of
the max-min ant system (MMAS) meta-heuristic. Experiments using GPS metadata
crawled from Flickr show that ANTOURAGE can generate high-quality tours. Keywords: flickr, geolocation, graph mining, photo collections, trip planning | |||
| Analyzing collective view of future, time-referenced events on the web | | BIBAK | Full-Text | 1123-1124 | |
| Adam Jatowt; Hideki Kawai; Kensuke Kanazawa; Katsumi Tanaka; Kazuo Kunieda; Keiji Yamada | |||
| Humans have always desired to guess the future in order to adapt their
behavior and maximize chances of success. In this paper, we conduct exploratory
analysis of future-related information on the web. We focus on the
future-related information which is grounded in time, that is, the information
on forthcoming events whose expected occurrence dates are already known. We
collect data by crawling a search engine index and analyze the collective view
of future, time-referenced events discussed on the web. Keywords: collective prediction, future-related information, opinion analysis | |||
| Optimizing two stage bigram language models for IR | | BIBAK | Full-Text | 1125-1126 | |
| Sara Javanmardi; Jianfeng Gao; Kuansan Wang | |||
| Although higher-order language models (LMs) have shown the benefit of capturing
word dependencies for information retrieval (IR), the tuning of the increased
number of free parameters remains a formidable engineering challenge.
Consequently, in many real world retrieval systems, applying higher order LMs
is an exception rather than the rule. In this study, we address the parameter
tuning problem using a framework based on a linear ranking model in which
different component models are incorporated as features. Using unigram and
bigram LMs with two-stage smoothing as examples, we show that our method leads to
a bigram LM that significantly outperforms its unigram counterpart and the
well-tuned BM25 model. Keywords: bigram LM, parameter tuning, retrieval model | |||
| The utility of tweeted URLs for web search | | BIBAK | Full-Text | 1127-1128 | |
| Vasileios Kandylas; Ali Dasdan | |||
| Microblogging as introduced by Twitter is becoming a source for tracking
real-time news. Although identifying the highest-quality or most useful posts
or tweets from Twitter for breaking news is still an open problem, major web
search engines seem convinced of the value of such posts and have already
started allocating part of their search results pages to them. In this paper,
we study a different aspect of the problem for a search engine: instead of the
value of the posts, we study the value of the (shortened) URLs referenced in
these posts. Our results indicate that unlike frequently bookmarked URLs, which
are generally of high quality, frequently tweeted URLs tend to fall in two
opposite categories: they are either high in quality, or they are spam.
Identifying the quality category of a URL is not trivial, but the combination
of characteristics can reveal some trends. Keywords: content quality, microblogging, shortened urls, twitter | |||
| Trend detection model | | BIBAK | Full-Text | 1129-1130 | |
| Noriaki Kawamae; Ryuichiro Higashinaka | |||
| This paper presents a topic model that detects topic distributions over
time. Our proposed model, the Trend Detection Model (TDM), introduces a latent trend
class variable into each document. The trend class has a probability
distribution over topics and a continuous distribution over time. Experiments
using our data set show that TDM is useful as a generative model in the
analysis of the evolution of trends. Keywords: dynamic topic model, latent variable modeling, trend analysis | |||
| Unsupervised query segmentation using click data: preliminary results | | BIBAK | Full-Text | 1131-1132 | |
| Julia Kiseleva; Qi Guo; Eugene Agichtein; Daniel Billsus; Wei Chai | |||
| We describe preliminary results of experiments with an unsupervised
framework for query segmentation, transforming keyword queries into structured
queries. The resulting queries can be used to more accurately search product
databases, and potentially improve result presentation and query suggestion.
The key to developing an accurate and scalable system for this task is to train
a query segmentation or attribute detection system over labeled data, which can
be acquired automatically from query and click-through logs. The main
contribution of our work is a new method to automatically acquire such training
data -- resulting in significantly higher segmentation performance, compared to
previously reported methods. Keywords: attribute extraction, query segmentation, structured queries | |||
| Detecting epidemic tendency by mining search logs | | BIBAK | Full-Text | 1133-1134 | |
| Weize Kong; Yiqun Liu; Shaoping Ma; Liyun Ru | |||
| We consider the problem of detecting epidemic tendencies by mining search
logs. We propose an algorithm based on click-through information to select
epidemic-related queries/terms. We adopt linear regression to model the
relationship between epidemic occurrences and the frequencies of
epidemic-related terms (ERTs) in search logs. The results show our algorithm is
effective in finding ERTs that correlate highly with epidemic occurrences. We
also find that the proposed method performs better when combining different
ERTs than when using a single ERT. Keywords: epidemic detection, query, search log mining | |||
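A minimal Python sketch of the regression step, assuming weekly ERT frequencies and case counts have already been extracted (the click-through-based term selection is not shown); all numbers are hypothetical.

```python
import numpy as np

# Hypothetical weekly frequencies of two epidemic-related terms (columns)
ert_freqs = np.array([[120,  40],
                      [150,  55],
                      [300, 120],
                      [420, 180],
                      [380, 160]], dtype=float)
cases = np.array([10, 14, 33, 45, 41], dtype=float)  # hypothetical case counts

# Fit cases ≈ X @ w with an intercept column, by least squares
X = np.hstack([ert_freqs, np.ones((len(cases), 1))])
w, *_ = np.linalg.lstsq(X, cases, rcond=None)

predicted = X @ w
corr = np.corrcoef(predicted, cases)[0, 1]
print("weights:", w, "correlation:", round(corr, 3))
```

Combining several ERTs as separate regressors, as above, is what lets the combined model correlate more strongly with case counts than any single term.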
| An information retrieval approach to spelling suggestion | | BIBAK | Full-Text | 1135-1136 | |
| Sai Krishna; Prasad Pingali; Vasudeva Varma | |||
| In this paper, we present a two-step language-independent spelling
suggestion system. In the first step, candidate suggestions are generated using
an Information Retrieval (IR) approach. In step two, candidate suggestions are
re-ranked using a new string similarity measure that uses the length of the
longest common substrings occurring at the beginning and end of the words. We
obtained very impressive results by re-ranking candidate suggestions using the
new similarity measure. The accuracy of the first suggestion is 92.3%, 90.0% and
83.5% for the Dutch, Danish and Bulgarian datasets, respectively. Keywords: information retrieval, language independent, spelling suggestion | |||
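A minimal Python sketch in the spirit of the described re-ranking measure: candidates are scored by the combined length of the common prefix and common suffix they share with the misspelling. The normalization and the word lists are illustrative, not the authors' exact formulation.

```python
def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def common_suffix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def similarity(misspelling, candidate):
    # Shared characters at the start and end, normalized by the longer word
    shared = common_prefix_len(misspelling, candidate) + common_suffix_len(misspelling, candidate)
    return shared / max(len(misspelling), len(candidate))

query = "definately"
candidates = ["definitely", "defiantly", "define"]
print(sorted(candidates, key=lambda c: similarity(query, c), reverse=True))
# ['definitely', 'defiantly', 'define']
```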
| Finding influentials based on the temporal order of information adoption in twitter | | BIBAK | Full-Text | 1137-1138 | |
| Changhyun Lee; Haewoon Kwak; Hosung Park; Sue Moon | |||
| Twitter offers an explicit mechanism to facilitate information diffusion and
has emerged as a new medium for communication. Many approaches to find
influentials have been proposed, but they do not consider the temporal order of
information adoption. In this work, we propose a novel method to find
influentials by considering both the link structure and the temporal order of
information adoption in Twitter. Our method finds distinct influentials who are
not discovered by other methods. Keywords: influentials, information diffusion, ranking, social networks, twitter | |||
| The Social Honeypot Project: protecting online communities from spammers | | BIBAK | Full-Text | 1139-1140 | |
| Kyumin Lee; James Caverlee; Steve Webb | |||
| We present the conceptual framework of the Social Honeypot Project for
uncovering social spammers who target online communities and initial empirical
results from Twitter and MySpace. Two of the key components of the Social
Honeypot Project are: (1) The deployment of social honeypots for harvesting
deceptive spam profiles from social networking communities; and (2) Statistical
analysis of the properties of these spam profiles for creating spam classifiers
to actively filter out existing and new spammers. Keywords: social honeypots, social media, spam | |||
| Pusic: musicalize microblog messages for summarization and exploration | | BIBAK | Full-Text | 1141-1142 | |
| Cheng-Te Li; Hung-Che Lai; Chien-Tung Ho; Chien-Lin Tseng; Shou-De Lin | |||
| Micro-blogging services provide platforms for users to share their feelings
and ideas on the go. Although micro-blogging services are designed to produce
information streams and are capable of recording rich and diverse senses, they
still suffer from the drawback of not being able to provide deeper, summarized
views. In this paper, we present a novel framework, Pusic, to
musicalize micro-blogging messages for terms or users. Pusic can be used to (1)
summarize users' messages into certain expression of emotions, (2) explore the
emotions and senses and transform them into music, and (3) serve as a
presentation of crowd net art. We generate the music from two aspects: emotion
and harmony. The former is tackled by emotion detection from messages while the
latter is established by rule-based harmonic heuristics according to the
detected emotions. Pusic has been released online for public use and
further investigation. Keywords: emotion, exploration, micro-blogging, music, summarization | |||
| Keyword extraction for social snippets | | BIBAK | Full-Text | 1143-1144 | |
| Zhenhui Li; Ding Zhou; Yun-Fang Juan; Jiawei Han | |||
| Today, a huge amount of text is being generated for social purposes on
social networking services on the Web. Unlike traditional documents, such text
is usually extremely short and tends to be informal. Analysis of such text
benefits many applications such as advertising, search, and content filtering.
In this work, we study one traditional text mining task on this new form of
text, namely the extraction of meaningful keywords. We propose several intuitive
yet useful features and experiment with various classification models.
Evaluation is conducted on Facebook data. The performance of the various
features and models is reported and compared. Keywords: keyword extraction, online advertising, social snippets | |||
| Entity relation discovery from web tables and links | | BIBAK | Full-Text | 1145-1146 | |
| Cindy Xide Lin; Bo Zhao; Tim Weninger; Jiawei Han; Bing Liu | |||
| The World-Wide Web consists not only of a huge number of unstructured texts,
but also of a vast amount of valuable structured data. Web tables [2] are a
typical type of structured information that are pervasive on the web, and
Web-scale methods that automatically extract web tables have been studied
extensively [1]. Many powerful systems (e.g. OCTOPUS [4], Mesa [3]) use
extracted web tables as a fundamental component.
In the database vernacular, a table is defined as a set of tuples which have the same attributes. Similarly, a web table is defined as a set of rows (corresponding to database tuples) which have the same column headers (corresponding to database attributes). Therefore, to extract a web table is to extract a relation on the web. In databases, tables often contain foreign keys which refer to other tables. Therefore, it follows that hyperlinks inside a web table sometimes function as foreign keys to other relations whose tuples are contained in the hyperlink's target pages. In this paper, we explore this idea by asking: can we discover new attributes for web tables by exploring hyperlinks inside web tables? This poster proposes a solution that takes a web table as input. Frequent patterns are generated as new candidate relations by following hyperlinks in the web table. The confidence of candidates is evaluated, and trustworthy candidates are selected to become new attributes for the table. Finally, we show the usefulness of our method by performing experiments on a variety of web domains. Keywords: entity relation discovery, link, web table | |||
| Identifying featured articles in wikipedia: writing style matters | | BIBAK | Full-Text | 1147-1148 | |
| Nedim Lipka; Benno Stein | |||
| Wikipedia provides an information quality assessment model with criteria for
human peer reviewers to identify featured articles. For this classification
task "Is an article featured or not?" we present a machine learning approach
that exploits an article's character trigram distribution. Our approach differs
from existing research in that it targets writing style rather than evaluating
meta-features like the edit history. The approach is robust, straightforward to
implement, and outperforms existing solutions. We underpin these claims with an
experimental design in which, among other things, domain transferability is analyzed.
The achieved performances in terms of the F-measure for featured articles are
0.964 within a single Wikipedia domain and 0.880 in a domain transfer
situation. Keywords: domain transfer, information quality, wikipedia | |||
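A minimal Python sketch of the general idea, character trigram features feeding a standard classifier, assuming scikit-learn is available; the toy articles, labels, and the choice of a linear SVM are illustrative and not the authors' exact features or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

articles = [
    "The species was first described in 1758 by Carl Linnaeus in his seminal work.",
    "this page is kinda short lol, someone should add more stuff",
    "Subsequent studies confirmed the taxonomy proposed in earlier surveys of the genus.",
    "i think its a cool topic but idk much about it tbh",
]
labels = [1, 0, 1, 0]   # 1 = featured-quality, 0 = not (hypothetical labels)

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),  # character trigram features
    LinearSVC(),
)
model.fit(articles, labels)
print(model.predict(["The genus comprises several species described by later naturalists."]))
```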
| Retagging social images based on visual and semantic consistency | | BIBAK | Full-Text | 1149-1150 | |
| Dong Liu; Xian-Sheng Hua; Meng Wang; Hong-Jiang Zhang | |||
| The tags on social media websites such as Flickr are frequently imprecise
and incomplete, thus there is still a gap between these tags and the actual
content of the images. This paper proposes a social image "retagging" scheme
that aims at assigning images with better content descriptors. The refining
process is formulated as an optimization framework based on the consistency
between "visual similarity" and "semantic similarity" in social images. An
effective iterative bound optimization algorithm is applied to learn the
optimal tag assignment. In addition, as many tags are intrinsically not
closely-related to the visual content of the images, we employ a
knowledge-based method to differentiate visual content related from unrelated
tags and then constrain the tagging vocabulary of our automatic algorithm
within the content related tags. Experimental results on a Flickr image
collection demonstrate the effectiveness of this approach. Keywords: image tagging, tag refinement, retagging | |||
| Research trails: getting back where you left off | | BIBAK | Full-Text | 1151-1152 | |
| Jiahui Liu; Peter Jin Hong; Elin Rønby Pedersen | |||
| In this paper, we present a prototype system that helps users in early-stage
web research to create and reestablish context across a fragmented work process,
without requiring them to explicitly collect and organize the material they
visit. The system clusters a user's web history and shows it as research
trails. We present two user interaction models with the research trails. The
first interaction model is implemented as a standalone application, which
presents a hierarchical view of research trails. The second interaction model
is integrated with the web browser. It shows the user's research trails as
selectable and manipulable visual streams when they open a new tab. Thereby,
the NewTab page serves as a springboard in the browser for a user resuming an
ongoing task. Keywords: browser, early-stage research, research trail | |||
| An efficient random access inverted index for information retrieval | | BIBAK | Full-Text | 1153-1154 | |
| Xiaozhu Liu; Zhiyong Peng | |||
| To improve query performance and space efficiency, an efficient random
access blocked inverted index (RABI) is proposed. RABI divides an inverted list
into blocks and compresses different parts of each block with corresponding
encoding methods to decrease space consumption. RABI provides fast addressing
and random access on the compressed blocked inverted index through a novel
hybrid compression method, which supports both block-level and intra-block
skipping and further enhances both space and time efficiency without inserting
any additional auxiliary information. Experimental results show that RABI
achieves both high space efficiency and high search efficiency, and
significantly outperforms the existing approach. Keywords: information retrieval, inverted index, random access | |||
| Structure-aware music resizing using lyrics | | BIBAK | Full-Text | 1155-1156 | |
| Zhang Liu; Chaokun Wang; Jianmin Wang; Wei Zheng; Shengfei Shi | |||
| The World Wide Web provides plenty of multimedia resources for creating rich
media web applications. However, collected music and other media resources
often mismatch in time length. Existing music resizing
approaches suffer from perceptual artifacts that degrade the quality of the
resized music. In this paper, a novel structure-aware music resizing approach
is proposed. Through lyrics analysis, our approach can compress different parts
of a music piece at varying compression rates. Experimental results show that
the proposed method can effectively generate resized songs with good quality. Keywords: music resizing, rich media, time-scale compression | |||
| HTTP database connector (HDBC): RESTful access to relational databases | | BIBAK | Full-Text | 1157-1158 | |
| Alexandros Marinos; Erik Wilde; Jiannan Lu | |||
| Relational databases hold a vast quantity of information and making them
accessible to the web is a big challenge. There is a need to make these
databases accessible with as little difficulty as possible, opening them up to
the power and serendipity of the Web. Our work presents a series of patterns
that bridge the relational database model with the architecture of the Web
along with an implementation of some of them. The aim is for relational
databases to be made accessible with no intermediate steps and no extra
metadata required. This approach can vastly increase the data available on the
web, therefore making the Web itself all the more powerful, while enabling its
users to seamlessly perform tasks that previously required bridging multiple
domains and paradigms or were not possible. Keywords: relational databases, rest, web architecture | |||
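A minimal Python sketch of the general REST-over-relational pattern, not the HDBC implementation itself, assuming Flask and SQLite; the `products` table and URL layout are hypothetical.

```python
import sqlite3
from flask import Flask, jsonify, abort

DB = "demo.db"  # hypothetical database file
with sqlite3.connect(DB) as con:
    con.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    con.execute("INSERT OR IGNORE INTO products VALUES (1, 'laptop', 799.0)")

app = Flask(__name__)

@app.route("/products/<int:row_id>", methods=["GET"])
def get_product(row_id):
    # GET /products/1 -> the row with primary key 1, serialized as JSON
    con = sqlite3.connect(DB)
    con.row_factory = sqlite3.Row
    row = con.execute("SELECT * FROM products WHERE id = ?", (row_id,)).fetchone()
    con.close()
    if row is None:
        abort(404)              # unknown resource -> HTTP 404
    return jsonify(dict(row))

if __name__ == "__main__":
    app.run()
```

The point of the pattern is that a row becomes a web resource with a stable URL, so no extra metadata layer is needed to publish the data.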
| Detecting communities from tripartite networks | | BIBAK | Full-Text | 1159-1160 | |
| Tsuyoshi Murata | |||
| Online social media such as delicious and digg are represented as tripartite
networks whose vertices are users, tags, and resources. Detecting communities
from such tripartite networks is practically important. Modularity is often
used as the criterion for evaluating the goodness of network divisions into
communities. For tripartite networks, Neubauer defines a tripartite modularity
which extends Murata's bipartite modularity. However, Neubauer's tripartite
modularity still uses projections and therefore loses information that the
original tripartite networks contain. This paper proposes a new tripartite
modularity that does not use projections. Experimental results show that
better community structures can be detected by optimizing our tripartite
modularity. Keywords: community, modularity, tripartite network | |||
| INEX+DBPEDIA: a corpus for semantic search evaluation | | BIBAK | Full-Text | 1161-1162 | |
| Jose R. Perez-Aguera; Javier Arroyo; Jane Greenberg; Joaquin Perez-Iglesias; Victor Fresno | |||
| This paper presents a new collection based on DBpedia and INEX for
evaluating semantic search performance. The proposed corpus is used to
calculate the impact of considering a document's structure on the retrieval
performance of the Lucene and BM25 ranking functions. Results show that BM25
outperforms Lucene in all the considered metrics and that there is room for
future improvements, which may be obtained using a hybrid approach combining
both semantic technology and information retrieval ranking functions. Keywords: evaluation, semantic search | |||
| Protecting data in multi-stakeholder web service system | | BIBAK | Full-Text | 1163-1164 | |
| Tan Phan; Jun Han; Garth Heward; Steve Versteeg | |||
| Current Web Service security standards have inadequate support for
end-to-end protection of data when some receivers of the data are unknown to
the sender. This paper presents an approach to aid collaborative partner
services in properly protecting each other's data. Our approach allows each
partner to derive an adequate protection mechanism with minimum performance
overhead for each message it sends based on those of the corresponding messages
it receives. We modify the message handling mechanisms of Web Service engines
to dynamically gather protection requirements for a given outgoing message by
aggregating requirements from original owners of message data. Keywords: data security, propagation, protection mechanism | |||
| Constructing folksonomies by integrating structured metadata | | BIBAK | Full-Text | 1165-1166 | |
| Anon Plangprasopchok; Kristina Lerman; Lise Getoor | |||
| Aggregating many personal hierarchies into a common taxonomy, also known as
a folksonomy, presents several challenges due to its sparseness, ambiguity,
noise, and inconsistency. We describe an approach to folksonomy learning based
on relational clustering that addresses these challenges by exploiting
structured metadata contained in personal hierarchies. Our approach clusters
similar hierarchies using their structure and tag statistics, then
incrementally weaves them into a deeper, bushier tree. We study folksonomy
learning using social metadata extracted from the photo-sharing site Flickr. We
evaluate the learned folksonomy quantitatively by automatically comparing it to
a reference taxonomy created by the Open Directory Project. Our empirical
results suggest that the proposed approach improves upon the state-of-the-art
folksonomy learning method. Keywords: collective knowledge, data mining, folksonomies | |||
| Semantic lexicon adaptation for use in query interpretation | | BIBAK | Full-Text | 1167-1168 | |
| Ana Maria Popescu; Patrick Pantel; Gilad Mishne | |||
| We describe improvements to the use of semantic lexicons by a
state-of-the-art query interpretation system powering a major search engine. We
successfully compute concept label importance information for lexicon strings;
lexicon augmentation with such information leads to a 6.4% precision increase
on affected queries with no query coverage loss. Finally, lexicon filtering
based on label importance leads to a 13% precision increase, but at the expense
of query coverage. Keywords: lexicon adaptation, word sense disambiguation | |||
| Towards comment-based cross-media retrieval | | BIBAK | Full-Text | 1169-1170 | |
| Martin Potthast; Benno Stein; Steffen Becker | |||
| This paper investigates whether Web comments can be exploited for
cross-media retrieval. Comparing Web items such as texts, images, videos,
music, products, or personal profiles can be done at various levels of detail;
our focus is on topic similarity. We propose to compare user-supplied comments
on Web items in lieu of the commented items themselves. If this approach is
feasible, the task of extracting and mapping features between arbitrary pairs
of item types can be circumvented, and well-known text retrieval models can be
applied instead -- given that comments are available. We report on results of a
preliminary, but nonetheless large-scale experiment which shows that, if
comments on textual items are compared with comments on video items, topically
similar pairs achieve a sufficiently high cross-media similarity. Keywords: cross-media retrieval, web comments | |||
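A minimal Python sketch of the core idea, comparing items of different media types through the text of their user comments with a standard retrieval model (TF-IDF plus cosine similarity via scikit-learn); the comments are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article_comments = "great analysis of the financial crisis and the bank bailouts"
video_comments = [
    "this documentary about the financial crisis explains the bailouts well",
    "cute cat video, the kitten falls asleep halfway through",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([article_comments] + video_comments)

# Similarity of the text item's comments to each video's comments
sims = cosine_similarity(vectors[0], vectors[1:])
print(sims)   # the crisis documentary should score far higher than the cat video
```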
| Inferring query intent from reformulations and clicks | | BIBAK | Full-Text | 1171-1172 | |
| Filip Radlinski; Martin Szummer; Nick Craswell | |||
| Many researchers have noted that web search queries are often ambiguous or
unclear. We present an approach for identifying the popular meanings of queries
using web search logs and user click behavior. We show our approach to produce
more complete and user-centric intents than expert judges by evaluating on TREC
queries. This approach was also used by the TREC 2009 Web Track judges to
obtain more representative topic descriptions from real queries. Keywords: diversity, intents, subtopics | |||
| Conversion rate based bid adjustment for sponsored search | | BIBAK | Full-Text | 1173-1174 | |
| Benjamin Rey; Ashvin Kannan | |||
| Advertisers use Sponsored Search to drive traffic to their site at a
conversion rate and cost per conversion that provides value to them. However,
very often advertisers bid at a constant price on a bundle of keywords, either
for lack of enough data to fully optimize their bids at a keyword level, or
indirectly by opting into Advanced Matching (AM) that allows an advertiser to
reach a large number of queries while explicitly bidding only on a limited
number. Then this single bid price reflects the return the advertiser gets from
the full bundle. Under these conditions, the advertiser competes too
aggressively for some keyword auctions and bids too low for others. In
this paper, we propose a solution to improve the fairness of each keyword's bid
prices within an AM bundle: adjusting the AM keyword bid by the ratio of its
conversion rate to the conversion rate it would have reached had it been an
Exact Match (EM). First we describe how we measure advertisers' conversion
rates despite the opt-in nature of conversion tracking, and illustrate the need
for bid adjustment in the context of AM. Then we present our approach to
predict conversion rates in a robust manner. Our model uses a number of
features capturing the quality of the match between the ad and the query. Then
we describe how we adjust keyword bid prices to reflect their value to the
advertiser thereby improving (1) the auction through fewer incorrectly high
bids in the auction, (2) advertiser return through more auctions won by high
value keywords and less by low value keywords, and (3) user satisfaction
through higher conversion rate. Finally, we present experimental results from
our live system. Keywords: ad auctions, advanced match, bid optimization, conversion rate, cost per
action, search engine, sponsored search | |||
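A minimal worked example of the adjustment described: scale an Advanced Match keyword's bid by the ratio of its conversion rate to the conversion rate it is estimated to have reached as an Exact Match. The numbers are hypothetical.

```python
bundle_bid = 1.00          # advertiser's single bid for the AM bundle, in dollars
am_conversion_rate = 0.02  # observed conversion rate when matched via AM
em_conversion_rate = 0.05  # estimated conversion rate had it been an EM keyword

adjusted_bid = bundle_bid * am_conversion_rate / em_conversion_rate
print(f"adjusted bid: ${adjusted_bid:.2f}")   # $0.40: bid less where AM converts worse
```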
| useKit: a step towards the executable web 3.0 | | BIBAK | Full-Text | 1175-1176 | |
| Sven Rizzotti; Helmar Burkhart | |||
| Many situations from daily life have found their counterpart on the
Internet. Buying or selling goods, discovering information, and making social
connections have been very successful in the virtual world and make the Web a
central working place.
Until now, the purpose of a Web site has been defined at the time of its creation. Extending the available functionality or transferring it onto another site has only been possible with considerable additional effort. With useKit, we present a software platform that allows users to add individually selected functionalities to any Web site without installing software. useKit offers a new approach towards a collaborative, personalized and individually compiled view of the Web. useKit focuses on personalized applications and services that can be applied to any Web site. In analogy to file system permissions, this can be seen as an instance of the "executable Web", or Web 3.0, if Web 1.0 was "read-only" and Web 2.0 was "read-write". User-contributed code will morph online applications into omnipresent functional platforms with a single interface. Keywords: mashup, personalization, refinement, service composition, web3.0 | |||
| Web-scale k-means clustering | | BIBAK | Full-Text | 1177-1178 | |
| D. Sculley | |||
| We present two modifications to the popular k-means clustering algorithm to
address the extreme requirements for latency, scalability, and sparsity
encountered in user-facing web applications. First, we propose the use of
mini-batch optimization for k-means clustering. This reduces computation cost
by orders of magnitude compared to the classic batch algorithm while yielding
significantly better solutions than online stochastic gradient descent. Second,
we achieve sparsity with projected gradient descent, and give a fast
ε-accurate projection onto the L1-ball. Source code is freely
available: http://code.google.com/p/sofia-ml Keywords: scalability, sparse solutions, unsupervised clustering | |||
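A minimal Python sketch of the mini-batch update with per-center learning rates, in the spirit of the algorithm described; the sparsity/L1-projection component and the released sofia-ml code are not reproduced, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(6, 1, (500, 2))])

k, batch_size, n_iters = 2, 50, 100
centers = X[rng.choice(len(X), k, replace=False)].copy()
counts = np.zeros(k)

for _ in range(n_iters):
    batch = X[rng.choice(len(X), batch_size, replace=False)]
    # Assign each batch point to its nearest center (assignments cached per batch)
    dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    for x, c in zip(batch, assign):
        counts[c] += 1
        eta = 1.0 / counts[c]                          # per-center learning rate
        centers[c] = (1 - eta) * centers[c] + eta * x  # gradient step toward x

print(centers)   # should land near (0, 0) and (6, 6)
```

Because each iteration touches only a small batch, the cost per update is independent of the data set size, which is where the orders-of-magnitude speedup over batch k-means comes from.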
| Learning based access control in online social networks | | BIBAK | Full-Text | 1179-1180 | |
| Mohamed Shehab; Gorrell Cheek; Hakim Touati; Anna C. Squicciarini; Pau-Chen Cheng | |||
| Online social networking sites are experiencing tremendous user growth with
hundreds of millions of active users. As a result, there is a tremendous amount
of user profile data online, e.g., name, birthdate, etc. Protecting this data
is a challenge. The task of access policy composition is a tedious and
confusing effort for the average user having hundreds of friends. We propose an
approach that assists users in composing and managing their access control
policies. Our approach is based on a supervised learning mechanism that
leverages user provided example policy settings as training sets to build
classifiers that are the basis for auto-generated policies. Furthermore, we
provide mechanisms to enable users to fuse policy decisions that are provided
by their friends or others in the social network. These policies then regulate
access to user profile objects. We implemented our approach and, through
extensive experimentation, demonstrate the accuracy of our proposed mechanisms. Keywords: privacy, social networks, supervised learning | |||
| Situation detection and control using spatio-temporal analysis of microblogs | | BIBAK | Full-Text | 1181-1182 | |
| Vivek K. Singh; Mingyan Gao; Ramesh Jain | |||
| Large volumes of the spatio-temporal-thematic data being created on sites
like Twitter and Jaiku can potentially be combined to detect events and to
understand various 'situations' as they are evolving at different
spatio-temporal granularity across the world. Taking inspiration from
traditional image pixels which represent aggregation of photon energies at a
location, we consider aggregation of user interest levels at different
geo-locations as social pixels. Combining such pixels spatio-temporally allows
for creation of social images and video. Here, we describe how the use of
relevant (media processing inspired) situation detection operators upon such
'images', and domain based rules can be used to decide relevant control
actions. The ideas are showcased using a Swine flu monitoring application which
uses Twitter data. Keywords: control, event, microblogs, situation, twitter | |||
| Structural analysis of the emerging event-web | | BIBAK | Full-Text | 1183-1184 | |
| Vivek K. Singh; Ramesh Jain | |||
| Events are the fundamental abstractions to study the dynamic world. We
believe that the next generation of the web (i.e. the event-web) will focus on
interconnections between events as they occur across space and time [3]. In
fact we argue that the real value of large volumes of microblog data being
created daily lies in its inherent spatio-temporality, and its correlation with
the real-world events. In this context, we studied the structural properties of
a corpus of 5,835,237 Twitter microblogs, and found it to exhibit power laws
across space and time, much like those exhibited by events in multiple domains.
The properties studied over microblogs on different topics can be applied to
study relationships between related events, as well as data organization for
event-based, real-time, and location-aware applications. Keywords: event-web, microblogs, pareto law, power law, zipfs law | |||
| Tagging and navigability | | BIBAK | Full-Text | 1185-1186 | |
| Adish Singla; Ingmar Weber | |||
| We consider the problem of optimal tagging for navigational purposes in
one's own collection. What is the best that a forgetful user can hope for in
terms of ease of retrieving a labeled object? We prove that the number of tags
has to increase logarithmically in the collection size to maintain a manageable
result set. Using Flickr data we then show that users do indeed apply more and
more tags as their collection grows and that this is not due to a global
increase in tagging activity. However, as the additional terms applied are not
statistically independent, users of large collections still have to deal with
larger and larger result sets, even when more tags are used as search terms. We
pose optimal tag suggestion for navigational purposes as an open problem. Keywords: data organization, navigability, tagging | |||
| Sampling high-quality clicks from noisy click data | | BIBAK | Full-Text | 1187-1188 | |
| Adish Singla; Ryen W. White | |||
| Click data captures many users' document preferences for a query and has
been shown to help significantly improve search engine ranking. However, most
click data is noisy and of low frequency, with queries associated to documents
via only one or a few clicks. This severely limits the usefulness of click data
as a ranking signal. Given potentially noisy clicks comprising results with at
most one click for a query, how do we extract high-quality clicks that may be
useful for ranking? In this poster, we introduce a technique based on query
entropy for noise reduction in click data. We study the effect of query entropy
as well as features such as user engagement and the match between the query
and the document. Based on query entropy plus other features, we can sample
noisy data down to 15% of its overall size with 43% query recall and an average
increase of 20% in precision for recalled queries. Keywords: click data, noise elimination, query entropy, web search ranking | |||
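The core signal described above, click entropy per query, can be sketched as follows. This only illustrates the idea; the poster's thresholds, additional features, and sampling procedure are not reproduced.

```python
# Queries whose clicks concentrate on few documents (low entropy) are treated
# as higher-quality signal; high-entropy queries are filtered out.
import math
from collections import Counter, defaultdict

def query_entropy(clicks):
    """clicks: list of (query, url) pairs -> {query: click entropy in bits}."""
    per_query = defaultdict(Counter)
    for query, url in clicks:
        per_query[query][url] += 1
    entropy = {}
    for query, counts in per_query.items():
        total = sum(counts.values())
        entropy[query] = -sum((c / total) * math.log2(c / total)
                              for c in counts.values())
    return entropy

def sample_high_quality(clicks, max_entropy=1.0):
    ent = query_entropy(clicks)
    return [(q, u) for q, u in clicks if ent[q] <= max_entropy]
```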
| Estimating the web robot population | | BIBAK | Full-Text | 1189-1190 | |
| Yang Sun; C. Lee Giles | |||
| In this research, capture-recapture (CR) models are used to estimate the
population of web robots based on web server access logs from different
websites. Each robot is considered as an individual randomly surfing the web
and each website is considered as a trap that records the visitation of robots.
We use a maximum likelihood estimator to fit the observation data. Results show
that there are 3,860 identifiable robot User-Agent strings and 780,760 IP
addresses being used by web robots around the world. We also examine the
origin of the named robots based on their IP addresses. The results suggest that
over 50% of web robot IP addresses are from the United States and China. Keywords: web robot population | |||
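For intuition, the simplest two-sample instance of capture-recapture (the Lincoln-Petersen estimator) fits in a few lines; the study itself fits maximum likelihood CR models over many websites, so this sketch only conveys the underlying idea. The IP addresses below are fabricated.

```python
# Two websites act as two "traps"; robots seen in both logs are "recaptures".
def lincoln_petersen(site_a_ips, site_b_ips):
    """Estimate total robot population from two server logs."""
    a, b = set(site_a_ips), set(site_b_ips)
    recaptured = len(a & b)
    if recaptured == 0:
        raise ValueError("no overlap between the two logs; estimate undefined")
    return len(a) * len(b) / recaptured

print(lincoln_petersen({"1.2.3.4", "5.6.7.8", "9.9.9.9"},
                       {"5.6.7.8", "9.9.9.9", "8.8.8.8", "7.7.7.7"}))  # 6.0
```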
| Automated detection of session fixation vulnerabilities | | BIBAK | Full-Text | 1191-1192 | |
| Yusuke Takamatsu; Yuji Kosuga; Kenji Kono | |||
| Session fixation is a technique for obtaining the visitor's session
identifier (SID) by forcing the visitor to use the SID supplied by the
attacker. The attacker who obtains the victim's SID can masquerade as the
visitor. In this paper, we propose a technique to automatically detect session
fixation vulnerabilities in web applications. Our technique uses an attack
simulator that executes a real session fixation attack and checks whether it
succeeds. In our experiments, the system successfully detected
vulnerabilities in our original test cases and in a real world web application. Keywords: session fixation, web application security | |||
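The essence of such an attack simulation can be sketched as below. This is a hedged illustration, not the authors' tool: the target URL, cookie name, and credentials are placeholders, and real applications may manage sessions differently.

```python
# Supply an attacker-chosen session cookie, perform a login, and report the
# application as vulnerable if the same SID is still in use afterwards.
import requests

def session_fixation_check(login_url, cookie_name, credentials):
    fixed_sid = "attacker-chosen-sid"
    session = requests.Session()
    session.cookies.set(cookie_name, fixed_sid)
    session.post(login_url, data=credentials)
    sid_after_login = session.cookies.get(cookie_name)
    # If the application did not issue a fresh SID at login, an attacker who
    # planted fixed_sid can now ride the authenticated session.
    return sid_after_login == fixed_sid

# Example with a placeholder target:
# vulnerable = session_fixation_check("http://testapp.example/login",
#                                     "PHPSESSID",
#                                     {"user": "alice", "password": "secret"})
```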
| Keyword search over key-value stores | | BIBAK | Full-Text | 1193-1194 | |
| Arash Termehchy; Marianne Winslett | |||
| Key-value stores (KVSs) are the most prevalent storage systems for large
scale web services. As they do not have the structural complexities of RDBMSs,
they are more efficient. In this paper, we introduce the problem of keyword
search over KVSs and analyze its differences from keyword search over documents
and relational databases. We propose a novel method called Keyword Search over
Key-value stores (KSTORE) to solve the problem. Our user study using two real
life data sets shows that KSTORE provides better ranking than the extended
versions of document and relational database keyword search proposals while imposing
reasonable overhead. Keywords: cloud computing, key-value stores, keyword search | |||
| Scalable discovery of contradictions on the web | | BIBAK | Full-Text | 1195-1196 | |
| Mikalai Tsytsarau; Themis Palpanas; Kerstin Denecke | |||
| Our study addresses the problem of large-scale contradiction detection and
management for data extracted from the Web. We describe the first systematic
solution to the problem, based on a novel statistical measure for
contradictions, which exploits first- and second-order moments of sentiments.
Our approach enables the interactive analysis and online identification of
contradictions under multiple levels of time granularity. The proposed
algorithm can be used to analyze and track opinion evolution over time and to
identify interesting trends and patterns. It uses an incrementally updatable
data structure to achieve computational efficiency and scalability. Experiments
with real datasets show promising time performance and accuracy. Keywords: contradiction analysis, opinion mining | |||
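To make the moment-based idea concrete, one illustrative score (not the paper's exact measure) combines the first- and second-order moments of the sentiment values in a time window: high variance around a near-neutral mean signals contradiction.

```python
# Contradiction score from sentiment moments; sentiments lie in [-1, 1].
def contradiction_score(sentiments, eps=1e-6):
    n = len(sentiments)
    mean = sum(sentiments) / n                     # first-order moment
    second = sum(s * s for s in sentiments) / n    # second-order moment
    variance = second - mean * mean
    # Large spread around a near-neutral mean => strong contradiction.
    return variance / (mean * mean + eps)

print(contradiction_score([1, -1, 1, -1]))   # highly contradictory
print(contradiction_score([0.8, 0.9, 1.0]))  # consistent, near zero
```

Because both moments can be maintained incrementally as new sentiments arrive, a score of this shape fits naturally with the incrementally updatable data structure mentioned in the abstract.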
| A practical system for harvesting and monitoring hot topics on the web | | BIBAK | Full-Text | 1197-1198 | |
| Xiaojun Wan; Jianwu Yang | |||
| This poster briefly describes a practical system named FounderWISE for
harvesting and monitoring hot topics on the Web. FounderWISE consists of five
components: Web crawler, text classifier, topic detector, topic summarizer and
topic analyzer. In this poster we present two key components: the topic detector
and the topic analyzer. The system has been successfully deployed in several
major Chinese government departments. Keywords: founderwise, information diffusion, topic detection, topic evolution | |||
| Co-optimization of multiple relevance metrics in web search | | BIBAK | Full-Text | 1199-1200 | |
| Dong Wang; Chenguang Zhu; Weizhu Chen; Gang Wang; Zheng Chen | |||
| Several relevance metrics, such as NDCG, precision and pSkip, have been proposed
to measure search relevance, each characterizing search
relevance from a different perspective. Yet we empirically find that directly
optimizing one metric does not always achieve the optimal ranking for another
metric. In this paper, we propose two novel relevance optimization approaches,
which consider multiple metrics jointly, with the objective of achieving
an ideal tradeoff among them. To achieve this
objective, we propose to co-optimize multiple relevance metrics and show their
effectiveness. Keywords: lambdarank, learning to rank, user feedback | |||
| Analyzing content-level properties of the web adversphere | | BIBAK | Full-Text | 1201-1202 | |
| Yong Wang; Daniel Burgener; Aleksandar Kuzmanovic; Gabriel Maciá-Fernández | |||
| Advertising has become an integral and inseparable part of the World Wide
Web. However, neither public auditing nor monitoring mechanisms yet exist in
this emerging area. In this paper, we present our initial efforts on building a
content-level auditing service for web-based ad networks. Our content-level
measurements -- understanding the ad distribution mechanisms and evaluating
location-based and behavioral targeting approaches -- bring useful auditing
information to all entities involved in the online advertising business. We
extensively evaluate Google's, AOL's, and Adblade's ad networks and demonstrate
how their different design philosophies strongly affect their performance at
the content level. Keywords: behavioral targeting, location-based advertising, web-based ad network | |||
| How accurately can one's interests be inferred from friends | | BIBAK | Full-Text | 1203-1204 | |
| Zhen Wen; Ching-Yung Lin | |||
| Search and recommendation systems must effectively model user interests in
order to provide personalized results. The proliferation of social software
makes social networks an increasingly important source for user interest
modeling, because of the social influence and correlation among friends.
However, there are large variations in people's contribution of social content.
Therefore, it is impractical to accurately model interests for all users. As a
result, applications need to decide whether to utilize a user interest model
based on its accuracy. To address this challenge, we present a study on the
accuracy of user interests inferred from three types of social content: social
bookmarking, file sharing, and electronic communication, in an organizational
social network within a large-scale enterprise. First, we demonstrate that
combining different types of social content to infer user interests outperforms
methods that use only one type of social content. Second, we present a
technique to predict the inference accuracy based on easily observed network
characteristics, including user activeness, network in-degree, out-degree, and
betweenness centrality. Keywords: accuracy, social networks, user modeling | |||
| Learning to evaluate the visual quality of web pages | | BIBAK | Full-Text | 1205-1206 | |
| Ou Wu; Yunfei Chen; Bing Li; Weiming Hu | |||
| A beautiful and well-laid-out Web page greatly facilitates users' access to content
and enhances the browsing experience. We use "visual quality (VQ)" to denote the
aesthetics of Web pages. In this paper, a computational aesthetics approach is
proposed to learn the evaluation model for the visual quality of Web pages.
First, a Web page layout extraction algorithm (V-LBE) is introduced to
partition a Web page into major layout blocks. Then, regarding a Web page as a
semi-structured image, features known to significantly affect the visual
quality of a Web page are extracted to construct a feature vector. The
experimental results show the initial success of our approach. Potential
applications include Web search and Web design. Keywords: aesthetics, learning, semi-structured image, visual quality | |||
| Are search engine users equally reliable? | | BIBAK | Full-Text | 1207-1208 | |
| Qianli Xing; Yiqun Liu; Rongwei Cen; Min Zhang; Shaoping Ma; Liyun Ru | |||
| In this paper, we study the reliability of search engine users using
click-through data. We propose a graph-based approach to evaluate user
reliability according to how users click on search result lists. We
incorporate this measure of reliability into relevance feedback to improve
ranking performance. Experimental results indicate that the proposed approach
is both effective and applicable. Keywords: relevance feedback, user reliability, web search | |||
| RerankEverything: a reranking interface for browsing search results | | BIBAK | Full-Text | 1209-1210 | |
| Takehiro Yamamoto; Satoshi Nakamura; Katsumi Tanaka | |||
| This paper proposes a system called RerankEverything, which enables users to
rerank search results in any search service, such as Web search engines,
e-commerce sites, and hotel reservation sites. In conventional search
services, interactions between users and services are quite limited and
complicated. In addition, search functions and interactions to refine search
results differ depending on the services. By using RerankEverything, users can
interactively explore search results in accordance with their interests by
reranking search results from various viewpoints. Keywords: reranking, search user interfaces, wrapper generation | |||
| LINKREC: a unified framework for link recommendation with user attributes and graph structure | | BIBAK | Full-Text | 1211-1212 | |
| Zhijun Yin; Manish Gupta; Tim Weninger; Jiawei Han | |||
| With the phenomenal success of networking sites (e.g., Facebook, Twitter and
LinkedIn), social networks have drawn substantial attention. On online social
networking sites, link recommendation is a critical task that not only helps
improve user experience but also plays an essential role in network growth. In
this paper we propose several link recommendation criteria, based on both user
attributes and graph structure. To discover the candidates that satisfy these
criteria, link relevance is estimated using a random walk algorithm on an
augmented social graph with both attribute and structure information. The
global and local influence of the attributes is leveraged in the framework as
well. Besides link recommendation, our framework can also rank attributes in a
social network. Experiments on DBLP and IMDB data sets demonstrate that our
method outperforms state-of-the-art methods based on network structure and node
attribute information for link recommendation. Keywords: link recommendation, random walk | |||
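The random-walk component can be illustrated with a generic random walk with restart; the paper's attribute-augmented graph construction and influence weighting are not reproduced here, and the toy graph is invented.

```python
# Random walk with restart (RWR): stationary visit probabilities serve as link
# relevance scores for candidate nodes relative to a source user.
def random_walk_with_restart(graph, source, restart=0.15, iters=50):
    """graph: {node: [neighbor, ...]}; returns visit probabilities per node."""
    nodes = list(graph)
    scores = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            neighbors = graph[n] or [n]           # dangling nodes self-loop
            share = (1.0 - restart) * scores[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        nxt[source] += restart                    # restart mass returns to source
        scores = nxt
    return scores

g = {"u": ["a", "b"], "a": ["u", "b"], "b": ["u", "a", "c"], "c": ["b"]}
ranked = sorted(random_walk_with_restart(g, "u").items(), key=lambda kv: -kv[1])
print(ranked)  # candidates ordered by link relevance to "u"
```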
| A link-based similarity measure for scientific literature | | BIBAK | Full-Text | 1213-1214 | |
| Seok-Ho Yoon; Sang-Wook Kim; Sunju Park | |||
| In this paper, we propose a new approach to measure similarities among
academic papers based on their references. Our similarity measure uses both
in-links and out-links by transforming them into undirected
links. Keywords: link-based similarity measure, scientific literature | |||
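A minimal sketch of the in-link/out-link transformation: citations are symmetrized into undirected edges, and papers are then compared by their combined neighborhoods. Jaccard similarity is used purely for illustration and is not necessarily the measure proposed in the paper.

```python
# Symmetrize citation links, then compare papers by their undirected neighbors.
def undirected_neighbors(out_links):
    """out_links: {paper: set of cited papers} -> {paper: set of linked papers}."""
    neighbors = {p: set(refs) for p, refs in out_links.items()}
    for p, refs in out_links.items():
        for q in refs:
            neighbors.setdefault(q, set()).add(p)   # add the reverse direction
    return neighbors

def link_similarity(neighbors, p, q):
    a, b = neighbors.get(p, set()), neighbors.get(q, set())
    return len(a & b) / len(a | b) if (a or b) else 0.0

nb = undirected_neighbors({"p1": {"x", "y"}, "p2": {"y", "z"}, "x": set()})
print(link_similarity(nb, "p1", "p2"))  # shared neighbor "y" -> ~0.33
```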
| Social group suggestion from user image collections | | BIBAK | Full-Text | 1215-1216 | |
| Jie Yu; Xin Jin; Jiawei Han; Jiebo Luo | |||
| Photo-sharing services have attracted millions of people and helped
construct massive social networks on the Web. A popular trend is that users
share their image collections within social groups, which greatly promotes the
interactions between users and expands their social networks. Existing systems
have difficulties in generating satisfactory social group suggestions because
the images are classified independently and their relationship in a collection
is ignored. In this work, we intend to produce suggestions of suitable
photo-sharing groups from a user's personal photo collection by mining images
on the Web and leveraging the collection context. Both visual content and
textual annotations are integrated to generate initial prediction of the events
or topics depicted in the images. A user collection-based label propagation
method is proposed to improve the group suggestion by modeling the relationship
of images in the same collection as a sparse weighted graph. Experiments on
real user images and comparisons with the state-of-the-art techniques
demonstrate the effectiveness of the proposed approaches. Keywords: group recommendation, label propagation, social image | |||
| A quality-aware model for sales prediction using reviews | | BIBAK | Full-Text | 1217-1218 | |
| Xiaohui Yu; Yang Liu; Xiangji Huang; Aijun An | |||
| Writing and publishing reviews online has become an increasingly popular way
for people to express opinions and sentiments. Analyzing the large volume of
online reviews available can produce useful knowledge that is of interest to
vendors and other parties. Prior studies in the literature have shown that
online reviews have a significant correlation with the sales of products, and
therefore mining the reviews could help predict the sales performance of
relevant products. However, those studies fail to consider one important factor
that may significantly affect the accuracy of the prediction, i.e., the quality
of the reviews. In this paper, we propose a regression model that explicitly
takes into account the quality factor, and discuss how this quality
information can be predicted when it is not readily available. Experimental
results on a movie review dataset confirm the effectiveness of the proposed
model. Keywords: review mining, review quality mining | |||
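One simple way to make a sales regression "quality-aware", shown only as an illustration of the idea rather than the paper's model, is to weight each observation by an (assumed or predicted) review-quality score in a weighted least-squares fit. All numbers below are fabricated.

```python
# Weighted least squares: low-quality reviews influence the fit less.
import numpy as np

sentiment = np.array([0.2, 0.8, 0.6, 0.9, 0.1])   # aggregate review sentiment
quality   = np.array([0.3, 0.9, 0.7, 1.0, 0.2])   # estimated review quality
sales     = np.array([10., 42., 30., 50., 12.])   # observed sales

X = np.column_stack([np.ones_like(sentiment), sentiment])
W = np.diag(quality)                               # weight rows by quality
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ sales)
print(beta)  # weighted intercept and sentiment coefficient
```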
| Review recommendation with graphical model and EM algorithm | | BIBAK | Full-Text | 1219-1220 | |
| Richong Zhang; Thomas Tran | |||
| Automatically assessing the quality and helpfulness of consumer reviews is
increasingly desirable as online review
systems continue to evolve. Existing helpfulness assessment methodologies make use of the positive
vote fraction as a benchmark and heuristically find a "best guess" to estimate
the helpfulness of review documents. This benchmarking methodology ignores the
voter population size and treats the same positive vote fraction as the same
helpfulness value. We propose a review recommendation approach that makes use of
the probability density of the review helpfulness as the benchmark and exploits a
graphical model and the Expectation Maximization (EM) algorithm for the inference
of review helpfulness. The experimental results demonstrate that the proposed
approach is superior to existing approaches. Keywords: helpfulness, online review, recommendation | |||
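The abstract's central observation, that the same positive-vote fraction should not mean the same helpfulness when the voter populations differ, can be illustrated with a simple Beta posterior over helpfulness; the paper's graphical model and EM inference are more elaborate and are not shown here.

```python
# Same 80% positive fraction, different vote counts: the posterior over
# helpfulness is wide for few votes and sharp for many votes.
def helpfulness_posterior(pos, neg, prior_a=1.0, prior_b=1.0):
    """Beta(prior_a + pos, prior_b + neg) posterior mean and variance."""
    a, b = prior_a + pos, prior_b + neg
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

print(helpfulness_posterior(4, 1))     # 4/5 votes: wide posterior
print(helpfulness_posterior(400, 100)) # 400/500 votes: sharp posterior
```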
| Selective recrawling for object-level vertical search | | BIBAK | Full-Text | 1221-1222 | |
| Yaqian Zhou; Mengjing Jiang; Qi Zhang; Xuanjing Huang; Lide Wu | |||
| In this paper we propose a novel recrawling method based on navigation
patterns called Selective Recrawling. The goal of selective recrawling is to
automatically select page collections that have large coverage and little
redundancy to a pre-defined vertical domain. It only requires several seed
objects and can select a set of URL patterns to cover most objects. The
selected set can be used to recrawl the web pages for an extended period of time
and renewed periodically. Experiments on local event data show that our method
can greatly reduce the downloading of web pages while keeping comparable
object coverage. Keywords: recrawling, vertical search | |||
| Efficient web pages identification for entity resolution | | BIBAK | Full-Text | 1223-1224 | |
| Jia Zhu; Gabriel Fung; Xiaofang Zhou | |||
| Entity resolution (ER) is a problem that arises in many areas. In most
cases, it is the task of determining whether multiple entities from different sources
refer to the same object or to different objects, because
no unique identifiers are associated with them. In this paper, we
propose a model that uses web page identification to identify entities and merge
together those entities that refer to the same object. We use a classical name
disambiguation problem as a case study and examine our model on a subset of
digital library records as the first stage of our work. The favorable results
indicate that our proposed approach is highly effective. Keywords: entity resolution, name disambiguation, web pages identification | |||
| Anonymizing user profiles for personalized web search | | BIBAK | Full-Text | 1225-1226 | |
| Yun Zhu; Li Xiong; Christopher Verdery | |||
| We study the problem of anonymizing user profiles so that user privacy is
sufficiently protected while the anonymized profiles are still effective in
enabling personalized web search. We propose a Bayes-optimal privacy notion to
bound the prior and posterior probability of associating a user with an
individual term in the anonymized user profile set. We also propose a novel
bundling technique that clusters user profiles into groups by taking into
account the semantic relationships between the terms while satisfying the
privacy constraint. We evaluate our approach through a set of preliminary
experiments using real data demonstrating its feasibility and effectiveness. Keywords: anonymization, personalized search, privacy-preserving data publishing | |||
| Patch-based skin color detection and its application to pornography image filtering | | BIBAK | Full-Text | 1227-1228 | |
| Haiqiang Zuo; Weiming Hu; Ou Wu | |||
| Along with the explosive growth of the World Wide Web, an immense industry
for the production and consumption of pornography has grown. Though
censorship and legal restraints on pornography differ across
historical, cultural and national contexts, selling pornography to minors is
not allowed in most cases. Detecting human skin tone is of utmost importance in
pornography image filtering algorithms. In this paper, we propose two
patch-based skin color detection algorithms: regular patch and irregular patch
skin color detection algorithms. On the basis of skin detection, we extract
31-dimensional features from the input image, and these features are fed into a
random forest classifier. Our algorithm has been incorporated into an
adult-content filtering infrastructure, and is now in active use for preventing
minors from accessing pornographic images via mobile phones. Keywords: covariance descriptors, pornography, skin color detection, skin patch, skin
tone | |||
| Access: news and blog analysis for the social sciences | | BIBAK | Full-Text | 1229-1232 | |
| Mikhail Bautin; Charles B. Ward; Akshay Patil; Steven S. Skiena | |||
| The social sciences strive to understand the political, social, and cultural
world around us, but have been impaired by limited access to the quantitative
data sources enjoyed by the hard sciences. Careful analysis of Web document
streams holds enormous potential to solve longstanding problems in a variety of
social science disciplines through massive data analysis. This paper introduces
the TextMap Access system, which provides ready access to a wealth of
interesting statistics on millions of people, places, and things across a
number of interesting web corpora. Powered by a flexible and scalable
distributed statistics computation framework using Hadoop, the system draws on continually updated
corpora including newspapers, blogs, patent records, legal documents, and
scientific abstracts -- well over a terabyte of raw text and growing daily. The
Lydia Textmap Access system, available through http://www.textmap.com/access,
provides instant access for students and scholars through a convenient web
user-interface. We describe the architecture of the TextMap Access system, and
its impact on current research in political science, sociology, and
business/marketing. Keywords: blog analysis, Hadoop, news analysis, social sciences | |||
| Hearsay: a new generation context-driven multi-modal assistive web browser | | BIBAK | Full-Text | 1233-1236 | |
| Yevgen Borodin; Faisal Ahmed; Muhammad Asiful Islam; Yury Puzis; Valentyn Melnyk; Song Feng; I. V. Ramakrishnan; Glenn Dausch | |||
| This demo will present HearSay, a multi-modal non-visual web browser, which
aims to bridge the growing Web Accessibility divide between individuals with
visual impairments and their sighted counterparts, and to facilitate full
participation of blind individuals in the growing Web-based society. Keywords: assistive browser, audio interface, blind users, multi-modal, screen reader,
web accessibility | |||
| A client-server architecture for state-dependent dynamic visualizations on the web | | BIBAK | Full-Text | 1237-1240 | |
| Daniel Coffman; Danny Soroker; Chandra Narayanaswami; Aaron Zinman | |||
| As sophisticated enterprise applications move to the Web, some advanced user
experiences become difficult to migrate due to prohibitively high computation,
memory, and bandwidth requirements. State-dependent visualizations of
large-scale data sets are particularly difficult since a change in the client's
context necessitates a change in the displayed results. This paper describes a
Web architecture where clients are served a session-specific image of the data,
with this image divided into tiles dynamically generated by the server. This
set of tiles is supplemented with a corpus of metadata describing the immediate
vicinity of interest; additional metadata is delivered as needed in a
progressive fashion in support and anticipation of the user's actions. We
discuss how the design of this architecture was motivated by the goal of
delivering a highly responsive user experience. As an example of a complete
application built upon this architecture, we present OrgMaps, an interactive
system for navigating hierarchical data, enabling fluid, low-latency navigation
of trees of hundreds of thousands of nodes on standard Web browsers using only
HTML and JavaScript. Keywords: rich internet applications | |||
| Hierarchical cluster visualization in web mapping systems | | BIBAK | Full-Text | 1241-1244 | |
| Jean-Yves Delort | |||
| This paper presents a technique for visualizing large spatial data sets in
Web Mapping Systems (WMS). The technique creates a hierarchical clustering
tree, which is subsequently used to extract clusters that can be displayed at a
given scale without cluttering the map. Voronoi polygons are used as
aggregation symbols to represent the clusters. This technique retains
hierarchical relationships between data items at different scales. In addition,
aggregation symbols do not overlap, and their sizes and the number of points
that they cover are controlled by the same parameter. A prototype has been
implemented and tested showing the effectiveness of the method for visualizing
large data sets in WMS. Keywords: cluster footprints, hierarchical clustering, visualization | |||
| Rants: a framework for rank editing and sharing in web search | | BIBAK | Full-Text | 1245-1248 | |
| Byron J. Gao; Joey Jan | |||
| With a Wiki-like search interface, users can edit ranks of search results
and share the edits with the rest of the world. This is an effective way of
personalization, as well as a practice of mass collaboration that allows users
to vote on rankings and improve search performance. Currently, there are
several ongoing experimentation efforts from the industry, e.g., SearchWiki by
Google and U Rank by Microsoft. Beyond that, there is little published research
on this new search paradigm. In this paper, we make an effort to establish a
framework for rank editing and sharing in the context of web search, where we
identify fundamental issues and propose principled solutions. Compared to
existing systems, for rank editing, our framework allows users to specify not
only relative, but also absolute preferences. For edit sharing, our framework
provides enhanced flexibility, allowing users to select arbitrarily aggregated
views. In addition, edits can be shared among similar queries. We present a
prototype system, Rants, that implements the framework and provides search
services through the Google web search API. Keywords: edit sharing, mass collaboration, personalization, rank editing, search
interface, social search | |||
| QuickSuggest: character prediction on web appliances | | BIBAK | Full-Text | 1249-1252 | |
| Ullas Gargi; Rich Gossweiler | |||
| As traditional media and information devices integrate with the web, they
must suddenly support a vastly larger database of relevant items. Many devices
use remote controls with on-screen keyboards which are not well suited for text
entry but are difficult to displace. We introduce a text entry method which
significantly improves text entry speed for on-screen keyboards using the same
simple Up/Down/Left/Right/Enter interface common to remote controls and gaming
devices. The paper describes QuickSuggest's novel adaptive
user interface, demonstrates quantitative improvements from simulation results
on millions of user queries and shows ease of use and efficiency with no
learning curve in user experiments. Keywords: language models, onscreen keyboards, predictive text input, query
suggestions, remote controls, user interface, web appliances | |||
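A toy version of the character-prediction idea, assuming a plain list of past queries as the language model (QuickSuggest's actual models and interface logic are not reproduced): rank candidate next characters by how often they follow the typed prefix, so the most likely letters can be reached with fewer D-pad presses.

```python
# Rank the next character by frequency of continuation in a query log.
from collections import Counter

def next_char_ranking(query_log, prefix):
    counts = Counter(q[len(prefix)] for q in query_log
                     if q.startswith(prefix) and len(q) > len(prefix))
    return [ch for ch, _ in counts.most_common()]

log = ["weather", "web search", "wedding ideas", "western movies"]
print(next_char_ranking(log, "we"))  # candidate next characters by frequency
```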
| visKQWL, a visual renderer for a semantic web query language | | BIBAK | Full-Text | 1253-1256 | |
| Andreas Hartl; Klara Weiand; François Bry | |||
| KiWi is a semantic Wiki that combines the Wiki philosophy of collaborative
content creation with the methods of the Semantic Web in order to enable
effective knowledge management.
Querying a Wiki must be simple enough for beginning users, yet powerful enough to accommodate experienced users. To this end, the keyword-based KiWi query language (KWQL) supports queries ranging from simple lists of keywords to expressive rules for selecting and reshaping Wiki (meta-)data. In this demo, we showcase visKWQL, a visual interface for the KWQL language aimed at supporting users in the query construction process. visKWQL and its editor are described, and their functionality is illustrated using example queries. visKWQL's editor provides guidance throughout the query construction process through hints, warnings and highlighting of syntactic errors. The editor enables round-tripping between the twin languages KWQL and visKWQL, meaning that users can switch freely between the textual and visual form when constructing or editing a query. It is implemented using HTML, JavaScript, and CSS, and can thus be used in (almost) any web browser without any additional software. Keywords: keyword querying, semantic web, semantic wikis, visual query languages, wiki | |||
| GRAPE: a system for disambiguating and tagging people names in web search | | BIBAK | Full-Text | 1257-1260 | |
| Lili Jiang; Wei Shen; Jianyong Wang; Ning An | |||
| Name ambiguity is a big challenge in people information retrieval and has
received considerable attention, especially with the increasing volume of Web
data in recent years. In this demo, we present a system, GRAPE, which is
capable of finding people-related information over the Web. The salient
features of our system are people name disambiguation and people tag
presentation, which effectively distinguish different people entities sharing
the same name and uniquely represent each namesake with a cluster of tags, such
as occupation, birthdate, and organization. Keywords: clustering, name disambiguation, tag presentation | |||
| iRIN: image retrieval in image-rich information networks | | BIBAK | Full-Text | 1261-1264 | |
| Xin Jin; Jiebo Luo; Jie Yu; Gang Wang; Dhiraj Joshi; Jiawei Han | |||
| In this demo, we present a system called iRIN designed for performing image
retrieval in image-rich information networks. We first introduce MoK-SimRank to
significantly improve the speed of SimRank, one of the most popular algorithms
for computing node similarity in information networks. Next, we propose an
algorithm called SimLearn to (1) extend MoK-SimRank to heterogeneous image-rich
information networks, and (2) account for both link-based and content-based
similarities by seamlessly integrating reinforcement learning with feature
learning. Keywords: image retrieval, information network, ranking | |||
| Live web search experiments for the rest of us | | BIBAK | Full-Text | 1265-1268 | |
| Timothy Jones; David Hawking; Ramesh Sankaranarayana | |||
| There are significant barriers to academic research into user Web search
preferences. Academic researchers are unable to manipulate the results shown by
a major search engine to users and would have no access to the interaction data
collected by the engine. Our initial approach to overcoming this was to ask
participants to submit queries to an experimental search engine rather than
their usual search tool. Over several different experiments we found that
initial user buy-in was high but that people quickly drifted back to their old
habits and stopped contributing data. Here, we report our investigation of
possible reasons why this occurs. An alternative approach is exemplified by the
Lemur browser toolbar, which allows local collection of user interaction data
from search engine sessions, but does not allow result pages to be modified. We
will demonstrate a new Firefox toolbar that we have developed to support
experiments in which search results may be arbitrarily manipulated. Using our
toolbar, academics can set up the experiments they want to conduct, while
collecting (subject to human experimentation guidelines) queries, clicks and
dwell times as well as optional explicit judgments. Keywords: browser extensions, implicit measures, web search | |||
| Debugging standard document formats | | BIBAK | Full-Text | 1269-1272 | |
| Nabil Layaida; Pierre Geneves | |||
| We present a tool for helping XML schema designers achieve a high level of
quality in their specifications.
The tool allows one to analyze relations between classes of XML documents and formally prove them. For instance, the tool can be used to check forward and backward compatibilities of recommendations. When such a relation does not hold, the tool allows one to identify the reasons and reports detailed counter-examples that exemplify the problem. For this purpose, the tool relies on recent advances in logic-based automated theorem proving techniques that allow for efficient reasoning on very large sets of XML documents. We believe this tool can be of great value for standardization bodies that define specifications using various XML type definition languages (such as W3C specifications), and are concerned with quality assurance for their normative recommendations. Keywords: compatibility, format, schema, xml | |||
| PageSense: style-wise web page advertising | | BIBAK | Full-Text | 1273-1276 | |
| Lusong Li; Tao Mei; Xiang Niu; Chong-Wah Ngo | |||
| This paper presents an innovative style-wise advertising platform for web
pages. Web page "style" mainly refers to visual effects, such as color and
layout. Unlike popular ad networks such as Google AdSense, which require
publishers to change the original structure of their pages and define the
position and style of the embedded ads manually, style-wise page advertising
aims to automatically deliver style-consistent ads at proper positions within
the web page, without breaking the layout of the original page. Our system is
motivated by the fact that almost 90% of web pages contain blank regions without
any content. Given a web page with some blank regions, style-wise page
advertising is able to detect the suitable blank areas for advertising, rank
the ads according to the semantic relevance and web page style, and embed
relevant ads into these nonintrusive areas. Style-wise page advertising
represents one of the first attempts towards contextual advertising which
enables the publishers to save effort when applying online advertising
services. Keywords: multimodal relevance, web page advertising, web page style | |||
| Web-based framework for spatiotemporal screen real estate management of interactive public displays | | BIBAK | Full-Text | 1277-1280 | |
| Tomas Linden; Tommi Heikkinen; Timo Ojala; Hannu Kukka; Marko Jurmu | |||
| In this paper we present a web-based framework for spatiotemporal screen
real estate management of interactive public displays. The framework
facilitates dynamic partitioning of the screen real estate into virtual screens
assigned for multiple concurrent web applications. The framework is utilized in
the implementation of so-called UBI-hotspot, which provides various information
services via different interaction modalities including mobile. The framework
facilitates seamless integration of third party web applications residing
anywhere in the public Internet into the UBI-hotspot, thus catering for a
scalable and open architecture. We report the deployment of a network of indoor
and outdoor UBI-hotspots at downtown Oulu, Finland. The quantitative data on
the usage of the UBI-hotspots implicitly speaks in favor of the practical
applicability of the framework. Keywords: ubi-hotspot, ubiquitous computing, urban computing | |||
| Structured audio podcasts via web text-to-speech system | | BIBAK | Full-Text | 1281-1284 | |
| Giulio Mori; Maria Claudia Buzzi; Marina Buzzi; Barbara Leporini | |||
| Audio podcasting is increasingly present in the educational field and is
especially appreciated as a ubiquitous/pervasive tool ("anywhere, anytime, at
any pace") for acquiring or expanding knowledge. We designed and implemented a
Web-based Text To Speech (TTS) system for automatic generation of a set of
structured audio podcasts from a single text document. The system receives a
document in input (doc, rtf, or txt), and in output provides a set of audio
files that reflect the document's internal structure (one mp3 file for each
document section), ready to be downloaded on portable mp3 players. Structured
audio files are useful for everyone but are especially appreciated by blind
users, who must explore content aurally. Fully accessible to the blind, our
system offers WAI-ARIA-based Web interfaces for easy navigation and interaction
via screen reader and voice synthesizer, and produces a set of accessible audio
files for Rockbox mp3 players (mp3 and talk files), allowing blind users to
also listen to naturally spoken file names (instead of their spelled-out
strings).
In this demo, we will show how the system works when a user interacts via screen reader and voice synthesizer, showing the interaction with both our Web-based system and with an mp3 player. Keywords: audio podcasting, blind, e-learning, mp3 files, TTS, WAI-ARIA | |||
| ourSpaces: linking provenance and social data in a virtual research environment | | BIBAK | Full-Text | 1285-1288 | |
| Richard Reid; Edoardo Pignotti; Peter Edwards; Adele Laing | |||
| Web-based Virtual Research Environments (VREs) have been proposed as one way
in which e-Science tools can be deployed to support and enhance the research
process. We are exploring the use of Linked Data in combination with the Open
Provenance Model (OPM) and Social Web concepts to facilitate interactions
between people and data in the context of a VRE. In this demo we present the
ourSpaces VRE and outline the technologies used to link together provenance,
research artefacts, projects, geographical locations and social data in the
context of interdisciplinary research. Keywords: e-science, linked data, provenance, vre | |||
| Diversifying landmark image search results by learning interested views from community photos | | BIBAK | Full-Text | 1289-1292 | |
| Yuheng Ren; Mo Yu; Xin-Jing Wang; Lei Zhang; Wei-Ying Ma | |||
| In this paper, we demonstrate a novel landmark photo search and browsing
system: Agate, which ranks landmark image search results considering their
relevance, diversity and quality. Agate learns from community photos the
aspects and related activities of a landmark that users are most interested in, and adaptively
generates a Table of Contents (TOC) as a summary of the attractions to
facilitate user browsing. Image search results are thus re-ranked with the
TOC so as to ensure a quick overview of the attractions of the landmarks. A
novel non-parametric TOC generation and set-based ranking algorithm, MoM-DPM
Sets, is proposed as the key technology of Agate. Experimental results based on
human evaluation show the effectiveness of our model and users' preference for
Agate. Keywords: landmark image search, set-based ranking, user interest modeling | |||
| New-web search with microblog annotations | | BIBAK | Full-Text | 1293-1296 | |
| Tom Rowlands; David Hawking; Ramesh Sankaranarayana | |||
| Web search engines discover indexable documents by recursively 'crawling'
from a seed URL. Their rankings take into account link popularity. While this
works well, it introduces biases towards older documents. Older documents are
more likely to be the target of links, while new documents with few, or no,
incoming links are unlikely to rank highly in search results.
We describe a novel system for 'new-Web' search based on links retrieved from the Twitter micro-blogging service. The Twitter service allows individuals, organisations and governments to rapidly disseminate very short messages to a wide variety of interested parties. When a Twitter message contains a URL, we use the Twitter message as a description of the URL's target. As Twitter is frequently used for discussion of current events, these messages offer useful, up-to-date annotations and instantaneous popularity readings for a small, but timely, portion of the Web. Our working system is simple and fast, and we believe it may offer a significant advantage in revealing new information on the Web that would otherwise be hidden from searchers. Beyond the basic system, we anticipate the Twitter messages may add supplementary terms for a URL, or add weight to existing terms, and that the reputation or authority of each message sender may serve to weight both annotations and query-independent popularity. Keywords: demonstration, information retrieval, microblogging, search, twitter, web
search | |||
| A framework for querying graph-based business process models | | BIBAK | Full-Text | 1297-1300 | |
| Sherif Sakr; Ahmed Awad | |||
| We present a framework for querying and reusing graph-based business process
models. The framework is based on a new visual query language for business
processes called BPMN-Q. The language addresses process definitions and, for
its concrete syntax, extends the standard BPMN visual notation for modeling
business processes. BPMN-Q is used to query process models by matching a
process model graph to a query graph. Moreover, the reuse framework is
enhanced with a semantic query expander component. This component provides
users with the flexibility to retrieve not only the process models that perfectly
match their queries but also models with high similarity. The query engine of
the framework is built on top of traditional RDBMS. A novel decomposition based
and selectivity-aware relational processing mechanism is employed to achieve an
efficient and scalable performance for graph-based BPMN-Q queries. Keywords: querying business process -- BPMN -- process models | |||
| Sig.ma: live views on the web of data | | BIBAK | Full-Text | 1301-1304 | |
| Giovanni Tummarello; Richard Cyganiak; Michele Catasta; Szymon Danielczyk; Renaud Delbru; Stefan Decker | |||
| We demonstrate Sig.ma, both a service and an end user application to access
the Web of Data as an integrated information space.
Sig.ma uses a holistic approach in which large-scale semantic web indexing, logic reasoning, data aggregation heuristics, ad hoc ontology consolidation, external services and responsive user interaction all play together to create rich entity descriptions. These consolidated entity descriptions then form the basis for embeddable data mashups, machine-oriented services as well as data browsing services. Finally, we discuss Sig.ma's distinctive characteristics and report on lessons learned and ideas it inspires. Keywords: aggregated search, RDFa, semantic web, web of data | |||
| Not so creepy crawler: easy crawler generation with standard xml queries | | BIBAK | Full-Text | 1305-1308 | |
| Franziska von dem Bussche; Klara Weiand; Benedikt Linse; Tim Furche; François Bry | |||
| Web crawlers are increasingly used for focused tasks such as the extraction
of data from Wikipedia or the analysis of social networks like last.fm. In
these cases, pages are far more uniformly structured than in the general Web
and thus crawlers can use the structure of Web pages for more precise data
extraction and more expressive analysis.
In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2). What sets nc2 apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers. We demonstrate nc2 with two applications: the first extracts data about cities from Wikipedia with a customizable set of attributes for selecting and reporting these cities. It illustrates the power of nc2 where data extraction from Wiki-style, fairly homogeneous knowledge sites is required. In contrast, the second use case demonstrates how easy nc2 makes even complex analysis tasks on social networking sites, here exemplified by last.fm. Keywords: data extraction, web crawler, web query, xml | |||
| MindFinder: image search by interactive sketching and tagging | | BIBAK | Full-Text | 1309-1312 | |
| Changhu Wang; Zhiwei Li; Lei Zhang | |||
| In this technical demonstration, we showcase the MindFinder system -- a
novel image search engine. Different from existing interactive image search
engines, most of which only provide image-level relevance feedback, MindFinder
enables users to sketch and tag query images at object level. By considering
the image database as a huge repository, MindFinder is able to help users
present and refine the initial thoughts in their minds, and finally turn those
thoughts into beautiful images. Multiple actions are enabled for users to
flexibly design their queries in a bilateral interactive manner by leveraging
the whole image database, including tagging, refining the query by dragging and
dropping objects from search results, as well as editing objects. After each
action, the search results will be updated in real time to provide users
up-to-date materials to further formulate the query. Through this deliberate but easy
query design, MindFinder not only enables users to present on
the query panel whatever they are imagining, but also returns the images most
similar to the picture in the user's mind. By scaling up the image database
to 10 million images, MindFinder has the potential to reveal whatever is in users' minds,
which is where the name MindFinder comes from. Keywords: MindFinder, interactive search, sketching, tagging | |||
| FormSys: form-processing web services | | BIBAK | Full-Text | 1313-1316 | |
| Ingo M. Weber; Hye-young Paik; Boualem Benatallah; Zifei Gong; Liangliang Zheng; Corren Vorwerk | |||
| In this paper we present FormSys, a Web-based system that service-enables
form documents. It offers two main services: filling in forms based on Web
services' incoming SOAP messages, and invoking Web services based on filled-in
forms. This can benefit individuals by reducing the number of often
repetitive form fields they have to complete manually in many scenarios. It can
also help organisations to remove the need for manual data entry by
automatically triggering business process implementations based on incoming
case data from filled-in forms. While the concept applies to forms of any type
of document, our implementation uses Adobe AcroForms due to its universal
applicability, availability of a usable API, and end-user appeal. In the demo,
we will show the two core functions, namely soap2pdf and pdf2soap, along with
use case applications of the services developed from real world scenarios.
Essentially, this work demonstrates how PDFs can be used as a channel for
interacting with Web services. Keywords: documents, processes, service-enabled forms, web services | |||
| Falconer: once SIOC meets semantic search engine | | BIBAK | Full-Text | 1317-1320 | |
| Gang Wu; Mengdong Yang; Ke Wu; Guilin Qi; Yuzhong Qu | |||
| Falconer is a SIOC
(Semantically-Interlinked Online Communities) application enhanced with a semantic Web search engine, designed to
demonstrate how the creation and reuse of
semantic Web data can be accelerated with easy-to-use user interfaces. In this process, semantic
Web search engines feed existing semantic data into the SIOC framework, where
new semantic data are composed by the community and indexed again by those
search engines. Compared to existing social (semantic) Web applications,
Falconer inherently conforms to the SIOC specification. It provides semantic-search-engine-based
user registration suggestions, automatic friend discovery, and semantic
annotation of forum post content. Another distinctive feature is that it
enables users to subscribe to any resource having a URI as the topic they are
interested in. The relationships among users, topics, and posts are further
visualized for analyzing the topic trends in the community. As all semantic
data are formatted in RDF and RDFa, they can be queried with SPARQL query
language. Keywords: SIOC, semantic search engine, social semantic web | |||
| Interactive image search by 2D semantic map | | BIBAK | Full-Text | 1321-1324 | |
| Hao Xu; Jingdong Wang; Xian-Sheng Hua; Shipeng Li | |||
| In this demo, we present a novel interactive image search system, image
search by 2D semantic map. This system enables users to indicate what semantic
concepts are expected to appear and even how these concepts are spatially
distributed in the desired images. To this end, we design an intuitive
interface for users to formulate a query in the form of a 2D semantic map, called
a concept map, by typing textual queries on a blank canvas. In the ranking
process, by interpreting each textual concept as a set of representative visual
instances, the concept map query is translated into a visual instance map,
which is then used for comparison with the images in the database. In addition, in
this demo, we also show an image search system with the simplest semantic map, a
2D color map, where the concepts are limited to colors. Keywords: color map, concept map, interactive image search | |||
| eduKEN: a tool for fine-grained video comment collection and analysis | | BIBAK | Full-Text | 1325-1328 | |
| Wai Gen Yee; Andrew Yates; Ophir Frieder; Armin Moehrle | |||
| An increasing amount of Web information is in video format. Today's search
technology allows videos to be found using graphical features and textual
descriptions. However, the information gleaned from video features is coarse,
while textual descriptions are often short and fail to capture the precise
content of videos. We hypothesize that user comments contain supplemental
information that effectively describes the content of a video. This
information, once extracted, can be applied to a search engine index to improve
video search accuracy.
A complete comment-based indexing system must encourage users to post comments and be able to analyze comment data. To support research in the relevant areas of interface design, text analysis, and information retrieval, we present eduKEN, a flexible Web-based tool that allows comment collection, analysis, and search. To support the collection of real-world comment data in an educational setting, eduKEN includes an additional user interface designed for classroom use. Keywords: annotations, search | |||
| ZoomRDF: semantic fisheye zooming on RDF data | | BIBAK | Full-Text | 1329-1332 | |
| Kang Zhang; Haofen Wang; Duc Thanh Tran; Yong Yu | |||
| With the development of the Semantic Web in recent years, an increasing amount
of semantic data has been created in the form of the Resource Description Framework
(RDF). Current visualization techniques help users quickly understand the
underlying RDF data by displaying its structure in an overview. However,
detailed information can only be accessed by further navigation. An alternative
approach is to display the global context as well as the local details
simultaneously in a unified view. This view supports the visualization and
navigation on RDF data in an integrated way. In this demonstration, we present
ZoomRDF, a framework that: i) adapts a space-optimized visualization algorithm
for RDF, which allows more resources to be displayed, thus maximizes the
utilization of display space, ii) combines the visualization with a fisheye
zooming concept, which assigns more space to some individual nodes while still
preserving the overview structure of the data, iii) considers both the
importance of resources and the user interaction on them, which offers more
display space to those elements the user may be interested in. We implement the
framework based on the Gene Ontology and demonstrate that it facilitates tasks
like RDF data exploration and editing. Keywords: RDF data, information visualization, semantic web | |||
| Optimizing user interaction for web-based mobile tasks | | BIBAK | Full-Text | 1333-1336 | |
| Dong Zhou; Ajay Chander; Hiroshi Inamura | |||
| This paper describes the design and prototype implementation of MIntOS
(Mobile Interaction Optimization System), a system for improving mobile
interaction in web-based activities. MIntOS monitors users' interactions both
for gathering interaction history and for the runtime construction of
interaction context. A simple approach based on interaction burstiness is used
to break interaction sequences into Trails, which approximate user tasks. Such
Trails are used to generate rules for online, context-sensitive prediction of
future interaction sequences. Predicted user interaction sequences are then
optimized to reduce the amount of user input and user wait time using
techniques such as interaction short-cuts, automatic text copying and
form-filling, as well as page pre-fetching. Such optimized interaction
sequences are, at real-time, recommended to the user through UI enhancements in
a non-intrusive manner. Keywords: interaction trail, trail optimization, user interaction monitoring,
web-based mobile tasks | |||
| Search is dead!: long live search | | BIBAK | Full-Text | 1337-1338 | |
| Andrei Broder; Elizabeth F. Churchill; Marti Hearst; Barney Pell; Prabhakar Raghavan; Andrew Tomkins | |||
| Back in the heady days of 1999 and WWW8 (Toronto) we held a panel titled
"Finding Anything in the Billion Page Web: Are Algorithms the Key?" In
retrospect the answer to this question seems laughably obvious -- the search
industry has burgeoned on a foundation of algorithms, cloud computing and
machine learning. As we move into the second decade of this millennium, we are
confronted with a dizzying array of new paradigms for finding content,
including social networks and location-based search and advertising. This panel
pulls together senior experts from academia and the major search principals to
debate whether search will continue to look anything like the
2-keywords-give-10-blue-links paradigm that Google has popularized. What do
emerging approaches and paradigms -- natural language search, social search,
location-based search -- mean for the future of search in general? Keywords: algorithms, location-based search, search, social search | |||
| Video search: are algorithms all we need? | | BIBAK | Full-Text | 1339-1340 | |
| Jeff Ubois; Jamie Davidson; Marko Grobelnik; Paul Over; Hans Westerhof | |||
| This panel will debate various approaches to improving video search, and
explore how professional cataloguing, crowd sourced metadata, and improvements
in search algorithms will evolve over the next ten years. Panelists will
explore the needs of large scale video archives, and compare these against the
current capabilities of video search. Keywords: algorithms, archives, design, search, video | |||
| What the web can't do | | BIBAK | Full-Text | 1341-1342 | |
| David A. Shamma; Seth Fitzsimmonds; Joe Gregorio; Adam Hupp; Ramesh Jain; Kevin Marks | |||
| This panel discusses how polling in the HTTP protocol affects how we are
building the next generation of the web and its applications. As other
technologies (HTML, Javascript, etc.) move forward, we ask should the web's
protocol also evolve or is it sufficient for the web to continue through just
GET and POST? Keywords: http, rest, streaming, www, xmpp | |||
| Web search engine metrics: (direct metrics to measure user satisfaction) | | BIBAK | Full-Text | 1343-1344 | |
| Ali Dasdan; Kostas Tsioutsiouliklis; Emre Velipasaoglu | |||
| Search engines are important resources for finding information on the Web.
They are also important for publishers and advertisers to present their content
to users. Thus, user satisfaction is key and must be quantified. In this
tutorial, we give a practical review of web search metrics from a user
satisfaction point of view. We cover metrics for relevance, comprehensiveness,
coverage, diversity, discovery freshness, content freshness, and presentation.
We will also describe how these metrics can be mapped to proxy metrics for the
stages of a generic search engine pipeline. The practitioners can apply these
metrics readily and the researchers can get motivation for new problems to work
on, especially in formalizing and refining metrics. Keywords: comprehensiveness metrics, content freshness metrics, coverage metrics,
discovery freshness metrics, diversity metrics, presentation metrics, relevance
metrics, web metrics, web search | |||
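As a concrete reference point for one of the relevance metrics listed above, a minimal NDCG@k computation looks as follows (linear gains are used here; other gain functions are common, and this is not tied to any particular search pipeline).

```python
# NDCG@k: discounted cumulative gain of a ranking, normalized by the ideal DCG.
import math

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 0, 1], k=4), 3))  # ranking quality vs. ideal ordering
```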
| Enterprise and desktop search | | BIBAK | Full-Text | 1345-1346 | |
| Pavel Dmitriev; Pavel Serdyukov; Sergey Chernov | |||
| With the growing amount of information on users' desktops and increasing
scale and complexity of intranets, Enterprise and Desktop Search are becoming
two increasingly important Information Retrieval applications. While the
challenges arising there are not completely different from those that the web
community has faced for years, advanced web search solutions are often unable
to address them properly. In this tutorial we give a research perspective on
distinctive features of both Enterprise and Desktop Search, explain typical
search scenarios, and review existing ranking techniques and algorithms. Keywords: desktop search, enterprise search, expert search, exploratory search, user
feedback | |||
| How to consume linked data on the web: tutorial description | | BIBAK | Full-Text | 1347-1348 | |
| Olaf Hartig; Juan Sequeda; Jamie Taylor; Patrick Sinclair | |||
| In the past two years, the amount of data published in RDF and following the
Linked Data principles has increased dramatically. Every day, people are
publishing datasets as Linked Data. However, applications that consume Linked
Data are not yet mainstream. To overcome this issue, we present a beginners'
tutorial on consuming Linked Data. We will discuss existing techniques by which
users can currently consume Linked Data and use it in their
applications. Keywords: application, consuming, linked data, semantic web, web of data, web of
linked data | |||
| Optimal marketing and pricing over social networks | | BIBAK | Full-Text | 1349-1350 | |
| Nicole Immorlica; Vahab S. Mirrokni | |||
| We discuss the use of social networks in implementing viral marketing
strategies. In the first part of this tutorial, we study influence maximization
or how the structure of the social network affects the spread of behaviors and
technologies. In the second part, we then consider how one might monopolize
these natural processes to generate revenue in a revenue maximization setting. Keywords: influence, revenue maximization, social networks | |||
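Influence maximization is typically studied on top of a diffusion model; a standard choice in that literature is the independent cascade. The sketch below simulates such cascades and estimates the expected spread of a seed set by Monte Carlo averaging. It is a generic illustration, not the tutorial's specific formulation.

```typescript
type Graph = Map<string, string[]>; // node -> out-neighbors

// One independent cascade: each newly activated node gets a single chance to
// activate each inactive neighbor with probability p.
function simulateCascade(graph: Graph, seeds: string[], p: number): Set<string> {
  const active = new Set<string>(seeds);
  let frontier = [...seeds];
  while (frontier.length > 0) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of graph.get(node) ?? []) {
        if (!active.has(neighbor) && Math.random() < p) {
          active.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return active;
}

// Expected spread of a seed set, estimated by repeated simulation.
function expectedSpread(graph: Graph, seeds: string[], p: number, runs = 1000): number {
  let total = 0;
  for (let i = 0; i < runs; i++) total += simulateCascade(graph, seeds, p).size;
  return total / runs;
}

const g: Graph = new Map([
  ["a", ["b", "c"]],
  ["b", ["d"]],
  ["c", ["d"]],
  ["d", []],
]);
console.log(expectedSpread(g, ["a"], 0.5));
```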
| Web search/browse log mining: challenges, methods, and applications | | BIBAK | Full-Text | 1351-1352 | |
| Daxin Jiang; Jian Pei; Hang Li | |||
| Huge amounts of search and browse log data have been accumulated in various
search engines. Such massive search/browse log data, on the one hand, provides
great opportunities to mine the wisdom of crowds and improve Web search as well
as online advertisement. On the other hand, designing effective and efficient
algorithms and tools to clean, model, and process large scale log data presents
great challenges.
In this tutorial, we give a systematic survey of the applications, challenges, fundamental principles, and state-of-the-art methods of mining large-scale search and browse log data. We start with an introduction to search and browse log data and an overview of various log mining applications. Then, we focus on four popular areas of log mining applications, namely query understanding, document understanding, query-document matching, and user understanding. For each area, we review the major tasks, analyze the challenges, and exemplify several representative solutions. Finally, we discuss several new directions in search/browse log mining. The tutorial slides will be available at the authors' homepages after the tutorial is presented. Keywords: log mining, search and advertisement applications, search and browse logs | |||
| Applications of open search tools | | BIBAK | Full-Text | 1353-1354 | |
| Rosie Jones; Ted Drake | |||
| It costs about 300 million dollars just to build a search system that scales to the
web. Web services which open up web search results to the public allow
academics, developers, and entrepreneurs to achieve instant web search parity
for free, enabling them to focus on building their additional secret sauce on
top to create even richer, more relevant services. For example, a social network
could leverage open search and its data about users to personalize web
search. Additionally, one of the best ways to gather data about web search
behavior is to build your own search system. Prototype IR and web search
systems based on open search can be used to gather user interaction data and
test the applicability of research ideas. Open source tools like Lucene and
Nutch, and open search services such as those from major search engines, can greatly help
developers implement these types of systems quickly, allowing for fast
production and evaluation. We will give detailed overviews of the popular open
search tools, showcasing examples of search and IR algorithms and systems
implemented using them, as well as discussing how evaluation can be performed. Keywords: open search tools, open source search | |||
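A sketch of the "secret sauce on top" pattern described above: take results from an open web search service and rerank them with an application-specific signal. The endpoint, query parameter, and response shape below are hypothetical stand-ins, not the API of any particular search provider.

```typescript
interface SearchResult {
  url: string;
  title: string;
  rank: number; // position returned by the underlying engine, starting at 1
}

// Call a (hypothetical) open search service for the base ranking.
async function openSearch(query: string): Promise<SearchResult[]> {
  const res = await fetch(`https://example.org/opensearch?q=${encodeURIComponent(query)}`);
  if (!res.ok) throw new Error(`search failed: ${res.status}`);
  return (await res.json()) as SearchResult[];
}

// Example "secret sauce": blend the engine's rank with a domain preference score
// supplied by your own application (e.g. from a social network's user data).
function rerank(results: SearchResult[], preferredDomains: Set<string>): SearchResult[] {
  const score = (r: SearchResult) =>
    1 / r.rank + (preferredDomains.has(new URL(r.url).hostname) ? 0.5 : 0);
  return [...results].sort((a, b) => score(b) - score(a));
}

openSearch("www 2010 tutorials")
  .then((results) => console.log(rerank(results, new Set(["www2010.org"]))))
  .catch(console.error);
```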
| Introduction to social recommendation | | BIBAK | Full-Text | 1355-1356 | |
| Irwin King; Michael R. Lyu; Hao Ma | |||
| With the exponential growth of information generated on the World Wide Web,
Social Recommendation has recently emerged as a hot research topic.
Social Recommendation forms a specific type of information filtering technique
that attempts to suggest information (blogs, news, music, travel plans, web
pages, images, tags, etc.) that is likely to interest users. Social
Recommendation involves the investigation of collective intelligence by using
computational techniques such as machine learning, data mining, natural
language processing, etc. on social behavior data collected from blogs, wikis,
recommender systems, question & answer communities, query logs, tags, etc.
from areas such as social networks, social search, social media, social
bookmarks, social news, social knowledge sharing, and social games. In this
tutorial, we will introduce Social Recommendation and elaborate on how the
various characteristics and aspects are involved in the social platforms for
collective intelligence. Moreover, we will discuss the challenging issues
involved in Social Recommendation in the context of theory and models of social
networks, methods to improve recommender systems using social contextual
information, ways to deal with partial and incomplete information in the social
context, and scalability and algorithmic issues with social computational
techniques. Keywords: collaborative filtering, collective intelligence, recommender systems,
social network model and theories, social recommendation | |||
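One textbook-style way to use social contextual information in a recommender, of the kind this tutorial surveys, is to predict a user's rating as a trust-weighted average of their friends' ratings. The sketch below is a generic illustration under that assumption, not a method from the tutorial itself.

```typescript
type Ratings = Map<string, Map<string, number>>; // user -> (item -> rating)
type Trust = Map<string, Map<string, number>>;   // user -> (friend -> trust in [0, 1])

// Predict a rating from the ratings of trusted friends; undefined if no friend rated the item.
function predictFromFriends(
  user: string,
  item: string,
  ratings: Ratings,
  trust: Trust,
): number | undefined {
  let weighted = 0;
  let totalTrust = 0;
  for (const [friend, t] of trust.get(user) ?? new Map<string, number>()) {
    const r = ratings.get(friend)?.get(item);
    if (r !== undefined) {
      weighted += t * r;
      totalTrust += t;
    }
  }
  return totalTrust > 0 ? weighted / totalTrust : undefined;
}

const ratings: Ratings = new Map([
  ["bob", new Map([["movie-1", 4]])],
  ["carol", new Map([["movie-1", 2]])],
]);
const trust: Trust = new Map([["alice", new Map([["bob", 0.9], ["carol", 0.3]])]]);
console.log(predictFromFriends("alice", "movie-1", ratings, trust)); // 3.5
```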
| Linking content in unstructured sources | | BIBAK | Full-Text | 1357-1358 | |
| Marie-Francine Moens | |||
| This tutorial focuses on the task of automated information linking in text
and multimedia sources. In any task where information is fused from different
sources, this linking is a necessary step. To solve the problem we borrow
methods from computational linguistics, computer vision and data mining.
Although the main focus is on finding equivalence relations in the sources, the
tutorial also opens perspectives on the recognition of other relation types. Keywords: content alignment and linking | |||
| RESTful web services: principles, patterns, emerging technologies | | BIBAK | Full-Text | 1359-1360 | |
| Cesare Pautasso; Erik Wilde | |||
| Recent technology trends in Web services indicate that a solution
eliminating the perceived complexity of the WS-* standard technology stack may
be in sight: advocates of Representational State Transfer (REST) have come to
believe that their ideas explaining why the World Wide Web works are just as
applicable to solving enterprise application integration problems and to
radically simplifying the plumbing required to implement a Service-Oriented
Architecture (SOA). In this tutorial we give an introduction to the REST
architectural style as the foundation for RESTful Web services. The tutorial
starts from the basic design principles of REST and how they are applied to
service oriented computing. Service-orientation concentrates on identifying
self-contained units of functionality, which should then be exposed as easily
reusable and repurposable services. This tutorial focuses not on the
identification of those units, but on how to design the services representing
them. We explain how decisions on the SOA level already shape the architectural
style that will be used for the eventual IT architecture, and how the SOA
process itself has to be controlled to yield services which can then be
implemented RESTfully. We do not claim that REST is the only architectural
style that can be used for SOA design, but we do argue that it does have
distinct advantages for loosely coupled services and massive scale, and that
any SOA approach already has to be specifically RESTful on the business level
to yield meaningful input for IT architecture design. Keywords: REST, SOA, patterns, web services | |||
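A minimal sketch of the REST uniform interface this tutorial builds on: a resource identified by a URI, manipulated through standard HTTP verbs and representations. The `/orders` resource and its JSON representation are hypothetical examples, not part of the tutorial material.

```typescript
interface Order {
  customer: string;
  items: string[];
}

const base = "https://example.org/orders";

// Create a new order resource; the server assigns its URI and returns it in Location.
async function createOrder(order: Order): Promise<string> {
  const res = await fetch(base, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(order),
  });
  return res.headers.get("Location") ?? "";
}

// Retrieve, replace, and delete use GET, PUT, and DELETE on the resource URI itself.
const getOrder = (uri: string) => fetch(uri).then((r) => r.json() as Promise<Order>);
const replaceOrder = (uri: string, order: Order) =>
  fetch(uri, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(order),
  });
const deleteOrder = (uri: string) => fetch(uri, { method: "DELETE" });
```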
| Implementing the media fragments URI specification | | BIBAK | Full-Text | 1361-1364 | |
| Davy Van Deursen; Raphaël Troncy; Erik Mannens; Silvia Pfeiffer; Yves Lafon; Rik Van de Walle | |||
| In this paper, we describe two examples of implementations of the Media
Fragments URI specification, which is currently being developed by the W3C Media
Fragments Working Group. The group's mission is to create standard addressing
schemes for media fragments on the Web using Uniform Resource Identifiers
(URIs). We describe two scenarios to illustrate the implementations. More
specifically, we show how User Agents (UA) will either be able to resolve media
fragment URIs without help from the server, or will make use of a media
fragments-aware server. Finally, we present some ongoing discussions and issues
regarding the implementation of the Media Fragments specification. Keywords: media fragments, video accessibility, video url | |||
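The Media Fragments URI specification lets a URI address a temporal sub-clip, for example `http://example.org/video.ogv#t=10,20` for seconds 10 through 20. The sketch below parses only that simple seconds-based form (the specification also allows open-ended ranges, clock-style times, and other dimensions such as spatial regions) and is not the working group's reference implementation.

```typescript
interface TemporalFragment {
  start: number; // seconds
  end?: number;  // seconds; undefined means "until the end of the media"
}

// Parse a simple seconds-based temporal fragment (#t=start,end) from a media URI.
function parseTemporalFragment(uri: string): TemporalFragment | undefined {
  const hash = new URL(uri).hash.replace(/^#/, "");
  const tParam = new URLSearchParams(hash).get("t");
  if (!tParam) return undefined;
  const [start, end] = tParam.split(",");
  return {
    start: start ? Number(start) : 0,
    end: end ? Number(end) : undefined,
  };
}

console.log(parseTemporalFragment("http://example.org/video.ogv#t=10,20"));
// -> { start: 10, end: 20 }
```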
| Exposing audio data to the web: an API and prototype | | BIBAK | Full-Text | 1365-1368 | |
| David Humphrey; Corban Brook; Alistair MacDonald | |||
| The HTML5 specification introduces the audio and video media elements, and
with them the opportunity to change the way media is integrated on the web. The
current HTML5 media API provides ways to play and get limited information about
audio and video, but no way to programmatically access or create such media. In
this paper we present an enhanced API for these media elements, as well as
details about a Mozilla Firefox implementation created by the authors, which
allows web developers to read and write raw audio data. Keywords: audio, fft, firefox, html5 | |||
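A sketch of reading and writing raw samples with the experimental Firefox extension that grew out of this work, later documented by Mozilla as the Audio Data API. The names below (MozAudioAvailable, frameBuffer, mozSetup, mozWriteAudio) follow that later documentation and may differ from the exact prototype in the paper; the API was vendor-specific and has since been superseded by the Web Audio API.

```typescript
// Assumes an <audio> element is present on the page.
const source = document.querySelector("audio") as HTMLAudioElement;

// Reading: each MozAudioAvailable event carries a Float32Array of decoded samples.
source.addEventListener("MozAudioAvailable", (event: Event) => {
  const frameBuffer: Float32Array = (event as any).frameBuffer;
  let sum = 0;
  for (const sample of frameBuffer) sum += sample * sample;
  console.log("RMS of this frame:", Math.sqrt(sum / frameBuffer.length));
});

// Writing: a second, scripted audio element is configured with mozSetup and fed
// raw samples with mozWriteAudio, here a 440 Hz sine tone (non-standard API).
const sink = new Audio() as any;
sink.mozSetup(1, 44100); // one channel at 44.1 kHz
const samples = new Float32Array(44100);
for (let i = 0; i < samples.length; i++) {
  samples[i] = Math.sin((2 * Math.PI * 440 * i) / 44100);
}
sink.mozWriteAudio(samples);
```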
| Enabling WebGL | | BIBAK | Full-Text | 1369-1370 | |
| Catherine Leung; Andor Salga | |||
| WebGL leverages the power of OpenGL to present accelerated 3D graphics on a
webpage. The ability to put hardware-accelerated 3D content in the browser will
provide a means for the creation of new web-based applications that were
previously the exclusive domain of the desktop environment. It will also allow
the inclusion of features that standalone 3D applications do not have. While
WebGL succeeds in bringing the power and low-level API of OpenGL to the
browser, it also expects a lot from web developers, who are used to the DOM and
JavaScript libraries like jQuery. This paper will look at how mid-level APIs
can help web developers create unique 3D content that is more than just
a duplicate of a standalone desktop application on a web page. We will present
one such web application named Motionview, built with C3DL, that provides a new
means for artists and motion capture studios to communicate with each other. We
will also highlight some upcoming project ideas that make use of 3D browser
technology in a way that would not have been possible in a desktop environment. Keywords: 3d, browser, canvas, javascript, motion capture, processing.js, WebGL | |||
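The low-level starting point the paper contrasts with mid-level libraries such as C3DL is the raw WebGL API: obtain a rendering context from a canvas and clear it, as sketched below. At the time, the context was typically requested as "experimental-webgl"; modern browsers accept "webgl". The canvas id is an assumption.

```typescript
const canvas = document.getElementById("scene") as HTMLCanvasElement;
const gl = (canvas.getContext("webgl") ??
  canvas.getContext("experimental-webgl")) as WebGLRenderingContext | null;

if (!gl) {
  console.error("WebGL is not available in this browser");
} else {
  gl.viewport(0, 0, canvas.width, canvas.height); // map clip space to the full canvas
  gl.clearColor(0.1, 0.1, 0.1, 1.0);              // dark grey background
  gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
  // Everything beyond this point (shaders, buffers, draw calls) is what mid-level
  // libraries such as C3DL wrap for web developers.
}
```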
| The spoken web: software development and programming through voice | | BIBAK | Full-Text | 1371-1374 | |
| Arun Kumar; Sheetal K. Agarwal; Priyanka Manwani | |||
| It has been a constant aim of computer scientists, programming language
designers and practitioners to raise the level of programming abstractions and
bring them as close to the user's natural context as possible. These efforts
began with the transition from machine code programming to assembly
language programming, from there to high-level procedural languages, followed
by object-oriented programming. Nowadays, service-oriented software development
and composition are the norm.
There have also been notable efforts, such as the Alice system from CMU, to simplify the programming experience through the use of 3D virtual worlds. The holy grail has been to enable non-technical users, such as kids, to understand and pick up programming and software development easily. We present a novel approach to software development that lets people use their voice to program or create new software through composition. We demonstrate some basic programming tasks achieved by simply talking to a system over an ordinary phone. Such programs constructed by talking can be created in the user's local language and do not require IT literacy, or even literacy, as a prerequisite. We believe this approach will have a deep impact on software development, especially the development of web-based software, in the very near future. Keywords: programming by voice, spoken web, voicesites | |||
| 'Follow me': a web-based, location-sharing architecture for large, indoor environments | | BIBAK | Full-Text | 1375-1378 | |
| Polychronis Ypodimatopoulos; Andrew Lippman | |||
| We leverage the ubiquity of Bluetooth-enabled devices and propose a
decentralized, web-based architecture that allows users to share their location
by following each other in the style of Twitter. We demonstrate a prototype
operating in a large building that generates a dataset of detected
Bluetooth devices at a rate of 30 new devices per day, including the
location where each device was last detected. Users then query the dataset using
their unique Bluetooth ID and share their current location with their followers
by means of unique URIs that they control. Our separation between producers
(the building) and consumers (the users) of Bluetooth device location data
allows us to create socially-aware applications that respect users' privacy
while limiting the software necessary to run on mobile devices to just a web
browser. Keywords: agents, bluetooth, location sharing | |||
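A sketch of the consumer side of this architecture: a user queries the building's sighting dataset with their Bluetooth ID and publishes the result at a location URI they control, where followers can retrieve it. The endpoints and JSON shapes are hypothetical; the paper does not specify them.

```typescript
interface Sighting {
  room: string;
  lastSeen: string; // ISO timestamp of the last detection
}

// Ask the building (the producer) where a given Bluetooth device was last detected.
async function whereAmI(bluetoothId: string): Promise<Sighting | undefined> {
  const res = await fetch(
    `https://building.example.org/sightings/${encodeURIComponent(bluetoothId)}`,
  );
  return res.ok ? ((await res.json()) as Sighting) : undefined;
}

// Publish the sighting to a URI the user controls, where followers look it up.
async function shareWithFollowers(myLocationUri: string, sighting: Sighting): Promise<void> {
  await fetch(myLocationUri, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(sighting),
  });
}

whereAmI("00:1A:2B:3C:4D:5E")
  .then((s) => s && shareWithFollowers("https://example.org/people/alice/location", s))
  .catch(console.error);
```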
| IBM's jazz integration architecture: building a tools integration architecture and community inspired by the web | | BIBAK | Full-Text | 1379-1382 | |
| Scott Rich | |||
| In this paper, we describe our experience in the Jazz project, beginning
from a "Classical" repository- and Java-centric design and evolving towards an
architecture which borrows heavily from the architecture of the Web. Along the
way, we formed an open community to collaborate on adapting this architecture
to various tool domains. Finally, we discuss our experience delivering the
first generation of tools built and integrated using these techniques. Keywords: representational state transfer, rest, software development | |||
| TWC data-gov corpus: incrementally generating linked government data from data.gov | | BIBAK | Full-Text | 1383-1386 | |
| Li Ding; Dominic DiFranzo; Alvaro Graves; James R. Michaelis; Xian Li; Deborah L. McGuinness; James A. Hendler | |||
| The Open Government Directive is making US government data available via
websites such as Data.gov for public access. In this paper, we present a
Semantic Web-based approach that incrementally generates Linked Government Data
(LGD) for the US government. Focusing on the trade-off between high-quality
LGD generation (requiring non-trivial human expert input) and massive LGD
generation (requiring low human processing cost), our work is highlighted by
the following features: (i) supporting low-cost and extensible LGD publishing
for massive government data; (ii) using Social Semantic Web (Web 3.0)
technologies to incrementally enhance published LGD via crowdsourcing; and
(iii) facilitating mash-ups by declaratively reusing cross-dataset mappings
which usually are hard-coded in applications. Keywords: crowdsourcing, linked data, linked government data, open government, social
semantic web | |||