| WWW at 15 years: looking forward | | BIBA | Full-Text | 1 | |
| Tim Berners-Lee | |||
| The key property of the WWW is its universality: One must be able to access
it whatever the hardware device, software platform, and network one is using,
and despite the disabilities one might have, and whether one is in a
"developed" or "developing" country; it must support information of any
language, culture, quality, medium, and field without discrimination so that a
hypertext link can go anywhere; it must support information intended for
people, and that intended for machine processing. The Web architecture
incorporated various choices which support these axes of universality.
Currently the architecture and the principles are being exploited in the recent Mobile Web initiative in W3C to promote content that can be accessed optimally from conventional computers and mobile devices. Exciting new areas arise every few months as possible Semantic Web flagship applications. As new areas burst forth, the fundamental principles remain important and are extended and adjusted. At the same time, the principles of openness and consensus among international stakeholders which the WWW consortium employs for new technology are adjusted but remain ever-important. | |||
| Semantic similarity between search engine queries using temporal correlation | | BIBAK | Full-Text | 2-11 | |
| Steve Chien; Nicole Immorlica | |||
| We investigate the idea of finding semantically related search engine
queries based on their temporal correlation; in other words, we infer that two
queries are related if their popularities behave similarly over time. To this
end, we first define a new measure of the temporal correlation of two queries
based on the correlation coefficient of their frequency functions. We then
conduct extensive experiments using our measure on two massive query streams
from the MSN search engine, revealing that this technique can discover a wide
range of semantically similar queries. Finally, we develop a method of
efficiently finding the highest correlated queries for a given input query
using far less space and time than the naive approach, making real-time
implementation possible. Keywords: query stream analysis, search engines, semantic similarity among queries | |||
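As context for the measure above, here is a minimal Python sketch of a temporal-correlation score: the Pearson correlation coefficient of two queries' frequency series. The function name, the toy frequencies, and the exact normalization are illustrative assumptions, not the authors' implementation.

```python
import math

def correlation(f, g):
    """Pearson correlation coefficient of two equally long frequency series.
    A hedged sketch of the kind of temporal-correlation measure described
    above; the exact normalization used in the paper may differ."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    cov = sum((x - mean_f) * (y - mean_g) for x, y in zip(f, g))
    var_f = sum((x - mean_f) ** 2 for x in f)
    var_g = sum((y - mean_g) ** 2 for y in g)
    if var_f == 0 or var_g == 0:
        return 0.0
    return cov / math.sqrt(var_f * var_g)

# daily query frequencies (fractions of total traffic), purely illustrative
freq_flowers = [0.001, 0.002, 0.009, 0.002, 0.001, 0.001, 0.001]
freq_roses   = [0.002, 0.003, 0.012, 0.003, 0.002, 0.002, 0.001]
print(correlation(freq_flowers, freq_roses))  # close to 1.0 -> likely related
```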
| Duplicate detection in click streams | | BIBAK | Full-Text | 12-21 | |
| Ahmed Metwally; Divyakant Agrawal; Amr El Abbadi | |||
| We consider the problem of finding duplicates in data streams. Duplicate
detection in data streams is utilized in various applications including fraud
detection. We develop a solution based on Bloom Filters [9], and discuss the
space and time requirements for running the proposed algorithm in the contexts
of both sliding and landmark stream windows. We run a comprehensive set of
experiments, using both real and synthetic click streams, to evaluate the
performance of the proposed solution. The results demonstrate that the proposed
solution yields extremely low error rates. Keywords: advertising networks, approximate queries, data streams, duplicate detection | |||
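A hedged illustration of the building block the paper above starts from: a plain Bloom filter used to flag likely-duplicate clicks. The paper adapts this idea to sliding and landmark stream windows, which this toy sketch does not attempt; the class and identifier names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; the paper builds on this idea but adapts it to
    sliding and landmark windows, which this sketch does not attempt."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# flag clicks whose identifier was (probably) seen before
seen = BloomFilter()
for click_id in ["u1:ad42", "u2:ad17", "u1:ad42"]:
    print(click_id, "duplicate" if click_id in seen else "new")
    seen.add(click_id)
```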
| Improving recommendation lists through topic diversification | | BIBAK | Full-Text | 22-32 | |
| Cai-Nicolas Ziegler; Sean M. McNee; Joseph A. Konstan; Georg Lausen | |||
| In this work we present topic diversification, a novel method designed to
balance and diversify personalized recommendation lists in order to reflect the
user's complete spectrum of interests. Although it is detrimental to average
accuracy, we show that our method improves user satisfaction with
recommendation lists, in particular for lists generated using the common
item-based collaborative filtering algorithm.
Our work builds upon prior research on recommender systems, looking at properties of recommendation lists as entities in their own right rather than specifically focusing on the accuracy of individual recommendations. We introduce the intra-list similarity metric to assess the topical diversity of recommendation lists and the topic diversification approach for decreasing the intra-list similarity. We evaluate our method using book recommendation data, including offline analysis on 361,349 ratings and an online study involving more than 2,100 subjects. Keywords: accuracy, collaborative filtering, diversification, metrics, recommender
systems | |||
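A rough sketch, under stated assumptions, of the two ideas named above: an intra-list similarity score and a greedy topic-diversification re-ranking. The scaling of the metric, the interpolation weight theta, and the toy topic-based similarity are illustrative, not the paper's exact formulation.

```python
from itertools import combinations

def intra_list_similarity(items, sim):
    """Sum of pairwise similarities of the items in a recommendation list
    (one common reading of the metric; the paper's exact scaling may differ)."""
    return sum(sim(a, b) for a, b in combinations(items, 2))

def diversify(candidates, sim, k, theta=0.5):
    """Greedy re-ranking: interpolate the original rank with dissimilarity to
    the items already picked. 'theta' plays the role of a diversification factor."""
    picked = [candidates[0]]
    rest = candidates[1:]
    while rest and len(picked) < k:
        def score(c):
            rank_score = 1.0 - rest.index(c) / len(rest)      # keep original order
            div_score = 1.0 - max(sim(c, p) for p in picked)  # prefer dissimilar items
            return (1 - theta) * rank_score + theta * div_score
        best = max(rest, key=score)
        picked.append(best)
        rest.remove(best)
    return picked

# toy example: items tagged with topics; similarity = shared-topic indicator
topics = {"b1": "scifi", "b2": "scifi", "b3": "history", "b4": "cooking"}
sim = lambda a, b: 1.0 if topics[a] == topics[b] else 0.0
print(diversify(["b1", "b2", "b3", "b4"], sim, k=3))  # ['b1', 'b3', 'b4']
```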
| GlobeDB: autonomic data replication for web applications | | BIBAK | Full-Text | 33-42 | |
| Swaminathan Sivasubramanian; Gustavo Alonso; Guillaume Pierre; Maarten van Steen | |||
| We present GlobeDB, a system for hosting Web applications that performs
autonomic replication of application data. GlobeDB offers data-intensive Web
applications the benefits of low access latencies and reduced update traffic.
The major distinction in our system compared to existing edge computing
infrastructures is that the process of distribution and replication of
application data is handled by the system automatically with very little manual
administration. We show that significant performance gains can be obtained this
way. Performance evaluations with the TPC-W benchmark over an emulated
wide-area network show that GlobeDB reduces latencies by a factor of 4 compared
to non-replicated systems and reduces update traffic by a factor of 6 compared
to fully replicated systems. Keywords: autonomic replication, data replication, edge services, performance | |||
| Hierarchical substring caching for efficient content distribution to low-bandwidth clients | | BIBAK | Full-Text | 43-53 | |
| Utku Irmak; Torsten Suel | |||
| While overall bandwidth in the internet has grown rapidly over the last few
years, and an increasing number of clients enjoy broadband connectivity, many
others still access the internet over much slower dialup or wireless links. To
address this issue, a number of techniques for optimized delivery of web and
multimedia content over slow links have been proposed, including protocol
optimizations, caching, compression, and multimedia transcoding, and several
large ISPs have recently begun to widely promote dialup acceleration services
based on such techniques. A recent paper by Rhea, Liang, and Brewer proposed an
elegant technique called value-based caching that caches substrings of files,
rather than entire files, and thus avoids repeated transmission of substrings
common to several pages or page versions.
We propose and study a hierarchical substring caching technique that provides significant savings over this basic approach. We describe several additional techniques for minimizing overheads and perform an evaluation on a large set of real web access traces that we collected. In the second part of our work, we compare our approach to a widely studied alternative approach based on delta compression, and show how to integrate the two for best overall performance. The studied techniques are typically employed in a client-proxy environment, with each proxy serving a large number of clients, and an important aspect is how to conserve resources on the proxy while exploiting the significant memory and CPU power available on current clients. Keywords: HTTP, WWW, compression, web caching, web proxies | |||
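For orientation, a simplified sketch of the value-based caching idea the paper above builds on: split a page into blocks, hash each block, and transmit only references for blocks the client already holds. Fixed-size blocks and the function names are assumptions; the actual schemes use content-defined block boundaries and a hierarchical organization.

```python
import hashlib

BLOCK = 2048  # bytes per block; real systems use content-defined boundaries

def blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def encode(page: bytes, client_cache: set):
    """Proxy side: replace blocks the client already has by their digests."""
    out = []
    for b in blocks(page):
        h = hashlib.sha1(b).hexdigest()
        if h in client_cache:
            out.append(("ref", h))       # send only the short reference
        else:
            out.append(("data", b))      # send the literal bytes
            client_cache.add(h)          # proxy tracks what the client now holds
    return out

def decode(encoded, store: dict):
    """Client side: reassemble the page from literals and cached blocks."""
    page = b""
    for kind, payload in encoded:
        if kind == "data":
            store[hashlib.sha1(payload).hexdigest()] = payload
            page += payload
        else:
            page += store[payload]
    return page

proxy_view_of_client = set()
wire1 = encode(b"A" * 5000, proxy_view_of_client)    # first download populates the cache
wire2 = encode(b"A" * 5000, proxy_view_of_client)    # repeated download
print(sum(1 for kind, _ in wire2 if kind == "ref"))  # 3: every block sent as a reference
```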
| Executing incoherency bounded continuous queries at web data aggregators | | BIBAK | Full-Text | 54-65 | |
| Rajeev Gupta; Ashish Puri; Krithi Ramamritham | |||
| Continuous queries are used to monitor changes to time varying data and to
provide results useful for online decision making. Typically a user desires to
obtain the value of some function over distributed data items, for example, to
determine when and whether (a) the traffic entering a highway from multiple
feed roads will result in congestion in a thoroughfare or (b) the value of a
stock portfolio exceeds a threshold. Using the standard Web infrastructure for
these applications will increase the reach of the underlying information. But,
since these queries involve data from multiple sources, with sources supporting
standard HTTP (pull-based) interfaces, special query processing techniques are
needed. Also, these applications often have the flexibility to tolerate some
incoherency, i.e., some differences between the results reported to the user
and those produced by the virtual database made up of the distributed data
sources.
In this paper, we develop and evaluate client-pull-based techniques for refreshing data so that the results of the queries over distributed data can be correctly reported, conforming to the limited incoherency acceptable to the users. We model as well as estimate the dynamics of the data items using a probabilistic approach based on Markov Chains. Depending on the dynamics of data we adapt the data refresh times to deliver query results with the desired coherency. The commonality of data needs of multiple queries is exploited to further reduce refresh overheads. Effectiveness of our approach is demonstrated using live sources of dynamic data: the number of refreshes it requires is (a) an order of magnitude less than what we would need if every potential update is pulled from the sources, and (b) comparable to the number of messages needed by an ideal algorithm, one that knows how to optimally refresh the data from distributed data sources. Our evaluations also bring out a very practical and attractive tradeoff property of pull based approaches, e.g., a small increase in tolerable incoherency leads to a large decrease in message overheads. Keywords: Markov model, coherency, continuous queries, fidelity, online decision
making | |||
| Fully automatic wrapper generation for search engines | | BIBAK | Full-Text | 66-75 | |
| Hongkun Zhao; Weiyi Meng; Zonghuan Wu; Vijay Raghavan; Clement Yu | |||
| When a query is submitted to a search engine, the search engine returns a
dynamically generated result page containing the result records, each of which
usually consists of a link to and/or snippet of a retrieved Web page. In
addition, such a result page often also contains information irrelevant to the
query, such as information related to the hosting site of the search engine and
advertisements. In this paper, we present a technique for automatically
producing wrappers that can be used to extract search result records from
dynamically generated result pages returned by search engines. Automatic search
result record extraction is very important for many applications that need to
interact with search engines such as automatic construction and maintenance of
metasearch engines and deep Web crawling. The novel aspect of the proposed
technique is that it utilizes both the visual content features on the result
page as displayed on a browser and the HTML tag structures of the HTML source
file of the result page. Experimental results indicate that this technique can
achieve very high extraction accuracy. Keywords: information extraction, search engine, wrapper generation | |||
| Web data extraction based on partial tree alignment | | BIBAK | Full-Text | 76-85 | |
| Yanhong Zhai; Bing Liu | |||
| This paper studies the problem of extracting data from a Web page that
contains several structured data records. The objective is to segment these
data records, extract data items/fields from them and put the data in a
database table. This problem has been studied by several researchers. However,
existing methods still have some serious limitations. The first class of
methods is based on machine learning, which requires human labeling of many
examples from each Web site that one is interested in extracting data from. The
process is time consuming due to the large number of sites and pages on the
Web. The second class of algorithms is based on automatic pattern discovery.
These methods are either inaccurate or make many assumptions. This paper
proposes a new method to perform the task automatically. It consists of two
steps, (1) identifying individual data records in a page, and (2) aligning and
extracting data items from the identified data records. For step 1, we propose
a method based on visual information to segment data records, which is more
accurate than existing methods. For step 2, we propose a novel partial
alignment technique based on tree matching. Partial alignment means that we
align only those data fields in a pair of data records that can be aligned (or
matched) with certainty, and make no commitment on the rest of the data fields.
This approach enables very accurate alignment of multiple data records.
Experimental results using a large number of Web pages from diverse domains
show that the proposed two-step technique is able to segment data records,
align and extract data from them very accurately. Keywords: data extraction, data record extraction, wrapper | |||
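A small sketch of the kind of tree matching that the partial alignment step above can build on: Simple Tree Matching, which counts the maximal top-down pairing of nodes between two tag trees. The tuple-based tree encoding is an assumption; the paper's method adds the "partial" commitment logic on top of such a matching.

```python
def simple_tree_matching(a, b):
    """Number of matching node pairs in a maximal top-down alignment of two
    trees (Yang's Simple Tree Matching). Trees are (tag, [children]) tuples."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # dynamic program over the two child sequences (edit-distance style)
    W = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            W[i][j] = max(W[i - 1][j], W[i][j - 1],
                          W[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]))
    return 1 + W[m][n]

t1 = ("tr", [("td", []), ("td", []), ("td", [])])
t2 = ("tr", [("td", []), ("td", [])])
print(simple_tree_matching(t1, t2))  # 3: the roots plus two aligned <td> children
```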
| Thresher: automating the unwrapping of semantic content from the World Wide Web | | BIBAK | Full-Text | 86-95 | |
| Andrew Hogue; David Karger | |||
| We describe Thresher, a system that lets non-technical users teach their
browsers how to extract semantic web content from HTML documents on the World
Wide Web. Users specify examples of semantic content by highlighting them in a
web browser and describing their meaning. We then use the tree edit distance
between the DOM subtrees of these examples to create a general pattern, or
wrapper, for the content, and allow the user to bind RDF classes and predicates
to the nodes of these wrappers. By overlaying matches to these patterns on
standard documents inside the Haystack semantic web browser, we enable a rich
semantic interaction with existing web pages, "unwrapping" semantic data buried
in the pages' HTML. By allowing end-users to create, modify, and utilize their
own patterns, we hope to speed adoption and use of the Semantic Web and its
applications. Keywords: RDF, haystack, semantic Web, tree edit distance, wrapper induction | |||
| The case for technology for developing regions | | BIBA | Full-Text | 96 | |
| Eric A. Brewer | |||
| Moore's Law and the wave of technologies it enabled have led to tremendous improvements in productivity and the quality of life in the industrialized world. Yet, technology has had almost no effect on the four billion people that make less than US$2000/year. In this talk I argue that the decreasing costs of computing and wireless networking make this the right time to spread the benefits of technology, and that the biggest missing piece is a lack of focus on the problems that matter, including health, education, and government. After covering some example applications that have shown very high impact, I take an early look at the research agenda for developing regions. Finally, I examine some of the pragmatic issues required to make progress on these very challenging problems. My goal is to convince high-tech researchers that technology for developing regions is an important and viable research topic. | |||
| Ranking a stream of news | | BIBAK | Full-Text | 97-106 | |
| Gianna M. Del Corso; Antonio Gullí; Francesco Romani | |||
| According to a recent survey by Nielsen NetRatings, searching on news
articles is one of the most important online activities. Indeed, Google, Yahoo,
MSN and many others have proposed commercial search engines for indexing news
feeds. Despite this commercial interest, no academic research has focused on
ranking a stream of news articles and a set of news sources. In this paper, we
introduce this problem by proposing a ranking framework which models: (1) the
process of generation of a stream of news articles, (2) the news articles
clustering by topics, and (3) the evolution of news stories over time. The
ranking algorithm proposed ranks news information, finding the most
authoritative news sources and identifying the most interesting events in the
different categories to which news articles belong. All these ranking measures
take into account time and can be obtained without a predefined sliding
window of observation over the stream. The complexity of our algorithm is
linear in the number of pieces of news still under consideration at the time of
a new posting. This allows a continuous on-line process of ranking. Our ranking
framework is validated on a collection of more than 300,000 pieces of news,
produced in two months by more than 2000 news sources belonging to 13 different
categories (World, U.S., Europe, Sports, Business, etc.). This collection is
extracted from the index of comeToMyHead, an academic news search engine
available online. Keywords: information extraction, news engines, news ranking | |||
| Algorithmic detection of semantic similarity | | BIBAK | Full-Text | 107-116 | |
| Ana G. Maguitman; Filippo Menczer; Heather Roinestad; Alessandro Vespignani | |||
| Automatic extraction of semantic information from text and links in Web
pages is key to improving the quality of search results. However, the
assessment of automatic semantic measures is limited by the coverage of user
studies, which do not scale with the size, heterogeneity, and growth of the
Web. Here we propose to leverage human-generated metadata -- namely topical
directories -- to measure semantic relationships among massive numbers of pairs
of Web pages or topics. The Open Directory Project classifies millions of URLs
in a topical ontology, providing a rich source from which semantic
relationships between Web pages can be derived. While semantic similarity
measures based on taxonomies (trees) are well studied, the design of
well-founded similarity measures for objects stored in the nodes of arbitrary
ontologies (graphs) is an open problem. This paper defines an
information-theoretic measure of semantic similarity that exploits both the
hierarchical and non-hierarchical structure of an ontology. An experimental
study shows that this measure improves significantly on the traditional
taxonomy-based approach. This novel measure allows us to address the general
question of how text and link analyses can be combined to derive measures of
relevance that are in good agreement with semantic similarity. Surprisingly,
the traditional use of text similarity turns out to be ineffective for
relevance ranking. Keywords: Web mining, Web search, content and link similarity, ranking evaluation,
semantic similarity | |||
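As background for the measure above, a hedged sketch of the taxonomy-based, information-theoretic similarity (in the spirit of Lin) that the paper generalizes to arbitrary ontology graphs. The toy topic probabilities and helper functions are illustrative assumptions, not the Open Directory data or the paper's final measure.

```python
import math

def lin_similarity(topic1, topic2, prob, lca):
    """Tree-based information-theoretic similarity: shared information content
    over total information content. 'prob' maps a topic to the fraction of
    pages filed under its subtree; 'lca' returns the lowest common ancestor."""
    shared = -math.log(prob(lca(topic1, topic2)))
    return 2 * shared / (-math.log(prob(topic1)) - math.log(prob(topic2)))

# toy ODP-like fragment: probabilities = fraction of URLs under each topic
p = {"Top": 1.0, "Science": 0.2, "Biology": 0.05, "Physics": 0.08}
parent = {"Science": "Top", "Biology": "Science", "Physics": "Science"}

def ancestors(t):
    out = [t]
    while t in parent:
        t = parent[t]
        out.append(t)
    return out

def lca(a, b):
    anc = set(ancestors(a))
    return next(t for t in ancestors(b) if t in anc)

print(lin_similarity("Biology", "Physics", p.get, lca))  # shared IC of 'Science'
```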
| SemRank: ranking complex relationship search results on the semantic web | | BIBAK | Full-Text | 117-127 | |
| Kemafor Anyanwu; Angela Maduko; Amit Sheth | |||
| While the idea that querying mechanisms for complex relationships (otherwise
known as Semantic Associations) should be integral to Semantic Web search
technologies has recently gained some ground, the issue of how search results
will be ranked remains largely unaddressed. Since it is expected that the
number of relationships between entities in a knowledge base will be much
larger than the number of entities themselves, the likelihood that Semantic
Association searches would result in an overwhelming number of results for
users is increased, therefore elevating the need for appropriate ranking
schemes. Furthermore, it is unlikely that ranking schemes for ranking entities
(documents, resources, etc.) may be applied to complex structures such as
Semantic Associations.
In this paper, we present an approach that ranks results based on how predictable a result might be for users. It is based on a relevance model SemRank, which is a rich blend of semantic and information-theoretic techniques with heuristics that supports the novel idea of modulative searches, where users may vary their search modes to effect changes in the ordering of results depending on their need. We also present the infrastructure used in the SSARK system to support the computation of SemRank values for resulting Semantic Associations and their ordering. Keywords: SemRank, discovery query, path expression tree, ranking complex
relationships, semantic Web, semantic associations search, semantic match,
semantic ranking, semantic relationship search, semantic similarity, semantic
summary | |||
| A service creation environment based on end to end composition of Web services | | BIBAK | Full-Text | 128-137 | |
| Vikas Agarwal; Koustuv Dasgupta; Neeran Karnik; Arun Kumar; Ashish Kundu; Sumit Mittal; Biplav Srivastava | |||
| The demand for quickly delivering new applications is increasingly becoming
a business imperative today. Application development is often done in an ad hoc
manner, without standard frameworks or libraries, thus resulting in poor reuse
of software assets. Web services have received much interest in industry due to
their potential in facilitating seamless business-to-business or enterprise
application integration. A web services composition tool can help automate the
process, from creating business process functionality, to developing executable
workflows, to deploying them on an execution environment. However, we find that
the main approaches taken thus far to standardize and compose web services are
piecemeal and insufficient. The business world has adopted a (distributed)
programming approach in which web service instances are described using WSDL,
composed into flows with a language like BPEL and invoked with the SOAP
protocol. Academia has propounded the AI approach of formally representing web
service capabilities in ontologies, and reasoning about their composition using
goal-oriented inferencing techniques from planning. We present the first
integrated work in composing web services end to end from specification to
deployment by synergistically combining the strengths of the above approaches.
We describe a prototype service creation environment along with a use-case
scenario, and demonstrate how it can significantly speed up the time-to-market
for new services. Keywords: Web services composition, planning, semantic Web | |||
| Ensuring required failure atomicity of composite Web services | | BIBAK | Full-Text | 138-147 | |
| Sami Bhiri; Olivier Perrin; Claude Godart | |||
| The recent evolution of the Internet, driven by Web services technology, is
extending the role of the Web from a medium for information interaction to a
middleware for B2B interactions.
Indeed, Web services technology allows enterprises to outsource parts of their business processes using Web services. It also provides the opportunity to dynamically offer new value-added services through the composition of pre-existing Web services. In spite of the growing interest in Web services, current technologies lack efficient transactional support for composite Web services (CSs). In this paper, we propose a transactional approach to ensure the failure atomicity of a CS required by its partners. We use the Accepted Termination States (ATS) property as a means to express the required failure atomicity. Partners specify their CS, mainly its control flow, and the required ATS. Then, we use a set of transactional rules to assist designers in composing a valid CS with regard to the specified ATS. Keywords: failure atomicity, reliable Web services compositions, transactional models | |||
| Web service interfaces | | BIBAK | Full-Text | 148-159 | |
| Dirk Beyer; Arindam Chakrabarti; Thomas A. Henzinger | |||
| We present a language for specifying web service interfaces. A web service
interface puts three kinds of constraints on the users of the service. First,
the interface specifies the methods that can be called by a client, together
with types of input and output parameters; these are called signature
constraints. Second, the interface may specify propositional constraints on
method calls and output values that may occur in a web service conversation;
these are called consistency constraints. Third, the interface may specify
temporal constraints on the ordering of method calls; these are called protocol
constraints. The interfaces can be used to check, first, if two or more web
services are compatible, and second, if a web service A can be safely
substituted for a web service B. The algorithm for compatibility checking
verifies that two or more interfaces fulfill each others' constraints. The
algorithm for substitutivity checking verifies that service A demands fewer and
fulfills more constraints than service B. Keywords: Web service compatibility, Web service interfaces, Web service
substitutivity, Web services, formal specification, formal verification | |||
| Building adaptable and reusable XML applications with model transformations | | BIBAK | Full-Text | 160-169 | |
| Ivan Kurtev; Klaas van den Berg | |||
| We present an approach in which the semantics of an XML language is defined
by means of a transformation from an XML document model (an XML schema) to an
application specific model. The application specific model implements the
intended behavior of documents written in the language. A transformation is
specified in a model transformation language used in the Model Driven
Architecture (MDA) approach for software development. Our approach provides a
better separation of three concerns found in XML applications: syntax, syntax
processing logic and intended meaning of the syntax. It frees the developer from
low-level syntactical details and improves the adaptability and reusability of
XML applications. Declarative transformation rules and the explicit application
model provide a finer control over the application parts affected by
adaptations. Transformation rules and the application model for an XML language
may be composed with the corresponding rules and application models defined for
other XML languages. In that way we achieve reuse and composition of XML
applications. Keywords: MDA, XML, XML processing, model transformations, transformation language | |||
| Exception handling in workflow-driven Web applications | | BIBAK | Full-Text | 170-179 | |
| Marco Brambilla; Stefano Ceri; Sara Comai; Christina Tziviskou | |||
| As the Web becomes a platform for implementing B2B applications, the need
arises for Web conceptual models describing Web-oriented workflow
applications implementing business processes. In this context, new problems
about process correctness arise, due to the loose control of Web applications
upon the behavior of their Web clients. Indeed, incoherent user behavior can
lead to inconsistent processes.
This paper presents a high level approach to the management of exceptions that occur during the execution of processes on the Web. We present a classification of exceptions that can be raised inside workflow-driven Web applications, and recovery policies to retrieve coherent status and data after an exception. We devise these concepts at high level and then we exploit them using a Web modeling language (WebML) that in turn provides development facilities like automatic code generation, validation of hypertext models, and so on. An industrial implementation experience is briefly presented too. Keywords: Web applications, exceptions, failure, navigation behavior, workflow | |||
| AwareDAV: a generic WebDAV notification framework and implementation | | BIBAK | Full-Text | 180-189 | |
| Henning Qin Jehøj; Niels Olof Bouvin; Kaj Grønbæk | |||
| WebDAV needs awareness support in order to be a full-fledged collaboration
system. This paper introduces AwareDAV, a new WebDAV extension framework
enabling shared awareness through event notification. By extending the WebDAV
protocol with seven new request-methods and an extensible XML based event
subscription scheme, AwareDAV supports fine grained event subscriptions over a
range of transport mechanisms and enables a wide range of collaboration
scenarios. This paper describes the design of AwareDAV, its API, experiences
with its initial implementation, as well as a comparison with Microsoft
Exchange and WebDAV-notify. Keywords: AwareDAV, CSCW, WebDAV, event notification | |||
| Learning domain ontologies for Web service descriptions: an experiment in bioinformatics | | BIBAK | Full-Text | 190-198 | |
| Marta Sabou; Chris Wroe; Carole Goble; Gilad Mishne | |||
| The reasoning tasks that can be performed with semantic web service
descriptions depend on the quality of the domain ontologies used to create
these descriptions. However, building such domain ontologies is a time
consuming and difficult task.
We describe an automatic extraction method that learns domain ontologies for web service descriptions from textual documentations attached to web services. We conducted our experiments in the field of bioinformatics by learning an ontology from the documentation of the web services used in myGrid, a project that supports biology experiments on the Grid. Based on the evaluation of the extracted ontology in the context of the project, we conclude that the proposed extraction method is a helpful tool to support the process of building domain ontologies for web service descriptions. Keywords: OWL-S, Web services, bioinformatics, domain ontology, ontology evaluation,
ontology learning, semantic Web | |||
| Making RDF presentable: integrated global and local semantic Web browsing | | BIBAK | Full-Text | 199-206 | |
| Lloyd Rutledge; Jacco van Ossenbruggen; Lynda Hardman | |||
| This paper discusses generating document structure from annotated media
repositories in a domain-independent manner. This approaches the vision of a
universal RDF browser. We start by applying the search-and-browse paradigm
established for the WWW to RDF presentation. Furthermore, this paper adds to
this paradigm the clustering-based derivation of document structure from search
returns, providing simple but domain-independent hypermedia generation from RDF
stores. While such generated presentations hardly meet the standards of those
written by humans, they provide quick access to media repositories when the
required document has not yet been written. The resulting system allows a user
to specify a topic for which it generates a hypermedia document providing
guided navigation through virtually any RDF repository. The impact for content
providers is that as soon as one adds new media items and their annotations to
a repository, they become immediately available for automatic integration into
subsequently requested presentations. Keywords: RDF, browsing, clustering, hypermedia generation, media archives, search,
semantic Web | |||
| Shared lexicon for distributed annotations on the Web | | BIBAK | Full-Text | 207-214 | |
| Paolo Avesani; Marco Cova | |||
| The interoperability among distributed and autonomous systems is the
ultimate challenge facing the semantic web. Heterogeneity of data
representation is the main source of problems. This paper proposes an
innovative solution that combines lexical approaches and language games. The
benefits for distributed annotation systems on the web are twofold: firstly, it
will reduce the complexity of the semantic problem by moving the focus from the
full-featured ontology level to the simpler lexicon level; secondly, it will
avoid the drawback of a centralized third party mediator that may become a
single point of failure.
The main contributions of this work are concerned with:
* providing a proof of concept that language games can be an effective solution to creating and managing a distributed process of agreement on a shared lexicon,
* describing a fully distributed service-oriented architecture for language games,
* providing empirical evidence on a real-world case study in the domain of ski mountaineering. Keywords: distributed annotations, emergent semantics, interoperability, language
games | |||
| Using XForms to simplify Web programming | | BIBAK | Full-Text | 215-224 | |
| Richard Cardone; Danny Soroker; Alpana Tiwari | |||
| The difficulty of developing and deploying commercial web applications
increases as the number of technologies they use increases and as the
interactions between these technologies become more complex. This paper
describes a way to avoid this increasing complexity by re-examining the basic
requirements of web applications. Our approach is to first separate client
concerns from server concerns, and then to reduce the interaction between
client and server to its most elemental: parameter passing. We define a
simplified programming model for form-based web applications and we use XForms
and a subset of J2EE as enabling technologies. We describe our implementation
of an MVC-based application builder for this model, which automatically
generates the code needed to marshal input and output data between clients and
servers. This marshalling uses type checking and other forms of validation on
both clients and servers. We also show how our programming model and
application builder support the customization of web applications for different
execution targets, including, for example, different client devices. Keywords: J2EE, MVC, Web application, XForms, XMLBeans, eclipse, visual builder | |||
| Web-assisted annotation, semantic indexing and search of television and radio news | | BIBAK | Full-Text | 225-234 | |
| Mike Dowman; Valentin Tablan; Hamish Cunningham; Borislav Popov | |||
| The Rich News system, which can automatically annotate radio and television
news with the aid of resources retrieved from the World Wide Web, is described.
Automatic speech recognition gives a temporally precise but conceptually
inaccurate annotation model. Information extraction from related web news sites
gives the opposite: conceptual accuracy but no temporal data. Our approach
combines the two for temporally accurate conceptual semantic annotation of
broadcast news. First low quality transcripts of the broadcasts are produced
using speech recognition, and these are then automatically divided into
sections corresponding to individual news stories. A key-phrase extraction
component finds key phrases for each story and uses these to search for web
pages reporting the same event. The text and meta-data of the web pages are then
used to create index documents for the stories in the original broadcasts,
which are semantically annotated using the KIM knowledge management platform. A
web interface then allows conceptual search and browsing of news stories, and
playing of the parts of the media files corresponding to each news story. The
use of material from the World Wide Web allows much higher quality textual
descriptions and semantic annotations to be produced than would have been
possible using the ASR transcript directly. The semantic annotations can form a
part of the Semantic Web, and an evaluation shows that the system operates with
high precision, and with a moderate level of recall. Keywords: Web search, automatic speech recognition, key-phrase extraction, media
archiving, multimedia, natural language processing, semantic Web, semantic
annotation, topical segmentation | |||
| Improving Web search efficiency via a locality based static pruning method | | BIBAK | Full-Text | 235-244 | |
| Edleno S. de Moura; Célia F. dos Santos; Daniel R. Fernandes; Altigran S. Silva; Pavel Calado; Mario A. Nascimento | |||
| The unarguably fast and continuous growth of the volume of indexed (and
indexable) documents on the Web poses a great challenge for search engines.
This is true regarding not only search effectiveness but also time and space
efficiency. In this paper we present an index pruning technique targeted for
search engines that addresses the latter issue without disregarding the
former. To this effect, we adopt a new pruning strategy capable of greatly
reducing the size of search engine indices. Experiments using a real search
engine show that our technique can reduce the indices' storage costs by up to
60% over traditional lossless compression methods, while keeping the loss in
retrieval precision to a minimum. When compared to the indices size with no
compression at all, the compression rate is higher than 88%, i.e., less than
one eighth of the original size. More importantly, our results indicate that,
due to the reduction in storage overhead, query processing time can be reduced
to nearly 65% of the original time, with no loss in average precision. The new
method yields significant improvements when compared against the best known
static pruning method for search engine indices. In addition, since our
technique is orthogonal to the underlying search algorithms, it can be adopted
by virtually any search engine. Keywords: indexing, information retrieval, pruning, search engines, web search | |||
| Sampling search-engine results | | BIBAK | Full-Text | 245-256 | |
| Aris Anagnostopoulos; Andrei Z. Broder; David Carmel | |||
| We consider the problem of efficiently sampling Web search engine query
results. In turn, using a small random sample instead of the full set of
results leads to efficient approximate algorithms for several applications,
such as:
* Determining the set of categories in a given taxonomy spanned by the search results;
* Finding the range of metadata values associated with the result set in order to enable "multi-faceted search";
* Estimating the size of the result set;
* Data mining associations to the query terms.
We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data. Keywords: WAND, sampling, search engines, weighted AND | |||
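A toy sketch of the sampling idea described above: keep each posting with probability p and intersect the sampled lists document-at-a-time. The paper's next(p) cursors achieve this lazily by skipping ahead, and they sample the result stream with a single probability rather than each term independently; the names and data here are illustrative assumptions.

```python
import random
from bisect import bisect_left

def sample_posting_list(postings, p, rng):
    """Keep each docID with probability p. The paper achieves the same effect
    lazily with a next(p) cursor that skips ahead geometrically instead of
    flipping a coin per posting; this eager version is just for illustration."""
    return [d for d in postings if rng.random() < p]

def and_query(list_a, list_b):
    """Intersect two (sorted) sampled posting lists, document-at-a-time style."""
    out, i = [], 0
    for d in list_a:
        i = bisect_left(list_b, d, i)
        if i < len(list_b) and list_b[i] == d:
            out.append(d)
    return out

rng = random.Random(0)
postings_web    = list(range(0, 1000, 2))   # docs containing "web"
postings_search = list(range(0, 1000, 3))   # docs containing "search"
# Note: sampling each term independently yields the AND result with prob p*p;
# the paper instead samples the *result* stream with a single probability p.
sample = and_query(sample_posting_list(postings_web, 0.5, rng),
                   sample_posting_list(postings_search, 0.5, rng))
print(len(sample), "sampled results for web AND search")
```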
| Three-level caching for efficient query processing in large Web search engines | | BIBAK | Full-Text | 257-266 | |
| Xiaohui Long; Torsten Suel | |||
| Large web search engines have to answer thousands of queries per second with
interactive response times. Due to the sizes of the data sets involved, often
in the range of multiple terabytes, a single query may require the processing
of hundreds of megabytes or more of index data. To keep up with this immense
workload, large search engines employ clusters of hundreds or thousands of
machines, and a number of techniques such as caching, index compression, and
index and query pruning are used to improve scalability. In particular,
two-level caching techniques cache results of repeated identical queries at the
frontend, while index data for frequently used query terms are cached in each
node at a lower level.
We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance. Keywords: Web search, caching, inverted index | |||
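A minimal sketch of the intermediate caching level proposed above: memoizing intersections of inverted lists for frequently co-occurring term pairs. An LRU cache and a toy in-memory index stand in for the paper's weighted admission and eviction policies and its on-disk lists.

```python
from functools import lru_cache

# toy inverted index: term -> sorted list of docIDs
INDEX = {
    "web":    [1, 3, 4, 7, 9, 12],
    "search": [2, 3, 7, 9, 10],
    "engine": [3, 5, 7, 11],
}

@lru_cache(maxsize=1024)  # stand-in for the paper's weighted cache policies
def cached_intersection(term_a, term_b):
    """Intermediate caching level: intersection of two terms' inverted lists."""
    a, b = INDEX[term_a], INDEX[term_b]
    return tuple(sorted(set(a) & set(b)))

def answer(query_terms):
    """Resolve a conjunctive query, reusing a cached pair intersection."""
    terms = sorted(query_terms)
    docs = set(cached_intersection(terms[0], terms[1]))
    for t in terms[2:]:
        docs &= set(INDEX[t])
    return sorted(docs)

print(answer(["web", "search", "engine"]))  # [3, 7]
```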
| Innovation for a human-centered network: NTT's R&D activities for achieving the NTT group's medium-term management strategy | | BIBA | Full-Text | 267 | |
| Yuji Inoue | |||
| This talk presents NTT's approach for realizing a Human-Centered Network. Last November, we announced the NTT Group's Medium-Term Management Strategy, which consists of three management objectives: (1) building the ubiquitous broadband market and helping achieve the e-Japan Strategy and the u-Japan Initiative; (2) building a safe, secure, and convenient communications network environment and broadband access infrastructure, while achieving a seamless migration from the legacy telephone network to the next generation network; and (3) striving to increase corporate value and achieve sustainable growth. Since the management strategy takes account of Japan's future social issues such as declining birthrate and aging population, the need to reduce the environmental load, etc, we believe that the R&D activities directed towards accomplishing these objectives consequently lead to the realization of a Human-Centered Network. | |||
| Sub-document queries over XML with XSQirrel | | BIBAK | Full-Text | 268-277 | |
| Arnaud Sahuguet; Bogdan Alexe | |||
| This paper describes XSQirrel, a new XML query language that transforms a
document into a sub-document, i.e. a tree where the root-to-leaf paths are a
subset of the root-to-leaf paths from the original document.
We show that this type of query is extremely useful for various applications (e.g. web services) and that the currently existing query languages are poorly equipped to express, reason about, and evaluate such queries. In particular, we emphasize the need to be able to compose such queries. We present the XSQirrel language with its syntax, semantics and two language-specific operators, union and composition. For the evaluation of the language, we leverage well-established query technologies by translating XSQirrel expressions into XPath programs, XQuery queries or XSLT stylesheets. We provide some experimental results that compare our various evaluation strategies. We also show the runtime benefits of query composition over sequential evaluation. Keywords: XML, XSQirrel, sub-document | |||
| XJ: facilitating XML processing in Java | | BIBAK | Full-Text | 278-287 | |
| Matthew Harren; Mukund Raghavachari; Oded Shmueli; Michael G. Burke; Rajesh Bordawekar; Igor Pechtchanski; Vivek Sarkar | |||
| The increased importance of XML as a data representation format has led to
several proposals for facilitating the development of applications that operate
on XML data. These proposals range from runtime API-based interfaces to
XML-based programming languages. The subject of this paper is XJ, a research
language that proposes novel mechanisms for the integration of XML as a
first-class construct into Java. The design goals of XJ distinguish it from
past work on integrating XML support into programming languages --
specifically, the XJ design adheres to the XML Schema and XPath standards.
Moreover, it supports in-place updates of XML data thereby keeping with the
imperative nature of Java. We have built a prototype compiler for XJ, and our
preliminary experiments demonstrate that the performance of XJ programs can
approach that of traditional low-level API-based interfaces, while providing a
higher level of abstraction. Keywords: Java, XML, language design | |||
| XQuery containment in presence of variable binding dependencies | | BIBAK | Full-Text | 288-297 | |
| Li Chen; Elke A. Rundensteiner | |||
| Semantic caching is an important technology for improving the response time
of future user queries specified over remote servers. This paper deals with the
fundamental query containment problem in an XQuery-based semantic caching
system. To the best of our knowledge, the impact of subtle differences in XQuery
semantics caused by different ways of specifying variables on query containment
has not yet been studied. We introduce the concept of variable binding
dependencies for representing the hierarchical element dependencies preserved
by an XQuery. We analyze the problem of XQuery containment in the presence of
such dependencies. We propose a containment mapping technique for nested XQuery
in presence of variable binding dependencies. The implication of the nested
block structure on XQuery containment is also considered. We mention the
performance gains achieved by a semantic caching system we built based on the
proposed technique. Keywords: XQuery containment, variable binding dependency | |||
| eBag: a ubiquitous Web infrastructure for nomadic learning | | BIBAK | Full-Text | 298-306 | |
| Christina Brodersen; Bent G. Christensen; Kaj Grønbæk; Christian Dindler; Balasuthas Sundararajah | |||
| This paper describes the eBag infrastructure, which is a generic
infrastructure inspired by work with school children who could benefit from an
electronic schoolbag for collaborative handling of their digital material. The
eBag infrastructure utilizes the Context-aware HyCon framework and
collaborative web services based on WebDAV. A ubiquitous login and logout
mechanism has been built based on Bluetooth sensor networks. The eBag
infrastructure has been tried out in field tests with school kids. In this
paper we discuss experiences and design issues for ubiquitous Web integration
in interactive school environments with multiple interactive whiteboards and
workstations. This includes proposals for specialized and adaptive XLink
structures for organizing school materials as well as issues in login/logout
based on proximity of different display surfaces. Keywords: HyCon, WebDAV, XLink, adaptive, context-aware, eBag | |||
| Online curriculum on the semantic Web: the CSD-UoC portal for peer-to-peer e-learning | | BIBAK | Full-Text | 307-314 | |
| Dimitris Kotzinos; Sofia Pediaditaki; Apostolos Apostolidis; Nikolaos Athanasis; Vassilis Christophides | |||
| Online Curriculum Portals aim to support networks of instructors and
learners by providing a space of convergence for enhancing peer-to-peer
learning interactions among individuals of an educational institution. To this
end, effective, open and scalable e-learning systems are required to acquire,
store, and share knowledge in the form of learning objects (LO). In this
paper, we are interested in exploiting the semantic relationships that
characterize these LOs (e.g., prerequisite, part-of or see-also) in order to
capture and access individual and group knowledge in conjunction with the
learning processes supported by educational institutions. To achieve this
functionality, Semantic Web technologies (e.g., RDF/S) and declarative query languages
(e.g., RQL) are employed to represent LOs and their relationships (e.g., LOM),
as well as to support navigation in the conceptual e-learning Portal space. In
this way, different LOs could be presented to the same learners, according to
the traversed schema navigation paths (i.e., learning paths). Using the Apache
Jetspeed framework we are able to generate and assemble at run-time portlets
(i.e., pluggable web components) for visualizing personalized views as dynamic
web pages. Last but not least, both learners and instructors can employ the
same Portal GUI for updating semantically described LOs and thus support an
open-ended continuum of learning. To the best of our knowledge, the work
presented in this paper is the first Online Curriculum Portal platform
supporting the aforementioned functionality. Keywords: IEEE-LOM, e-learning portals, JetSpeed portlets, semantic Web | |||
| The classroom sentinel: supporting data-driven decision-making in the classroom | | BIBAK | Full-Text | 315-321 | |
| Mark K. Singley; Richard B. Lam | |||
| Whereas schools typically record mounds of data regarding student
performance, attendance, and other behaviors over the course of a school year,
rarely is that data consulted and used to inform day-to-day instructional
practice in the classroom. As teachers come under increasing pressure to ensure
success for all of their students, we are attempting to provide tools to help
teachers make sense of what is happening in their classrooms and take
appropriate proactive and/or remedial action. One such tool is a Web service
we've dubbed the Classroom Sentinel. The Classroom Sentinel mines electronic
gradebook and other student information system data sources to detect critical
teaching and learning patterns and bring those patterns to the attention of the
teacher in the form of timely alerts. In this paper, we introduce the notion of
classroom patterns, present some examples, and describe a framework for alert
generation and delivery. Keywords: alert generation, classroom pattern detection, data integration, data-driven
decision making, teacher cognition | |||
| Topic segmentation of message hierarchies for indexing and navigation support | | BIBAK | Full-Text | 322-331 | |
| Jong Wook Kim; K. Selçuk Candan; Mehmet E. Dönderler | |||
| Message hierarchies in web discussion boards grow with new postings. Threads
of messages evolve as new postings focus within or diverge from the original
themes of the threads. Thus, just by investigating the subject headings or
contents of earlier postings in a message thread, one may not be able to guess
the contents of the later postings. The resulting navigation problem is further
compounded for blind users who need the help of a screen reader program that
can provide only a linear representation of the content. We see that, in order
to overcome the navigation obstacle for blind as well as sighted users, it is
essential to develop techniques that help identify how the content of a
discussion board grows through generalizations and specializations of topics.
This knowledge can be used in segmenting the content in coherent units and
guiding the users through segments relevant to their navigational goals. Our
experimental results showed that the segmentation algorithm described in this
paper provides up to 80-85% success rate in labeling messages. The algorithm is
being deployed in a software system to reduce the navigational load of blind
students in accessing web-based electronic course materials; however, we note
that the techniques are equally applicable for developing web indexing and
summarization tools for users with sight. Keywords: assistive technology for blind users, discussion boards, navigational aid,
segmentation | |||
| Gimme' the context: context-driven automatic semantic annotation with C-PANKOW | | BIBAK | Full-Text | 332-341 | |
| Philipp Cimiano; Günter Ladwig; Steffen Staab | |||
| Without the proliferation of formal semantic annotations, the Semantic Web
is certainly doomed to failure. In earlier work we presented a new paradigm to
avoid this: the 'Self Annotating Web', in which globally available knowledge is
used to annotate resources such as web pages. In particular, we presented a
concrete method instantiating this paradigm, called PANKOW (Pattern-based
ANnotation through Knowledge On the Web). In PANKOW, a named entity to be
annotated is put into several linguistic patterns that convey competing
semantic meanings. The patterns that are matched most often on the Web indicate
the meaning of the named entity -- leading to automatic or semi-automatic
annotation.
In this paper we present C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW. First, by downloading abstracts and processing them off-line, we avoid the generation of a large number of linguistic patterns and a correspondingly large number of Google queries. Second, by linguistically analyzing and normalizing the downloaded abstracts, we increase the coverage of our pattern matching mechanism and overcome several limitations of the earlier pattern generation process. Third, we use the annotation context in order to distinguish the significance of a pattern match for the given annotation task. Our experiments show that C-PANKOW inherits all the advantages of PANKOW (no training required etc.), but in addition it is far more efficient and effective. Keywords: annotation, information extraction, metadata, semantic Web | |||
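For intuition, a toy stand-in for the PANKOW-style pattern counting that C-PANKOW refines: instantiate a few Hearst-style patterns with a named entity and candidate concepts, count the matches, and keep the best-supported concept. The pattern set, corpus, and function names are illustrative assumptions, and counting happens here in a local text snippet rather than via web-scale hit counts.

```python
import re
from collections import Counter

PATTERNS = ["{entity} is a {concept}", "{entity}, a {concept},",
            "the {concept} {entity}"]  # Hearst-style patterns (a small subset)

def annotate(entity, candidate_concepts, corpus):
    """Count pattern matches for each candidate concept in the given text and
    return the best-supported concept. A toy stand-in for PANKOW/C-PANKOW's
    web-scale counting over downloaded abstracts."""
    votes = Counter()
    for concept in candidate_concepts:
        for pat in PATTERNS:
            phrase = pat.format(entity=entity, concept=concept)
            votes[concept] += len(re.findall(re.escape(phrase), corpus, re.I))
    return votes.most_common(1)[0] if votes else None

corpus = ("Niger is a country in Western Africa. The river Niger flows through it. "
          "The country Niger borders Nigeria.")
print(annotate("Niger", ["country", "river"], corpus))  # ('country', 2)
```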
| Opinion observer: analyzing and comparing opinions on the Web | | BIBAK | Full-Text | 342-351 | |
| Bing Liu; Minqing Hu; Junsheng Cheng | |||
| The Web has become an excellent source for gathering consumer opinions.
There are now numerous Web sites containing such opinions, e.g., customer
reviews of products, forums, discussion groups, and blogs. This paper focuses
on online customer reviews of products. It makes two contributions. First, it
proposes a novel framework for analyzing and comparing consumer opinions of
competing products. A prototype system called Opinion Observer is also
implemented. The system is such that with a single glance of its visualization,
the user is able to clearly see the strengths and weaknesses of each product in
the minds of consumers in terms of various product features. This comparison is
useful to both potential customers and product manufacturers. For a potential
customer, he/she can see a visual side-by-side and feature-by-feature
comparison of consumer opinions on these products, which helps him/her to
decide which product to buy. For a product manufacturer, the comparison enables
it to easily gather marketing intelligence and product benchmarking
information. Second, a new technique based on language pattern mining is
proposed to extract product features from Pros and Cons in a particular type of
reviews. Such features form the basis for the above comparison. Experimental
results show that the technique is highly effective and outperforms existing
methods significantly. Keywords: information extraction, opinion analysis, sentiment analysis, visualization | |||
| Towards usable Web privacy and security | | BIBA | Full-Text | 352 | |
| Lorrie Faith Cranor | |||
| Internet users now rely on a whole arsenal of tools to protect their security and privacy. Experts recommend that computer users install personal firewalls, anti-virus software, spyware blockers, spam filters, cookie managers, and a variety of other tools to keep themselves safe. Users are told to pick hard-to-guess passwords, use a different password at every Web site, and not to write any of their passwords down. They are told to read privacy policies before providing personal information to Web sites, look for lock icons before typing in a credit card number, refrain from opening email attachments from people they don't know, and even to think twice about opening email attachments from people they do know. With so many do's and don'ts, it is not surprising that much of this advice is ignored. In this talk I will highlight usability problems that make it difficult for people to protect their privacy and security on the Web, and I will discuss a number of approaches to addressing these problems. | |||
| Accessibility: a Web engineering approach | | BIBAK | Full-Text | 353-362 | |
| Peter Plessers; Sven Casteleyn; Yeliz Yesilada; Olga De Troyer; Robert Stevens; Simon Harper; Carole Goble | |||
| Currently, the vast majority of web sites do not support accessibility for
visually impaired users. Usually, these users have to rely on screen readers:
applications that sequentially read the content of a web page in audio.
Unfortunately, screen readers are not able to detect the meaning of the
different page objects, and thus the implicit semantic knowledge conveyed in
the presentation of the page is lost. One approach described in literature to
tackle this problem, is the Dante approach, which allows semantic annotation of
web pages to provide screen readers with extra (semantic) knowledge to better
facilitate the audio presentation of a web page. Until now, such annotations
were done manually, and failed for dynamic pages. In this paper, we combine the
Dante approach with a web design method, WSDM, to fully automate the generation
of the semantic annotation for visually impaired users. To do so, the semantic
knowledge gathered during the design process is exploited, and the annotations
are generated as a by-product of the design process, requiring no extra effort
from the designer. Keywords: Web engineering, accessibility, semantic Web, visual impairment | |||
| A multilingual usage consultation tool based on internet searching: more than a search engine, less than QA | | BIBAK | Full-Text | 363-371 | |
| Kumiko Tanaka-Ishii; Hiroshi Nakagawa | |||
| We present a usage consultation tool, based on Internet searching, for
language learners. When a user enters a string of words for which he wants to
find usages, the system sends this string as a query to a search engine and
obtains search results about the string. The usages are extracted by performing
statistical analysis on snippets and then fed back to the user.
Unlike existing tools, this usage consultation tool is multi-lingual, so that usages can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed that usages can be obtained more effectively than by only using a search engine directly. Also, we have found that the resulting usage does not depend on the search engine for a prominent usage when the amount of data downloaded from the search engine is increased. Keywords: question answering, text mining, usage consultation | |||
| Improving portlet interoperability through deep annotation | | BIBAK | Full-Text | 372-381 | |
| Oscar Díaz; Jon Iturrioz; Arantza Irastorza | |||
| Portlets (i.e. multi-step, user-facing applications to be syndicated within
a portal) are currently supported by most portal frameworks. However, there is
not yet a definitive answer to portlet interoperation whereby data flows
smoothly from one portlet to a neighbouring one. Both data-based and API-based
approaches exhibit some drawbacks in either the limitation of the sharing scope
or the standardization effort required. We argue that these limitations can be
overcome by using deep annotation techniques. By providing additional markup
about the background services, deep annotation strives to interact with these
underlying services rather than with the HTML surface that conveys the markup.
In this way, the portlet producer can extend a portlet markup, a fragment, with
data about the processes whose rendering this fragment supports. Then, the
portlet consumer (e.g. a portal) can use deep annotation to map an output
process in fragment A to an input process in fragment B. This mapping results
in fragment B having its input form (or other "input" widget) filled in. We
consider deep annotation as particularly valid for portlet interoperation due
to the controlled and cooperative environment that characterizes the portal
setting. Keywords: data-flow, deep-annotation, event, portal ontology, portlet interoperability | |||
| CubeSVD: a novel approach to personalized Web search | | BIBA | Full-Text | 382-390 | |
| Jian-Tao Sun; Hua-Jun Zeng; Huan Liu; Yuchang Lu; Zheng Chen | |||
| As competition in the Web search market increases, there is a high demand for personalized Web search that conducts retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted every day, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By analyzing these data, we attempt to discover Web users' interests and the patterns by which users locate information. In this paper, a novel approach, CubeSVD, is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from the MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods. | |||
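The core technical step named in the abstract is a higher-order SVD of a user x query x page tensor. The sketch below is hedged: it uses synthetic data, arbitrary truncation ranks, and no clickthrough preprocessing, so it illustrates plain HOSVD rather than the paper's full CubeSVD pipeline.

```python
import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Matricize a 3-way tensor along the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_reconstruct(tensor: np.ndarray, ranks) -> np.ndarray:
    """Truncated higher-order SVD: project onto the leading singular vectors
    of each mode unfolding, then map the core back to the original space."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])              # leading left singular vectors
    core = np.einsum('ijk,ia,jb,kc->abc', tensor, *factors)
    return np.einsum('abc,ia,jb,kc->ijk', core, *factors)

# Tiny synthetic user x query x page clickthrough tensor (values made up).
clicks = np.zeros((4, 5, 6))
clicks[0, 1, 2] = clicks[1, 1, 2] = clicks[2, 3, 4] = 1.0
smoothed = hosvd_reconstruct(clicks, ranks=(2, 2, 2))
# Entries that were zero but are now non-negligible suggest implicit associations.
print(np.round(smoothed, 2))
```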
| Automatic identification of user goals in Web search | | BIBAK | Full-Text | 391-400 | |
| Uichin Lee; Zhenyu Liu; Junghoo Cho | |||
| There has been recent interest in studying the "goal" behind a user's Web
query, so that this goal can be used to improve the quality of a search
engine's results. Previous studies have mainly focused on using manual
query-log investigation to identify Web query goals. In this paper we study
whether and how we can automate this goal-identification process. We first
present our results from a human subject study that strongly indicate the
feasibility of automatic query-goal identification. We then propose two types
of features for the goal-identification task: user-click behavior and
anchor-link distribution. Our experimental evaluation shows that by combining
these features we can correctly identify the goals for 90% of the queries
studied. Keywords: Web search, query classification, user goals | |||
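The abstract lists user-click behavior as one feature type for goal identification. A common intuition behind such features, sketched below purely for illustration (the entropy threshold and two-way labels are assumptions, not the paper's features or classifier), is that navigational queries concentrate clicks on a single result while informational queries spread them:

```python
import math
from collections import Counter
from typing import Iterable

def click_entropy(clicked_urls: Iterable[str]) -> float:
    """Shannon entropy of the click distribution over result URLs."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def guess_goal(clicked_urls, threshold: float = 1.0) -> str:
    """Low entropy: clicks pile onto one URL, suggesting a navigational goal;
    high entropy suggests an informational goal. The threshold is illustrative."""
    return "navigational" if click_entropy(clicked_urls) < threshold else "informational"

print(guess_goal(["ebay.com"] * 95 + ["ebay.de"] * 5))          # navigational
print(guess_goal(["a.org", "b.org", "c.org", "d.org"] * 25))    # informational
```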
| User-centric Web crawling | | BIBAK | Full-Text | 401-411 | |
| Sandeep Pandey; Christopher Olston | |||
| Search engines are the primary gateways of information access on the Web
today. Behind the scenes, search engines crawl the Web to populate a local
indexed repository of Web pages, used to answer user search queries. In an
aggregate sense, the Web is very dynamic, causing any repository of Web pages
to become out of date over time, which in turn causes query answer quality to
degrade. Given the considerable size, dynamicity, and degree of autonomy of the
Web as a whole, it is not feasible for a search engine to maintain its
repository exactly synchronized with the Web.
In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository. Keywords: Web crawling, Web page refreshing, user-centric | |||
| An abuse-free fair contract signing protocol based on the RSA signature | | BIBAK | Full-Text | 412-421 | |
| Guilin Wang | |||
| A fair contract signing protocol allows two potentially mistrusting parties
to exchange their commitments (i.e., digital signatures) to an agreed contract
over the Internet in a fair way, so that either each of them obtains the
other's signature, or neither party does. Based on the RSA signature scheme, a
new digital contract signing protocol is proposed in this paper. Like the
existing RSA-based solutions for the same problem, our protocol is not only
fair, but also optimistic, since the trusted third party is involved only in
situations where one party is cheating or the communication channel is
interrupted. Furthermore, the proposed protocol satisfies a new property: it is
abuse-free. That is, if the protocol is executed unsuccessfully, neither of
the two parties can show the validity of intermediate results to others.
Technical details are provided to analyze the security and performance of the
proposed protocol. In summary, we present the first abuse-free fair contract
signing protocol based on the RSA signature, and show that it is both secure
and efficient. Keywords: RSA, contract signing, cryptographic protocols, digital signatures,
e-commerce, fair-exchange, security | |||
| TrustGuard: countering vulnerabilities in reputation management for decentralized overlay networks | | BIBA | Full-Text | 422-431 | |
| Mudhakar Srivatsa; Li Xiong; Ling Liu | |||
| Reputation systems have been popular in estimating the trustworthiness and predicting the future behavior of nodes in a large-scale distributed system where nodes may transact with one another without prior knowledge or experience. One of the fundamental challenges in distributed reputation management is to understand vulnerabilities and develop mechanisms that can minimize the potential damages to a system by malicious nodes. In this paper, we identify three vulnerabilities that are detrimental to decentralized reputation management and propose TrustGuard -- a safeguard framework for providing a highly dependable and yet efficient reputation system. First, we provide a dependable trust model and a set of formal methods to handle strategic malicious nodes that continuously change their behavior to gain unfair advantages in the system. Second, a transaction-based reputation system must cope with the vulnerability that malicious nodes may misuse the system by flooding feedbacks with fake transactions. Third, but not least, we identify the importance of filtering out dishonest feedbacks when computing reputation-based trust of a node, including the feedbacks filed by malicious nodes through collusion. Our experiments show that, compared with existing reputation systems, our framework is highly dependable and effective in countering malicious nodes regarding strategic oscillating behavior, flooding malevolent feedbacks with fake transactions, and dishonest feedbacks. | |||
| Static approximation of dynamically generated Web pages | | BIBAK | Full-Text | 432-441 | |
| Yasuhiko Minamide | |||
| Server-side programming is one of the key technologies that support today's
WWW environment. It makes it possible to generate Web pages dynamically
according to a user's request and to customize pages for each user. However,
the flexibility obtained by server-side programming makes it much harder to
guarantee validity and security of dynamically generated pages.
To check statically the properties of Web pages generated dynamically by a server-side program, we develop a static program analysis that approximates the string output of a program with a context-free grammar. The approximation obtained by the analyzer can be used to check various properties of a server-side program and the pages it generates. To demonstrate the effectiveness of the analysis, we have implemented a string analyzer for the server-side scripting language PHP. The analyzer is successfully applied to publicly available PHP programs to detect cross-site scripting vulnerabilities and to validate pages they generate dynamically. Keywords: HTML validation, context-free grammars, cross-site scripting, server-side
scripting, static analysis | |||
| A search engine for natural language applications | | BIBAK | Full-Text | 442-452 | |
| Michael J. Cafarella; Oren Etzioni | |||
| Many modern natural language-processing applications utilize search engines
to locate large numbers of Web documents or to compute statistics over the Web
corpus. Yet Web search engines are designed and optimized for simple human
queries -- they are not well suited to support such applications. As a result,
these applications are forced to issue millions of successive queries resulting
in unnecessary search engine load and in slow applications with limited
scalability.
In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful <noun>" BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head(<NounPhrase>))", BE will return a list of proper nouns likely to be city names. BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries. Keywords: corpus, indexing, information extraction, language, query, search engine,
variables | |||
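As a toy illustration of the neighborhood-index idea behind BE (kept in memory rather than on disk, with a crude stand-in for part-of-speech tagging, and no claim to match BE's actual data layout), one can answer a query like "powerful <noun>" by precomputing, for every term, the terms that immediately follow it:

```python
from collections import Counter, defaultdict

# Toy "neighborhood index": for every term, remember the terms that
# immediately follow it in the corpus. BE stores this on disk so a query
# such as "powerful <noun>" needs only O(k) seeks; here we keep it in memory.
corpus = [
    "a powerful engine drives a powerful car",
    "she gave a powerful speech about powerful ideas",
]
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for left, right in zip(tokens, tokens[1:]):
        following[left][right] += 1

# Stand-in for a part-of-speech check; a real system would use a tagger.
def looks_like_noun(word: str) -> bool:
    return word not in {"the", "a", "an", "about", "drives", "gave"}

def bindings(term: str, filter_fn) -> list:
    """Answer '<term> <variable>' by reading the term's neighborhood list."""
    return [(w, c) for w, c in following[term].most_common() if filter_fn(w)]

print(bindings("powerful", looks_like_noun))
```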
| An enhanced model for searching in semantic portals | | BIBAK | Full-Text | 453-462 | |
| Lei Zhang; Yong Yu; Jian Zhou; ChenXi Lin; Yin Yang | |||
| Semantic Portal is the next generation of web portals that are powered by
Semantic Web technologies for improved information sharing and exchange for a
community of users. Current methods of searching in Semantic Portals are
limited to keyword-based search using information retrieval (IR) techniques,
ontology-based formal query and reasoning, or a simple combination of the two.
In this paper, we propose an enhanced model that tightly integrates IR with
formal query and reasoning to fully utilize both textual and semantic
information for searching in Semantic Portals. The model extends the search
capabilities of existing methods and can answer more complex search requests.
The ideas in a fuzzy description logic (DL) IR model and a formal DL query
method are employed and combined in our model. Based on the model, a semantic
search service is implemented and evaluated. The evaluation shows very large
improvements over existing methods. Keywords: fuzzy description logic, fuzzy reasoning, information retrieval, semantic
portal, semantic search | |||
| Disambiguating Web appearances of people in a social network | | BIBAK | Full-Text | 463-470 | |
| Ron Bekkerman; Andrew McCallum | |||
| Say you are looking for information about a particular person. A search
engine returns many pages for that person's name but which pages are about the
person you care about, and which are about other people who happen to have the
same name? Furthermore, if we are looking for multiple people who are related
in some way, how can we best leverage this social network? This paper presents
two unsupervised frameworks for solving this problem: one based on link
structure of the Web pages, another using Agglomerative/Conglomerative Double
Clustering (A/CDC) -- an application of a recently introduced multi-way
distributional clustering method. To evaluate our methods, we collected and
hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on
12 personal names appearing together in someone's email folder. On this
dataset our methods outperform traditional agglomerative clustering by more
than 20%, achieving over 80% F-measure. Keywords: Web appearance, document clustering, information bottleneck, link structure,
name disambiguation, social network | |||
| A convenient method for securely managing passwords | | BIBAK | Full-Text | 471-479 | |
| J. Alex Halderman; Brent Waters; Edward W. Felten | |||
| Computer users are asked to generate, keep secret, and recall an increasing
number of passwords for uses including host accounts, email servers, e-commerce
sites, and online financial services. Unfortunately, the password entropy that
users can comfortably memorize seems insufficient to store unique, secure
passwords for all these accounts, and it is likely to remain constant as the
number of passwords (and the adversary's computational power) increases into
the future. In this paper, we propose a technique that uses a strengthened
cryptographic hash function to compute secure passwords for arbitrarily many
accounts while requiring the user to memorize only a single short password.
This mechanism functions entirely on the client; no server-side changes are
needed. Unlike previous approaches, our design is both highly resistant to
brute force attacks and nearly stateless, allowing users to retrieve their
passwords from any location so long as they can execute our program and
remember a short secret. This combination of security and convenience will, we
believe, entice users to adopt our scheme. We discuss the construction of our
algorithm in detail, compare its strengths and weaknesses to those of related
approaches, and present Password Multiplier, an implementation in the form of
an extension to the Mozilla Firefox web browser. Keywords: password security, website user authentication | |||
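A minimal sketch of the general idea, deriving per-site passwords from one memorized master secret with a slow, salted hash: PBKDF2 from Python's standard library stands in for the paper's strengthened hash, and the iteration count, salt layout, and output encoding are illustrative assumptions rather than the published Password Multiplier construction.

```python
import base64
import hashlib

def site_password(master_secret: str, username: str, site: str,
                  iterations: int = 100_000, length: int = 12) -> str:
    """Derive a per-site password from one memorized master secret.

    PBKDF2 stands in for a strengthened hash; the exact construction in
    Password Multiplier differs from this sketch.
    """
    salt = f"{username}@{site}".encode("utf-8")
    key = hashlib.pbkdf2_hmac("sha256", master_secret.encode("utf-8"),
                              salt, iterations)
    # Encode the derived key into a printable password of the desired length.
    return base64.b64encode(key).decode("ascii")[:length]

# The same master secret yields unrelated-looking passwords per site.
print(site_password("correct horse battery", "alice", "example.com"))
print(site_password("correct horse battery", "alice", "bank.example"))
```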
| Improving understanding of website privacy policies with fine-grained policy anchors | | BIBAK | Full-Text | 480-488 | |
| Stephen E. Levy; Carl Gutwin | |||
| Website privacy policies state the ways that a site will use personally
identifiable information (PII) that is collected from fields and forms in
web-based transactions. Since these policies can be complex, machine-readable
versions have been developed that allow automatic comparison of a site's
privacy policy with a user's privacy preferences. However, it is still
difficult for users to determine the cause and origin of conformance conflicts,
because current standards operate at the page level -- they can only say that
there is a conflict on the page, not where the conflict occurs or what causes
it. In this paper we describe fine-grained policy anchors, an extension to the
way a website implements the Platform for Privacy Preferences (P3P), that
solves this problem. Fine-grained policy anchors enable field-level comparisons
of policy and preference, field-specific conformance displays, and faster
access to additional conformance information. We built a prototype user agent
based on these extensions and tested it with representative users. We found
that fine-grained anchors do help users understand how privacy policy relates
to their privacy preferences, and where and why conformance conflicts occur. Keywords: APPEL, P3P, conformance conflicts, e-commerce, privacy policies, privacy
preferences, user agents | |||
| Hardening Web browsers against man-in-the-middle and eavesdropping attacks | | BIBAK | Full-Text | 489-498 | |
| Haidong Xia; José Carlos Brustoloni | |||
| Existing Web browsers handle security errors in a manner that often confuses
users. In particular, when a user visits a secure site whose certificate the
browser cannot verify, the browser typically allows the user to view and
install the certificate and connect to the site despite the verification
failure. However, few users understand the risk of man-in-the-middle attacks
and the principles behind certificate-based authentication. We propose
context-sensitive certificate verification (CSCV), whereby the browser
interrogates the user about the context in which a certificate verification
error occurs. Considering the context, the browser then guides the user in
handling and possibly overcoming the security error. We also propose specific
password warnings (SPW) when users are about to send passwords in a form
vulnerable to eavesdropping. We performed user studies to evaluate CSCV and
SPW. Our results suggest that CSCV and SPW can greatly improve Web browsing
security and are easy to use even without training. Moreover, CSCV had greater
impact than did staged security training. Keywords: HTTPS, SSL, Web browser, certificate, eavesdropping attack, just-in-time
instruction, man-in-the-middle attack, password, safe staging, well-in-advance
instruction | |||
| ATMEN: a triggered network measurement infrastructure | | BIBAK | Full-Text | 499-509 | |
| Balachander Krishnamurthy; Harsha V. Madhyastha; Oliver Spatscheck | |||
| Web performance measurements and availability tests have been carried out
using a variety of infrastructures over the last several years. Disruptions in
the Internet can lead to Web sites being unavailable or increase user-perceived
latency. The unavailability could be due to DNS problems, failures in segments of the
physical network cutting off thousands of users, or attacks. Prompt reactions
to network-wide events can be facilitated by local or remote measurement and
monitoring. Better yet, a distributed set of intercommunicating measurement and
monitoring entities that react to events dynamically could go a long way to
handle disruptions.
We have designed and built ATMEN, a triggered measurement infrastructure to communicate and coordinate across various administrative entities. ATMEN nodes can trigger new measurements, query ongoing passive measurements or historical measurements stored on remote nodes, and coordinate the responses to make local decisions. ATMEN reduces wasted measurements by judiciously reusing measurements along three axes: spatial, temporal, and application. We describe the use of ATMEN for key Web applications such as performance based ranking of popular Web sites and availability of DNS servers on which most Web transactions are dependent. The evaluation of ATMEN is done using multiple network monitoring entities called Gigascopes installed across the USA, measurement data of a popular network application involving millions of users distributed across the Internet, and scores of clients to aid in gathering measurement information upon demand. Our results show that such a system can be built in a scalable fashion. Keywords: DNS availability, measurement, measurement infrastructure, performance-based
ranking, reuse, triggered measurements | |||
| On the lack of typical behavior in the global Web traffic network | | BIBAK | Full-Text | 510-518 | |
| Mark Meiss; Filippo Menczer; Alessandro Vespignani | |||
| We offer the first large-scale analysis of Web traffic based on network flow
data. Using data collected on the Internet2 network, we constructed a weighted
bipartite client-server host graph containing more than 18 × 10^6 vertices
and 68 × 10^6 edges valued by relative traffic flows. When considered as a
traffic map of the World-Wide Web, the generated graph provides valuable
information on the statistical patterns that characterize the global
information flow on the Web. Statistical analysis shows that client-server
connections and traffic flows exhibit heavy-tailed probability distributions
lacking any typical scale. In particular, the absence of an intrinsic average
in some of the distributions implies the absence of a prototypical scale
appropriate for server design, Web-centric network design, or traffic modeling.
The inspection of the amount of traffic handled by clients and servers and
their number of connections highlights non-trivial correlations between
information flow and patterns of connectivity as well as the presence of
anomalous statistical patterns related to the behavior of users on the Web. The
results presented here may impact considerably the modeling, scalability
analysis, and behavioral study of Web applications. Keywords: Web usage, degree, network flows, power laws, scale-free networks, strength,
traffic statistics | |||
| Analysis of multimedia workloads with implications for internet streaming | | BIBA | Full-Text | 519-528 | |
| Lei Guo; Songqing Chen; Zhen Xiao; Xiaodong Zhang | |||
| In this paper, we study the media workload collected from a large number of commercial Web sites hosted by a major ISP and that collected from a large group of home users connected to the Internet via a well-known cable company. Some of our key findings are: (1) Surprisingly, the majority of media contents are still delivered via downloading from Web servers. (2) A substantial percentage of media downloading connections are aborted before completion due to the long waiting time. (3) A hybrid approach, pseudo streaming, is used by clients to imitate real streaming. (4) The mismatch between the downloading rate and the client playback speed in pseudo streaming is common, which either causes frequent playback delays to the clients, or unnecessary traffic to the Internet. (5) Compared with streaming, downloading and pseudo streaming are neither bandwidth efficient nor performance effective. To address this problem, we propose the design of AutoStream, an innovative system that can provide additional previewing and streaming services automatically for media objects hosted on standard Web sites in server farms at the client's will. | |||
| Real and the future of digital media | | BIB | Full-Text | 529 | |
| Rob Glaser | |||
| On optimal service selection | | BIBAK | Full-Text | 530-538 | |
| P. A. Bonatti; P. Festa | |||
| While many works have been devoted to service matchmaking and modeling
nonfunctional properties, the problem of matching service requests to offers in
an optimal way has not yet been extensively studied. In this paper we formalize
three kinds of optimal service selection problems, based on different criteria.
Then we study their complexity and implement solutions. We prove that one-time
costs make the optimal selection problem computationally hard; in the absence
of these costs the problem can be solved in polynomial time. We designed and
implemented both exact and heuristic (suboptimal) algorithms for the hard case,
and carried out a preliminary experimental evaluation with interesting results. Keywords: automatic service composition, nonfunctional properties, service
matchmaking, service selection problem | |||
| G-ToPSS: fast filtering of graph-based metadata | | BIBAK | Full-Text | 539-547 | |
| Milenko Petrovic; Haifeng Liu; Hans-Arno Jacobsen | |||
| RDF is increasingly being used to represent metadata. RDF Site Summary (RSS)
is an application of RDF on the Web that has considerably grown in popularity.
However, the way RSS systems operate today does not scale well. In this paper
we introduce G-ToPSS, a scalable publish/subscribe system for selective
information dissemination. G-ToPSS is particularly well suited for applications
that deal with large-volume content distribution from diverse sources. RSS is
an instance of the content distribution problem. G-ToPSS allows the use of an ontology
as a way to provide additional information about the data. Furthermore, in this
paper we show how G-ToPSS can support RDFS class taxonomies. We have
implemented and experimentally evaluated G-ToPSS and we provide results in the
paper demonstrating its scalability compared to alternatives. Keywords: RDF, content-based routing, graph matching, information dissemination,
publish/subscribe | |||
| Automating metadata generation: the simple indexing interface | | BIBAK | Full-Text | 548-556 | |
| Kris Cardinaels; Michael Meire; Erik Duval | |||
| In this paper, we focus on the development of a framework for automatic
metadata generation. The first step towards this framework is the definition of
an Application Programmer Interface (API), which we call the Simple Indexing
Interface (SII). The second step is the definition of a framework for
implementation of the SII. Both steps are presented in some detail in this
paper. We also report on empirical evaluation of the metadata that the SII and
supporting framework generated in a real-life context. Keywords: learning objects, metadata generation | |||
| PageRank as a function of the damping factor | | BIBAK | Full-Text | 557-566 | |
| Paolo Boldi; Massimo Santini; Sebastiano Vigna | |||
| PageRank is defined as the stationary state of a Markov chain. The chain is
obtained by perturbing the transition matrix induced by a web graph with a
damping factor α that spreads uniformly part of the rank. The choice of
α is eminently empirical, and in most cases the original suggestion
α = 0.85 by Brin and Page is still used. Recently, however, the behaviour
of PageRank with respect to changes in α was discovered to be useful in
link-spam detection [21]. Moreover, an analytical justification of the value
chosen for α is still missing. In this paper, we give the first
mathematical analysis of PageRank when α changes. In particular, we show
that, contrary to popular belief, for real-world graphs values of α
close to 1 do not give a more meaningful ranking. Then, we give closed-form
formulae for PageRank derivatives of any order, and an extension of the Power
Method that approximates them with convergence O(t^k α^t) for
the k-th derivative. Finally, we show a tight connection between iterated
computation and analytical behaviour by proving that the k-th iteration of the
Power Method gives exactly the PageRank value obtained using a Maclaurin
polynomial of degree k. The latter result paves the way towards the application
of analytical methods to the study of PageRank. Keywords: PageRank, Web graph, approximation | |||
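A plain power-iteration sketch with α as an explicit parameter makes the paper's observation easy to reproduce on toy graphs: as α approaches 1, rank concentrates in link sinks such as the self-looping node below. This is ordinary PageRank iteration, not the paper's derivative formulas or extended Power Method.

```python
import numpy as np

def pagerank(adj: np.ndarray, alpha: float, iters: int = 100) -> np.ndarray:
    """Plain power iteration for PageRank with damping factor alpha.
    Dangling nodes are handled by redistributing their rank uniformly."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        spread = np.zeros(n)
        for i in range(n):
            if out_deg[i] > 0:
                spread += p[i] * adj[i] / out_deg[i]
            else:
                spread += p[i] / n          # dangling node
        p = alpha * spread + (1 - alpha) / n
    return p

# A small graph: node 3 only links to itself, which dominates as alpha -> 1.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 0, 1]], dtype=float)
for a in (0.5, 0.85, 0.99):
    print(a, np.argsort(-pagerank(adj, a)))
```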
| Object-level ranking: bringing order to Web objects | | BIBAK | Full-Text | 567-574 | |
| Zaiqing Nie; Yuanzhi Zhang; Ji-Rong Wen; Wei-Ying Ma | |||
| In contrast with the current Web search methods that essentially do
document-level ranking and retrieval, we are exploring a new paradigm to enable
Web search at the object level. We collect Web information for objects relevant
for a specific application domain and rank these objects in terms of their
relevance and popularity to answer user queries. The traditional PageRank model is
no longer valid for object popularity calculation because of the existence of
heterogeneous relationships between objects. This paper introduces PopRank, a
domain-independent object-level link analysis model to rank the objects within
a specific domain. Specifically we assign a popularity propagation factor to
each type of object relationship, study how different popularity propagation
factors for these heterogeneous relationships could affect the popularity
ranking, and propose efficient approaches to automatically decide these
factors. Our experiments are done using 1 million CS papers, and the
experimental results show that PopRank can achieve significantly better ranking
results than naively applying PageRank on the object graph. Keywords: PageRank, PopRank, Web information retrieval, Web objects, link analysis | |||
| A uniform approach to accelerated PageRank computation | | BIBAK | Full-Text | 575-582 | |
| Frank McSherry | |||
| In this note we consider a simple reformulation of the traditional power
iteration algorithm for computing the stationary distribution of a Markov
chain. Rather than communicate their current probability values to their
neighbors at each step, nodes instead communicate only changes in probability
value. This reformulation enables a large degree of flexibility in the manner
in which nodes update their values, leading to an array of optimizations and
features, including faster convergence, efficient incremental updating, and a
robust distributed implementation.
While the spirit of many of these optimizations appears in previous literature, we observe several cases where this unification simplifies previous work, removing technical complications and extending their range of applicability. We implement and measure the performance of several optimizations on a sizable (34M node) web subgraph, seeing significant composite performance gains, especially for the case of incremental recomputation after changes to the web graph. Keywords: PageRank, link analysis, random walks, web graph | |||
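The sketch below illustrates the change-propagation idea in its best-known sequential form, a residual "push" iteration in which a node does work only when its accumulated change exceeds a tolerance; the paper's exact update rules, incremental recomputation, and distributed implementation are not reproduced here.

```python
from collections import defaultdict

def pagerank_push(graph: dict, alpha: float = 0.85, tol: float = 1e-8) -> dict:
    """Change-propagation PageRank: instead of broadcasting full values every
    round, each node pushes only its outstanding change ('residual') to its
    out-neighbors, and nodes with a negligible residual do no work at all."""
    n = len(graph)
    rank = defaultdict(float)
    residual = {u: 1.0 / n for u in graph}          # uniform teleport vector
    queue = [u for u in graph if residual[u] > tol]
    while queue:
        u = queue.pop()
        change = residual.pop(u, 0.0)
        if change <= tol:
            continue
        rank[u] += (1 - alpha) * change             # settle part of the change
        out = graph[u] or list(graph)               # dangling: spread everywhere
        for v in out:
            before = residual.get(v, 0.0)
            residual[v] = before + alpha * change / len(out)
            if before <= tol < residual[v]:
                queue.append(v)                     # v now has work to do
    return dict(rank)

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank_push(graph))
```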
| Information search and re-access strategies of experienced web users | | BIBAK | Full-Text | 583-592 | |
| Anne Aula; Natalie Jhaveri; Mika Käki | |||
| Experienced web users have strategies for information search and re-access
that are not directly supported by web browsers or search engines. We studied
how prevalent these strategies are and whether even experienced users have
problems with searching and re-accessing information. With this aim, we
conducted a survey with 236 experienced web users. The results showed that this
group has frequently used key strategies (e.g., using several browser windows
in parallel) that they find important, whereas some of the strategies that have
been suggested in previous studies are clearly less important for them (e.g.,
including URLs on a webpage). In some aspects, such as query formulation, this
group resembles less experienced web users. For instance, we found that most of
the respondents had misconceptions about how their search engine handles
queries, as well as other problems with information search and re-access. In
addition to presenting the prevalence of the strategies and rationales for
their use, we present concrete design solutions and ideas for making the key
strategies available to less experienced users as well. Keywords: experienced web users, information re-access, questionnaire study, web
search | |||
| Browsing fatigue in handhelds: semantic bookmarking spells relief | | BIBAK | Full-Text | 593-602 | |
| Saikat Mukherjee; I. V. Ramakrishnan | |||
| Focused Web browsing activities such as periodically looking up headline
news, weather reports, etc., which require only selective fragments of
particular Web pages, can be made more efficient for users of
limited-display-size handheld mobile devices by delivering only the target
fragments. Semantic bookmarks provide a robust conceptual framework for
recording and retrieving such targeted content not only from the specific pages
used in creating the bookmarks but also from any user-specified page with
similar content semantics. This paper describes a technique for realizing
semantic bookmarks by coupling machine learning with Web page segmentation to
create a statistical model of the bookmarked content. These models are used to
identify and retrieve the bookmarked content from Web pages that share a common
content domain. In contrast to ontology-based approaches where semantic
bookmarks are limited to available concepts in the ontology, the learning-based
approach allows users to bookmark ad-hoc personalized semantic concepts to
effectively target content that fits the limited display of handhelds. We
provide a user evaluation measuring the effectiveness of a prototype
implementation of learning-based semantic bookmarking at reducing browsing
fatigue in handhelds. Keywords: Web page partitioning, handheld device content adaptation, semantic
bookmarking | |||
| WebPod: persistent Web browsing sessions with pocketable storage devices | | BIBAK | Full-Text | 603-612 | |
| Shaya Potter; Jason Nieh | |||
| We present WebPod, a portable system that enables mobile users to use the
same persistent, personalized web browsing session on any Internet-enabled
device. No matter what computer is being used, WebPod provides a consistent
browsing session, maintaining all of a user's plugins, bookmarks, browser web
content, open browser windows, and browser configuration options and
preferences. This is achieved by leveraging rapid improvements in capacity,
cost, and size of portable storage devices. WebPod provides a virtualization
and checkpoint/restart mechanism that decouples the browsing environment from
the host, enabling web browsing sessions to be suspended to portable storage,
carried around, and resumed from the storage device on another computer. WebPod
virtualization also isolates web browsing sessions from the host, protecting
the browsing privacy of the user and preventing malicious web content from
damaging the host. We have implemented a Linux WebPod prototype and demonstrate
its ability to quickly suspend and resume web browsing sessions, enabling a
seamless web browsing experience for mobile users as they move among computers. Keywords: checkpoint/restart, portable storage, process migration, virtualization, web
browsing | |||
| Named graphs, provenance and trust | | BIBAK | Full-Text | 613-622 | |
| Jeremy J. Carroll; Christian Bizer; Pat Hayes; Patrick Stickler | |||
| The Semantic Web consists of many RDF graphs nameable by URIs. This paper
extends the syntax and semantics of RDF to cover such Named Graphs. This
enables RDF statements that describe graphs, which is beneficial in many
Semantic Web application areas. As a case study, we explore the application
area of Semantic Web publishing: Named Graphs allow publishers to communicate
assertional intent, and to sign their graphs; information consumers can
evaluate specific graphs using task-specific trust policies, and act on
information from those Named Graphs that they accept. Graphs are trusted
depending on: their content; information about the graph; and the task the user
is performing. The extension of RDF to Named Graphs provides a formally defined
framework to be a foundation for the Semantic Web trust layer. Keywords: RDF, provenance, semantic Web, trust | |||
| OWL DL vs. OWL flight: conceptual modeling and reasoning for the semantic Web | | BIBAK | Full-Text | 623-632 | |
| Jos de Bruijn; Rubén Lara; Axel Polleres; Dieter Fensel | |||
| The Semantic Web languages RDFS and OWL have been around for some time now.
However, the presence of these languages has not brought the breakthrough of
the Semantic Web the creators of the languages had hoped for. OWL has a number
of problems in the area of interoperability and usability in the context of
many practical application scenarios which impede the connection to the
Software Engineering and Database communities. In this paper we present OWL
Flight, which is loosely based on OWL, but whose semantics is grounded in Logic
Programming rather than Description Logics; it also borrows the constraint-based
modeling style common in databases. This results in different types of modeling
primitives and enforces a different style of ontology modeling. We analyze the
modeling paradigms of OWL DL and OWL Flight, as well as reasoning tasks
supported by both languages. We argue that different applications on the
Semantic Web require different styles of modeling and thus both types of
languages are required for the Semantic Web. Keywords: description logics, logic programming, ontologies, semantic Web | |||
| Debugging OWL ontologies | | BIBAK | Full-Text | 633-640 | |
| Bijan Parsia; Evren Sirin; Aditya Kalyanpur | |||
| As an increasingly large number of OWL ontologies become available on the
Semantic Web and the descriptions in the ontologies become more complicated,
finding the cause of errors becomes an extremely hard task even for experts.
Existing ontology development environments provide some limited support, in
conjunction with a reasoner, for detecting and diagnosing errors in OWL
ontologies. Typically these are restricted to the mere detection of, for
example, unsatisfiable concepts. We have integrated a number of simple
debugging cues generated from our description logic reasoner, Pellet, in our
hypertextual ontology development environment, Swoop. These cues, in
conjunction with extensive undo/redo and Annotea based collaboration support in
Swoop, significantly improve the OWL debugging experience, and point the way to
more general improvements in the presentation of an ontology to new users. Keywords: OWL, explanation, ontology engineering, semantic Web | |||
| Scaling link-based similarity search | | BIBAK | Full-Text | 641-650 | |
| Dániel Fogaras; Balázs Rácz | |||
| To exploit the similarity information hidden in the hyperlink structure of
the web, this paper introduces algorithms scalable to graphs with billions of
vertices on a distributed architecture. The similarity of multi-step
neighborhoods of vertices is numerically evaluated by similarity functions
including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel
variant with better theoretical characteristics; and the Jaccard coefficient,
extended to multi-step neighborhoods. Our methods are presented in a general
framework of Monte Carlo similarity search algorithms that precompute an index
database of random fingerprints, and at query time, similarities are estimated
from the fingerprints. The performance and quality of the methods were tested
on the Stanford Webbase [19] graph of 80M pages by comparing our scores to
similarities extracted from the ODP directory [26]. Our experimental results
suggest that the hyperlink structure of vertices within four to five steps
provide more adequate information for similarity search than single-step
neighborhoods. Keywords: fingerprint, link-analysis, scalability, similarity search | |||
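The abstract mentions extending the Jaccard coefficient to multi-step neighborhoods and estimating similarities from precomputed random fingerprints. The following sketch uses min-wise hashing of 2-step out-neighborhoods to illustrate that fingerprint-then-estimate pattern; the paper's SimRank and PSimRank estimators and its distributed index are not reproduced.

```python
import random

def k_step_neighborhood(graph: dict, node: str, k: int = 2) -> frozenset:
    """All vertices reachable from `node` in at most k steps (excluding itself)."""
    frontier, seen = {node}, set()
    for _ in range(k):
        frontier = {w for v in frontier for w in graph.get(v, [])}
        seen |= frontier
    return frozenset(seen - {node})

def minhash_fingerprint(items: frozenset, num_hashes: int = 64, seed: int = 0):
    """Min-wise fingerprint: one minimum per random hash function.
    Precomputed once per vertex; similarities are then estimated from it."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return tuple(min((hash((salt, x)) for x in items), default=-1) for salt in salts)

def estimated_jaccard(fp_a, fp_b) -> float:
    """The fraction of matching minima estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)

graph = {"u": ["x", "y"], "v": ["x", "y"], "w": ["z"],
         "x": ["p"], "y": ["p"], "z": ["q"], "p": [], "q": []}
fps = {n: minhash_fingerprint(k_step_neighborhood(graph, n)) for n in graph}
print(estimated_jaccard(fps["u"], fps["v"]))   # near 1: same 2-step neighborhood
print(estimated_jaccard(fps["u"], fps["w"]))   # near 0: disjoint neighborhoods
```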
| LSH forest: self-tuning indexes for similarity search | | BIBAK | Full-Text | 651-660 | |
| Mayank Bawa; Tyson Condie; Prasanna Ganesan | |||
| We consider the problem of indexing high-dimensional data for answering
(approximate) similarity-search queries. Similarity indexes prove to be
important in a wide variety of settings: Web search engines desire fast,
parallel, main-memory-based indexes for similarity search on text data;
database systems desire disk-based similarity indexes for high-dimensional
data, including text and images; peer-to-peer systems desire distributed
similarity indexes with low communication cost. We propose an indexing scheme
called LSH Forest which is applicable in all the above contexts. Our index uses
the well-known technique of locality-sensitive hashing (LSH), but improves upon
previous designs by (a) eliminating the different data-dependent parameters for
which LSH must be constantly hand-tuned, and (b) improving on LSH's performance
guarantees for skewed data distributions while retaining the same storage and
query overhead. We show how to construct this index in main memory, on disk, in
parallel systems, and in peer-to-peer systems. We evaluate the design with
experiments on multiple text corpora and demonstrate both the self-tuning
nature and the superior performance of LSH Forest. Keywords: peer-to-peer (P2P), similarity indexes | |||
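A much-reduced sketch of the variable-length-label idea: each point receives a bit label from random hyperplanes, every prefix of the label is indexed, and a query backs off from the longest matching prefix until enough candidates are found. The class name, parameters, and single-tree design are illustrative simplifications, not the published LSH Forest construction.

```python
import numpy as np
from collections import defaultdict

class TinyLSHTree:
    """One 'tree' of an LSH-Forest-like index: each point gets a variable-length
    bit label from random hyperplanes, and every prefix of the label is indexed.
    Querying walks from the longest matching prefix outwards until enough
    candidates are found (a simplification of the paper's descent steps)."""
    def __init__(self, dim: int, max_label: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((max_label, dim))
        self.buckets = defaultdict(list)      # label prefix -> point ids
        self.max_label = max_label

    def _label(self, x: np.ndarray) -> str:
        return "".join("1" if d > 0 else "0" for d in self.planes @ x)

    def add(self, pid, x: np.ndarray) -> None:
        label = self._label(x)
        for l in range(1, self.max_label + 1):
            self.buckets[label[:l]].append(pid)

    def query(self, x: np.ndarray, want: int = 3):
        label = self._label(x)
        for l in range(self.max_label, 0, -1):   # back off to shorter prefixes
            cands = self.buckets.get(label[:l], [])
            if len(cands) >= want:
                return cands[:want]
        return self.buckets.get(label[:1], [])

rng = np.random.default_rng(1)
points = {i: rng.standard_normal(8) for i in range(50)}
index = TinyLSHTree(dim=8)
for pid, vec in points.items():
    index.add(pid, vec)
print(index.query(points[7] + 0.01 * rng.standard_normal(8)))
```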
| Partitioning of Web graphs by community topology | | BIBAK | Full-Text | 661-669 | |
| Hidehiko Ino; Mineichi Kudo; Atsuyoshi Nakamura | |||
| We introduce a stricter Web community definition to overcome the boundary
ambiguity of Web communities as defined by Flake, Lawrence and Giles [2], and
consider the problem of finding communities that satisfy our definition. We
discuss how to find such communities and the hardness of this problem.
We also propose Web page partitioning by equivalence relation defined using the class of communities of our definition. Though the problem of efficiently finding all communities of our definition is NP-complete, we propose an efficient method of finding a subclass of communities among the sets partitioned by each of n-1 cuts represented by a Gomory-Hu tree [10], and partitioning a Web graph by equivalence relation defined using the subclass. According to our preliminary experiments, partitioning by our method divided the pages retrieved by keyword search into several different categories to some extent. Keywords: Web community, graph partitioning, maximum flow algorithm | |||
| Incremental maintenance for materialized XPath/XSLT views | | BIBAK | Full-Text | 671-681 | |
| Makoto Onizuka; Fong Yee Chan; Ryusuke Michigami; Takashi Honishi | |||
| This paper proposes an incremental maintenance algorithm that efficiently
updates the materialized XPath/XSLT views defined using XPath expressions in
XP([],*,//,vars). The algorithm consists of two processes. 1) The dynamic
execution flow of an XSLT program is stored as an XT (XML Transformation) tree
during the full transformation. 2) In response to a source XML data update, the
impacted portions of the XT-tree are identified and maintained by partially
re-evaluating the XSLT program. This paper discusses the XPath/XSLT features of
incremental view maintenance for subtree insertion/deletion and applies them to
the maintenance algorithm. Experiments show that the incremental maintenance
algorithm outperforms full XML transformation algorithms by factors of up to
500. Keywords: XML, XPath, XSLT, materialized view, view maintenance | |||
| Compiling XSLT 2.0 into XQuery 1.0 | | BIBAK | Full-Text | 682-691 | |
| Achille Fokoue; Kristoffer Rose; Jérôme Siméon; Lionel Villard | |||
| As XQuery is gathering momentum as the standard query language for XML,
there is a growing interest in using it as an integral part of the XML
application development infrastructure. In that context, one question which is
often raised is how well XQuery interoperates with other XML languages, and
notably with XSLT. XQuery 1.0 [16] and XSLT 2.0 [7] share a lot in common: they
share XPath 2.0 as a common sub-language and have the same expressiveness.
However, they are based on fairly different programming paradigms. While XSLT
has adopted a highly declarative template based approach, XQuery relies on a
simpler, and more operational, functional approach.
In this paper, we present an approach to compile XSLT 2.0 into XQuery 1.0, and a working implementation of that approach. The compilation rules explain how XSLT's template-based approach can be implemented using the functional approach of XQuery and underpin the tight connection between the two languages. The resulting compiler can be used to migrate an XSLT code base to XQuery, or to enable the use of XQuery runtimes (e.g., as will soon be provided by most relational database management systems) for XSLT users. We also identify a number of areas where compatibility between the two languages could be improved. Finally, we show experiments on actual XSLT stylesheets, demonstrating the applicability of the approach in practice. Keywords: Web services, XML, XQuery, XSLT | |||
| An adaptive, fast, and safe XML parser based on byte sequences memorization | | BIBAK | Full-Text | 692-701 | |
| Toshiro Takase; Hisashi Miyashita; Toyotaro Suzumura; Michiaki Tatsubori | |||
| XML (Extensible Markup Language) processing can incur significant runtime
overhead in XML-based infrastructural middleware such as Web service
application servers. This paper proposes a novel mechanism for efficiently
processing similar XML documents. Given a new XML document as a byte sequence,
the XML parser proposed in this paper normally avoids syntactic analysis but
simply matches the document with previously processed ones, reusing those
results. Our parser is adaptive since it partially parses and then remembers
XML document fragments that it has not met before. Moreover, it is safe,
since its partial parsing correctly checks the well-formedness of
documents. Our implementation of the proposed parser complies with the JSR 63
standard of the Java API for XML Processing (JAXP) 1.1 specification. We
evaluated the performance of our parser, Deltarser, with messages from Google Web services.
Compared to Piccolo (and Apache Xerces), it parses 35% (106%)
faster in a server-side use-case scenario, and 73% (126%) faster in a
client-side use-case scenario. Keywords: SAX, XML parsers, automata | |||
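As a drastically simplified illustration of reusing parse results for previously seen byte sequences (whole documents here, whereas Deltarser memoizes fragments and replays SAX events while still checking well-formedness), one might cache parsed trees keyed by a digest of the input bytes:

```python
import hashlib
import xml.etree.ElementTree as ET

class MemoizingParser:
    """Reuse parse results for byte sequences seen before.

    Deltarser memoizes at the level of document *fragments*; this sketch only
    memoizes whole documents, which already helps when identical messages
    (e.g., repeated SOAP requests) arrive many times."""
    def __init__(self):
        self.cache = {}

    def parse(self, data: bytes) -> ET.Element:
        key = hashlib.sha1(data).digest()
        if key not in self.cache:                 # unseen bytes: really parse
            self.cache[key] = ET.fromstring(data)
        return self.cache[key]

parser = MemoizingParser()
msg = b"<Envelope><Body><Echo>hello</Echo></Body></Envelope>"
tree1 = parser.parse(msg)
tree2 = parser.parse(msg)                         # second call hits the cache
print(tree1 is tree2, tree1.find("./Body/Echo").text)
```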
| CaTTS: calendar types and constraints for Web applications | | BIBAK | Full-Text | 702-711 | |
| François Bry; Frank-André Ries; Stephanie Spranger | |||
| Data referring to cultural calendars such as the widespread Gregorian dates
but also dates after the Chinese, Hebrew, or Islamic calendars as well as data
referring to professional calendars like fiscal years or teaching terms are
omnipresent on the Web. Formalisms such as XML Schema have acknowledged this by
offering a rather extensive set of Gregorian dates and times as basic data
types. This article introduces CaTTS, the Calendar and Time Type System.
CaTTS goes far beyond the predefined Gregorian date and time types
supported by XML Schema. CaTTS first makes it possible to declaratively
specify more or less complex cultural or professional calendars, including
specificities such as leap seconds, leap years, and time zones. CaTTS further
offers a tool for static type checking of data typed after calendar(s)
defined in CaTTS. CaTTS finally offers a language for declaratively expressing
and a solver for efficiently solving temporal constraints (referring to
calendar(s) expressed in CaTTS). CaTTS complements data modeling and reasoning
methods designed for generic Semantic Web applications such as RDF or OWL with
methods specific to the particular application domain of calendars and time. Keywords: Web reasoning, calendars, time, types | |||
| Expressiveness of XSDs: from practice to theory, there and back again | | BIBAK | Full-Text | 712-721 | |
| Geert Jan Bex; Wim Martens; Frank Neven; Thomas Schwentick | |||
| On an abstract level, XML Schema increases the limited expressive power of
Document Type Definitions (DTDs) by extending them with a recursive typing
mechanism. However, an investigation of the XML Schema Definitions (XSDs)
occurring in practice reveals that the vast majority of them are structurally
equivalent to DTDs. This might be due to the complexity of the XML Schema
specification and the difficulty of understanding the effect of constraints on
typing and validation of schemas. To shed some light on the actual expressive
power of XSDs this paper studies the impact of the Element Declarations
Consistent (EDC) and the Unique Particle Attribution (UPA) rule. An equivalent
formalism based on contextual patterns rather than on recursive types is
proposed which might serve as a light-weight front end for XML Schema. Finally,
the effect of EDC and UPA on the way XML documents can be typed is discussed.
It is argued that a cleaner, more robust, stronger but equally efficient class
is obtained by replacing EDC and UPA with the notion of 1-pass preorder typing:
schemas that allow the type of an element in a streaming document to be determined
when its opening tag is met. This notion can be defined in terms of restrained
competition regular expressions and there is again an equivalent syntactical
formalism based on contextual patterns. Keywords: XML schema, expressiveness, formal model | |||
| WEESA: Web engineering for semantic Web applications | | BIBAK | Full-Text | 722-729 | |
| Gerald Reif; Harald Gall; Mehdi Jazayeri | |||
| The success of the Semantic Web crucially depends on the existence of Web
pages that provide machine-understandable meta-data. This meta-data is
typically added in the semantic annotation process which is currently not part
of the Web engineering process. Web engineering, however, proposes
methodologies to design, implement and maintain Web applications but lacks the
generation of meta-data. In this paper we introduce a technique to extend
existing Web engineering methodologies to develop semantically annotated Web
pages. The novelty of this approach is the definition of a mapping from XML
Schema to ontologies, called WEESA, that can be used to automatically generate
RDF meta-data from XML content documents. We further show how we integrated the
WEESA mapping into an Apache Cocoon transformer to easily extend XML based Web
applications to semantically annotated Web applications. Keywords: Web engineering, ontology, semantic Web, semantic annotation | |||
| A multi-threaded PIPELINED Web server architecture for SMP/SoC machines | | BIBAK | Full-Text | 730-739 | |
| Gyu Sang Choi; Jin-Ha Kim; Deniz Ersoz; Chita R. Das | |||
| Design of high performance Web servers has become a recent research thrust
to meet the increasing demand for network-based services. In this paper, we
propose a new Web server architecture, called multi-threaded PIPELINED Web
server, suitable for Symmetric Multi-Processor (SMP) or System-on-Chip (SoC)
architectures. The proposed PIPELINED model consists of multiple thread pools,
where each thread pool consists of five basic threads and two helper threads.
The main advantages of the proposed model are global information sharing by the
threads, minimal synchronization overhead due to a smaller number of threads, and
non-blocking I/O operations, possible with the helper threads.
We have conducted an in-depth performance analysis of the proposed server model along with four prior Web server models (Multi-Process (MP), Multi-Thread (MT), Single-Process Event-Driven (SPED) and Asynchronous Multi-Process Event-Driven (AMPED)) via simulation using six Web server workloads. The experiments are conducted to investigate the impact of various factors such as the memory size, disk speed and the number of clients. The simulation results indicate that the proposed PIPELINED Web server architecture shows the best performance across all system and workload parameters compared to the MP, MT, SPED and AMPED models. Although the MT and AMPED models show competitive performance with a smaller number of processors, the advantage of the PIPELINED model becomes obvious as the number of processors or clients in an SMP/SoC machine increases. The MP model shows the worst performance in most of the cases. The results indicate that the proposed server architecture can be used in future large-scale SMP/SoC machines to boost system performance. Keywords: asynchronous multi-process event-driven, multi-process, multi-thread, single
event-driven process, symmetric multi-processor, system-on-chip | |||
| Cataclysm: policing extreme overloads in internet applications | | BIBAK | Full-Text | 740-749 | |
| Bhuvan Urgaonkar; Prashant Shenoy | |||
| In this paper we present the Cataclysm server platform for handling extreme
overloads in hosted Internet applications. The primary contribution of our work
is to develop a low overhead, highly scalable admission control technique for
Internet applications. Cataclysm provides several desirable features, such as
guarantees on response time by conducting accurate size-based admission
control, revenue maximization at multiple time-scales via preferential
admission of important requests and dynamic capacity provisioning, and the
ability to be operational even under extreme overloads. Cataclysm can
transparently trade-off the accuracy of its decision making with the intensity
of the workload allowing it to handle incoming rates of several tens of
thousands of requests/second. We implement a prototype Cataclysm hosting
platform on a Linux cluster and demonstrate the benefits of our integrated
approach using a variety of workloads. Keywords: internet application, overload, sentry | |||
| Design for verification for asynchronously communicating Web services | | BIBAK | Full-Text | 750-759 | |
| Aysu Betin-Can; Tevfik Bultan; Xiang Fu | |||
| We present a design for verification approach to developing reliable web
services. We focus on composite web services which consist of asynchronously
communicating peers. Our goal is to automatically verify properties of
interactions among such peers. We propose a design pattern that eases the
development of such web services and enables a modular, assume-guarantee style
verification strategy. In the proposed design pattern, each peer is associated
with a behavioral interface description which specifies how that peer will
interact with other peers. Using these peer interfaces we automatically
generate BPEL specifications to publish for interoperability. Assuming that the
participating peers behave according to their interfaces, we verify safety and
liveness properties about the global behavior of the composite web service
during behavior verification. During interface verification, we check that each
peer implementation conforms to its interface. Using the modularity in the
proposed design pattern, we are able to perform the interface verification of
each peer and the behavior verification as separate steps. Our experiments show
that, using this modular approach, one can automatically and efficiently verify
web service implementations. Keywords: BPEL, asynchronous communication, composite web services, design patterns | |||