| What's new on the web?: the evolution of the web from a search engine perspective | | BIBAK | Full-Text | 1-12 | |
| Alexandros Ntoulas; Junghoo Cho; Christopher Olston | |||
| We seek to gain improved insight into how Web search engines should cope
with the evolving Web, in an attempt to provide users with the most up-to-date
results possible. For this purpose we collected weekly snapshots of some 150
Web sites over the course of one year, and measured the evolution of content and
link structure. Our measurements focus on aspects of potential interest to
search engine designers: the evolution of link structure over time, the rate of
creation of new pages and new distinct content on the Web, and the rate of
change of the content of existing pages under search-centric measures of degree
of change.
| Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines. Keywords: change prediction, degree of change, link structure evolution, rate of
change, search engines, web characterization, web evolution, web pages | |||
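The degree-of-change measure named in this abstract is TF.IDF-weighted cosine distance between successive snapshots of a page. A minimal sketch of that measure, assuming whitespace tokenization, a smoothed IDF, and a tiny made-up background corpus (none of which are necessarily the authors' exact choices):

```python
import math
from collections import Counter

def idf_table(corpus):
    """Smoothed inverse document frequency from a background collection of token lists."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log(1.0 + n / df[t]) for t in df}

def tfidf(tokens, idf):
    """Sparse TF.IDF vector; terms unseen in the background corpus fall back to weight 1.0."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 1.0) for t in tf}

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)

# Hypothetical background corpus plus two weekly snapshots of the same page.
corpus = [
    "web search engines crawl pages weekly".split(),
    "link structure evolves as pages are created and removed".split(),
    "new content appears on existing pages over time".split(),
]
idf = idf_table(corpus)
week1 = tfidf("web search engines crawl pages weekly".split(), idf)
week2 = tfidf("web search engines now crawl pages daily".split(), idf)
print(round(cosine_distance(week1, week2), 3))
```

A distance near 0 means the snapshot's term distribution is essentially unchanged; values near 1 indicate a near-complete content shift.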
| Understanding user goals in web search | | BIBAK | Full-Text | 13-19 | |
| Daniel E. Rose; Danny Levinson | |||
| Previous work on understanding user web search behavior has focused on how
people search and what they are searching for, but not why they are searching.
In this paper, we describe a framework for understanding the underlying goals
of user searches, and our experience in using the framework to manually
classify queries from a web search engine. Our analysis suggests that so-called
"navigational" searches are less prevalent than generally believed, while a
previously unexplored "resource-seeking" goal may account for a large fraction
of web searches. We also illustrate how this knowledge of user search goals
might be used to improve future web search engines. Keywords: information retrieval, query classification, user behavior, user goals, web
search | |||
| Impact of search engines on page popularity | | BIBAK | Full-Text | 20-29 | |
| Junghoo Cho; Sourashis Roy | |||
| Recent studies show that a majority of Web page accesses are referred by
search engines. In this paper we study the widespread use of Web search engines
and its impact on the ecology of the Web. In particular, we study how much
impact search engines have on the popularity evolution of Web pages. For
example, given that search engines return currently "popular" pages at the top
of search results, are we somehow penalizing newly created pages that are not
very well known yet? Are popular pages getting even more popular and new pages
completely ignored? We first show that this unfortunate trend indeed exists on
the Web through an experimental study based on real Web data. We then
analytically estimate how much longer it takes for a new page to attract a
large number of Web users when search engines return only popular pages at the
top of search results. Our result shows that search engines can have an
immensely worrisome impact on the discovery of new Web pages. Keywords: change in PageRank, PageRank, random surfer model, search engine's impact,
web evolution | |||
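The popularity model this study builds on (keywords: PageRank, random surfer model) can be illustrated with a plain PageRank power iteration. The toy graph, damping factor, and tolerance below are placeholder assumptions, not the paper's experimental setup:

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        converged = sum(abs(new[p] - rank[p]) for p in pages) < tol
        rank = new
        if converged:
            break
    return rank

# Toy graph: an established page versus a newly created page with a single in-link.
toy = {"old": ["hub"], "hub": ["old", "new"], "new": []}
print(pagerank(toy))
```

In a graph like this, the newly created page starts with a small rank and few in-links, which is the "rich-get-richer" effect whose consequences the paper quantifies.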
| Anti-aliasing on the web | | BIBAK | Full-Text | 30-39 | |
| Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins | |||
| It is increasingly common for users to interact with the web using a number
of different aliases. This trend is a double-edged sword. On one hand, it is a
fundamental building block in approaches to online privacy. On the other hand,
there are economic and social consequences to allowing each user an arbitrary
number of free aliases. Thus, there is great interest in understanding the
fundamental issues in obscuring the identities behind aliases.
However, most work in the area has focused on linking aliases through analysis of lower-level properties of interactions such as network routes. We show that aliases that actively post text on the web can be linked together through analysis of that text. We study a large number of users posting on bulletin boards, and develop algorithms to anti-alias those users: we can with a high degree of success identify when two aliases belong to the same individual. Our results show that such techniques are surprisingly effective, leading us to conclude that guaranteeing privacy among aliases that post actively requires mechanisms that do not yet exist. Keywords: alias detection, aliases, bulletin boards, personas, privacy, pseudonyms | |||
| Securing web application code by static analysis and runtime protection | | BIBAK | Full-Text | 40-52 | |
| Yao-Wen Huang; Fang Yu; Christian Hang; Chung-Hung Tsai; Der-Tsai Lee; Sy-Yen Kuo | |||
| Security remains a major roadblock to universal acceptance of the Web for
many kinds of transactions, especially since the recent sharp increase in
remotely exploitable vulnerabilities has been attributed to Web application
bugs. Many verification tools are discovering previously unknown
vulnerabilities in legacy C programs, raising hopes that the same success can
be achieved with Web applications. In this paper, we describe a sound and
holistic approach to ensuring Web application security. Viewing Web application
vulnerabilities as a secure information flow problem, we created a
lattice-based static analysis algorithm derived from type systems and
typestate, and addressed its soundness. During the analysis, sections of code
considered vulnerable are instrumented with runtime guards, thus securing Web
applications in the absence of user intervention. With sufficient annotations,
runtime overhead can be reduced to zero. We also created a tool named
WebSSARI (Web application Security by Static Analysis and Runtime Inspection) to test our algorithm, and used it to verify 230 open-source Web application projects on SourceForge.net, which were selected to represent projects of different maturity, popularity, and scale. Of these, 69 contained vulnerabilities. After notifying the developers, 38 acknowledged our findings and stated their plans to provide patches. Our statistics also show that static analysis reduced potential runtime overhead by 98.4%. Keywords: information flow, noninterference, program security, security
vulnerabilities, type systems, verification, web application security | |||
| Trust-serv: model-driven lifecycle management of trust negotiation policies for web services | | BIBAK | Full-Text | 53-62 | |
| Halvard Skogsrud; Boualem Benatallah; Fabio Casati | |||
| A scalable approach to trust negotiation is required in Web service
environments that have large and dynamic requester populations. We introduce
Trust-Serv, a model-driven trust negotiation framework for Web services. The
framework employs a model for trust negotiation that is based on state
machines, extended with security abstractions. Our policy model supports
lifecycle management, an important trait in the dynamic environments that
characterize Web services. In particular, we provide a set of change operations
to modify policies, and migration strategies that permit ongoing negotiations
to be migrated to new policies without being disrupted. Experimental results
show the performance benefit of these strategies. The proposed approach has
been implemented as a container-centric mechanism that is transparent to the
Web services and to the developers of Web services, simplifying Web service
development and management as well as enabling scalable deployments. Keywords: conceptual modeling, lifecycle management, trust negotiation, web services | |||
| SmartBack: supporting users in back navigation | | BIBAK | Full-Text | 63-71 | |
| Natasa Milic-Frayling; Rachel Jones; Kerry Rodden; Gavin Smyth; Alan Blackwell; Ralph Sommerer | |||
| This paper presents the design and user evaluation of SmartBack, a feature
that complements the standard Back button by enabling users to jump directly to
key pages in their navigation session, making common navigation activities more
efficient. Defining key pages was informed by the findings of a user study that
involved detailed monitoring of Web usage and analysis of Web browsing in terms
of navigation trails. The pages accessible through SmartBack are determined
automatically based on the structure of the user's navigation trails or page
association with specific user activities, such as search or browsing
bookmarked sites. We discuss implementation decisions and present results of a
usability study in which we deployed the SmartBack prototype and monitored
usage for a month in both corporate and home settings. The results show that
the feature brings qualitative improvement to the browsing experience of
individuals who use it. Keywords: back navigation, browsing, navigation, revisitation, usability study, web
trails, web usage | |||
| Web accessibility: a broader view | | BIBAK | Full-Text | 72-79 | |
| John T. Richards; Vicki L. Hanson | |||
| Web accessibility is an important goal. However, most approaches to its
attainment are based on unrealistic economic models in which Web content
developers are required to spend too much for which they receive too little. We
believe this situation is due, in part, to the overly narrow definitions given
both to those who stand to benefit from enhanced access to the Web and what is
meant by this enhanced access. In this paper, we take a broader view,
discussing a complementary approach that costs developers less and provides
greater advantages to a larger community of users. While we have quite specific
aims in our technical work, we hope it can also serve as an example of how the
technical conversation regarding Web accessibility can move beyond the narrow
confines of limited adaptations for small populations. Keywords: standards, user interface, web accessibility | |||
| Hearsay: enabling audio browsing on hypertext content | | BIBAK | Full-Text | 80-89 | |
| I. V. Ramakrishnan; Amanda Stent; Guizhen Yang | |||
| In this paper we present HearSay, a system for browsing hypertext Web
documents via audio. The HearSay system is based on our novel approach to
automatically creating audio browsable content from hypertext Web documents. It
combines two key technologies: (1) automatic partitioning of Web documents
through tightly coupled structural and semantic analysis, which transforms raw
HTML documents into semantic structures so as to facilitate audio browsing; and
(2) VoiceXML, an already standardized technology which we adopt to represent
voice dialogs automatically created from the XML output of partitioning. This
paper describes the software components of HearSay and presents an initial
system evaluation. Keywords: HTML, VoiceXML, World Wide Web, audio browser, semantic analysis, structural
analysis, user interface | |||
| Unsupervised learning of soft patterns for generating definitions from online news | | BIBAK | Full-Text | 90-99 | |
| Hang Cui; Min-Yen Kan; Tat-Seng Chua | |||
| Breaking news often contains timely definitions and descriptions of current
terms, organizations and personalities. We utilize such web sources to
construct definitions for such terms. Previous work has identified definitions
using hand-crafted rules or supervised learning that constructs rigid, hard
text patterns. In contrast, we demonstrate a new approach that uses flexible,
soft matching patterns to characterize definition sentences. Our soft patterns
are able to effectively accommodate the diversity of definition sentence
structure exhibited in news. We use pseudo-relevance feedback to automatically
label sentences for use in soft pattern generation. The application of our
unsupervised method significantly improves baseline systems on both the
standardized TREC corpus as well as crawled online news articles by 27% and
30%, respectively, in terms of F measure. When applied to a state-of-the-art
definition generation system recently fielded in the TREC 2003 definitional
question answering task, it improves the performance by 14%. Keywords: definition generation, definitional question answering, pseudo-relevance
feedback, soft patterns, unsupervised learning | |||
| Web-scale information extraction in knowitall: (preliminary results) | | BIBAK | Full-Text | 100-110 | |
| Oren Etzioni; Michael Cafarella; Doug Downey; Stanley Kok; Ana-Maria Popescu; Tal Shaked; Stephen Soderland; Daniel S. Weld; Alexander Yates | |||
| Manually querying search engines in order to accumulate a large body of
factual information is a tedious, error-prone process of piecemeal search.
Search engines retrieve and rank potentially relevant documents for human
perusal, but do not extract facts, assess confidence, or fuse information from
multiple documents. This paper introduces KnowItAll, a system that aims to
automate the tedious process of extracting large collections of facts from the
web in an autonomous, domain-independent, and scalable manner.
The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems. Keywords: information extraction, mutual information, pmi, search | |||
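KnowItAll's keywords mention pointwise mutual information (PMI) computed over search-engine statistics to attach a probability to each extracted fact. The sketch below uses a textbook PMI formulation with made-up hit counts standing in for live search-engine queries; it is not necessarily the exact statistic the system computes:

```python
import math

def pmi_score(hits_both, hits_fact, hits_discriminator, total_pages):
    """Pointwise mutual information between a candidate fact and a discriminator
    phrase, estimated from (hypothetical) search-engine hit counts."""
    if min(hits_both, hits_fact, hits_discriminator) == 0:
        return float("-inf")
    p_both = hits_both / total_pages
    p_fact = hits_fact / total_pages
    p_disc = hits_discriminator / total_pages
    return math.log(p_both / (p_fact * p_disc))

# Made-up counts: how strongly does a candidate fact co-occur with a discriminator phrase?
print(pmi_score(hits_both=250_000, hits_fact=30_000_000,
                hits_discriminator=5_000_000, total_pages=4_000_000_000))
```

A strongly positive score indicates the fact and the discriminator co-occur far more often than chance, which is the signal used to trade off precision against recall.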
| Is question answering an acquired skill? | | BIBAK | Full-Text | 111-120 | |
| Ganesh Ramakrishnan; Soumen Chakrabarti; Deepa Paranjpe; Pushpak Bhattacharya | |||
| We present a question answering (QA) system which learns how to detect and
rank answer passages by analyzing questions and their answers (QA pairs)
provided as training data. We built our system in only a few person-months
using off-the-shelf components: a part-of-speech tagger, a shallow parser, a
lexical network, and a few well-known supervised learning algorithms. In
contrast, many of the top TREC QA systems are large group efforts, using
customized ontologies, question classifiers, and highly tuned ranking
functions. Our ease of deployment arises from using generic, trainable
algorithms that exploit simple feature extractors on QA pairs. With TREC QA
data, our system achieves mean reciprocal rank (MRR) that compares favorably
with the best scores in recent years, and generalizes from one corpus to
another. Our key technique is to recover, from the question, fragments of what
might have been posed as a structured query, had a suitable schema been
available. One fragment comprises selectors: tokens that are likely to appear (almost)
unchanged in an answer passage. The other fragment contains question tokens
which give clues about the answer type, and are expected to be replaced in the
answer passage by tokens which specialize or instantiate the desired answer
type. Selectors are like constants in where-clauses in relational queries, and
answer types are like column names. We present new algorithms for locating
selectors and answer type clues and using them in scoring passages with respect
to a question. Keywords: machine learning, question answering | |||
| Session level techniques for improving web browsing performance on wireless links | | BIBAK | Full-Text | 121-130 | |
| Pablo Rodriguez; Sarit Mukherjee; Sampath Rangarajan | |||
| Recent observations through experiments that we have performed in current
third generation wireless networks have revealed that the achieved throughput
over wireless links varies widely depending on the application. In particular,
the throughputs achieved by file transfer (FTP) and web browsing (HTTP)
applications are quite different. The throughput achieved over an HTTP
session is much lower than that achieved over an FTP session. The reason for
the lower HTTP throughput is that the HTTP protocol is affected by the large
Round-Trip Time (RTT) across wireless links. HTTP transfers require multiple
TCP connections and DNS lookups before an HTTP page can be displayed. Each TCP
connection requires several RTTs to fully open the TCP send window and each DNS
lookup requires several RTTs before resolving the domain name to IP mapping.
These TCP/DNS RTTs significantly degrade the performance of HTTP over wireless
links. To overcome these problems, we have developed session level optimization
techniques to enhance HTTP download mechanisms. These techniques (a) minimize
the number of DNS lookups over the wireless link and (b) minimize the number of
TCP connections opened by the browser. These optimizations bridge the mismatch
caused by wireless links between application-level protocols (such as HTTP) and
transport-level protocols (such as TCP). Our solutions do not require any
client-side software and can be deployed transparently on a service provider
network to provide a 30-50% decrease in end-to-end user-perceived latency and
a 50-100% increase in data throughput across wireless links for HTTP sessions. Keywords: optimizations, web, wireless | |||
| Flexible on-device service object replication with replets | | BIBAK | Full-Text | 131-142 | |
| Dong Zhou; Nayeem Islam; Ali Ismael | |||
| An increasingly large number of Web applications employ service objects such
as Servlets to generate dynamic and personalized content. Existing caching
infrastructures are not well suited for caching such content in mobile
environments because of disconnection and weak connectivity. One possible
approach to this problem is to replicate Web-related application logic to
client devices. The challenges to this approach are to deal with client devices
that exhibit huge divergence in resource availabilities, to support
applications that have different data sharing and coherency requirements, and
to accommodate the same application under different deployment environments.
The Replet system targets these challenges. It uses client, server and application capability and preference information (CPI) to direct the replication of service objects to client devices: from the selection of a device for replication and populating the device with client-specific data, to choosing an appropriate replica to serve a given request and maintaining the desired state consistency among replicas. The Replet system exploits on-device replication to enable client-, server- and application-specific cost metrics for replica invocation and synchronization. We have implemented a prototype in the context of Servlet-based Web applications. Our experiment and simulation results demonstrate the viability and significant benefits of CPI-driven on-device service object replication. Keywords: capability, preference, reconfiguration, replication, service,
synchronization | |||
| Improving web browsing performance on wireless pdas using thin-client computing | | BIBAK | Full-Text | 143-154 | |
| Albert M. Lai; Jason Nieh; Bhagyashree Bohra; Vijayarka Nandikonda; Abhishek P. Surana; Suchita Varshneya | |||
| Web applications are becoming increasingly popular for mobile wireless PDAs.
However, web browsing on these systems can be quite slow. An alternative
approach is handheld thin-client computing, in which the web browser and
associated application logic run on a server, which then sends simple screen
updates to the PDA for display. To assess the viability of this thin-client
approach, we compare the web browsing performance of thin clients against fat
clients that run the web browser locally on a PDA. Our results show that thin
clients can provide better web browsing performance compared to fat clients,
both in terms of speed and ability to correctly display web content.
Surprisingly, thin clients are faster even when they must send more data over
the network. We characterize and analyze different design choices in various
thin-client systems and explain why these approaches can yield superior web
browsing performance on mobile wireless PDAs. Keywords: thin-client computing, web performance, wireless and mobility | |||
| XVM: a bridge between XML data and its behavior | | BIBAK | Full-Text | 155-163 | |
| Quanzhong Li; Michelle Y. Kim; Edward So; Steve Wood | |||
| XML has become one of the core technologies for contemporary business
applications, especially web-based applications. To facilitate processing of
diverse XML data, we propose an extensible, integrated XML processing
architecture, the XML Virtual Machine (XVM), which connects XML data with their
behaviors. At the same time, the XVM is also a framework for developing and
deploying XML-based applications. Using component-based techniques, the XVM
supports arbitrary granularity and provides a high degree of modularity and
reusability. XVM components are dynamically loaded and composed during XML data
processing. Using the XVM, both client-side and server-side XML applications
can be developed and deployed in an integrated way. We also present an XML
application container built on top of the XVM along with several sample
applications to demonstrate the applicability of the XVM framework. Keywords: XML, XML applications, XML processing, XVM, components, web applications | |||
| SchemaPath, a minimal extension to XML schema for conditional constraints | | BIBAK | Full-Text | 164-174 | |
| Claudio Sacerdoti Coen; Paolo Marinelli; Fabio Vitali | |||
| In the past few years, a number of constraint languages for XML documents
have been proposed. They are cumulatively called schema languages or validation
languages and they comprise, among others, DTD, XML Schema, RELAX NG,
Schematron, DSD, xlinkit. One major point of discrimination among schema
languages is the support of co-constraints, or co-occurrence constraints, e.g.,
requiring that attribute A is present if and only if attribute B is (or is not)
present in the same element. Although there is no way in XML Schema to express
these requirements, they are in fact frequently used in many XML document
types, usually only expressed in plain human-readable text, and validated by
means of special code modules by the relevant applications. In this paper we
propose SchemaPath, a light extension of XML Schema to handle conditional
constraints on XML documents. Two new constructs have been added to XML Schema:
conditions -- based on XPath patterns -- on type assignments for elements and
attributes; and a new simple type, xsd:error, for the direct expression of
negative constraints (e.g. it is prohibited for attribute A to be present if
attribute B is also present). A proof-of-concept implementation is provided. A
Web interface is publicly accessible for experiments and assessments of the
real expressiveness of the proposed extension. Keywords: co-constraints, schema languages, SchemaPath, XML | |||
| Composite events for XML | | BIBAK | Full-Text | 175-183 | |
| Martin Bernauer; Gerti Kappel; Gerhard Kramler | |||
| Recently, active behavior has received attention in the XML field as a way to
react automatically to events. Aside from proprietary approaches for
enriching XML with active behavior, the W3C standardized the Document Object
Model (DOM) Event Module for the detection of events in XML documents. When
using any of these approaches, however, it is often impossible to decide which
event to react upon because not a single event but a combination of multiple
events, i.e., a composite event, determines a situation to react upon. The paper
presents the first approach for detecting composite events in XML documents by
addressing the peculiarities of XML events which are caused by their
hierarchical order in addition to their temporal order. It also provides for
the detection of satisfied multiplicity constraints defined by XML schemas.
Thereby the approach enables applications operating on XML documents to react
to composite events which have richer semantics. Keywords: active behavior, composite event, event algebra, event-condition-action
rule, XML | |||
| LiveClassifier: creating hierarchical text classifiers through web corpora | | BIBAK | Full-Text | 184-192 | |
| Chien-Chung Huang; Shui-Lung Chuang; Lee-Feng Chien | |||
| Many Web information services utilize techniques of information extraction
(IE) to collect important facts from the Web. To create more advanced services,
one possible method is to discover thematic information from the collected
facts through text classification. However, most conventional text
classification techniques rely on manually labelled corpora and are thus
ill-suited to cooperating with open-domain Web information services. In
this work, we present a system named LiveClassifier that can automatically
train classifiers through Web corpora based on user-defined topic hierarchies.
Due to its flexibility and convenience, LiveClassifier can be easily adapted
for various purposes. New Web information services can be created to fully
exploit it; human users can use it to create classifiers for their personal
applications. The effectiveness of classifiers created by LiveClassifier is
well supported by empirical evidence. Keywords: text classification, topic hierarchy, web mining | |||
| Using urls and table layout for web classification tasks | | BIBAK | Full-Text | 193-202 | |
| L. K. Shih; D. R. Karger | |||
| We propose new features and algorithms for automating Web-page
classification tasks such as content recommendation and ad blocking. We show
that the automated classification of Web pages can be much improved if, instead
of looking at their textual content, we consider each link's URL and the
visual placement of those links on a referring page. These features are
unusual: rather than being scalar measurements like word counts, they are
tree-structured -- describing the position of the item in a tree. We develop a model
and algorithm for machine learning using such tree-structured features. We
apply our methods in automated tools for recognizing and blocking Web
advertisements and for recommending "interesting" news stories to a reader.
Experiments show that our algorithms are both faster and more accurate than
those based on the text content of Web documents. Keywords: classification, news recommendation, tree structures, web applications | |||
| Learning block importance models for web pages | | BIBAK | Full-Text | 203-211 | |
| Ruihua Song; Haifeng Liu; Ji-Rong Wen; Wei-Ying Ma | |||
| Previous work shows that a web page can be partitioned into multiple
segments or blocks, and often the importance of those blocks in a page is not
equivalent. Also, it has been proven that differentiating noisy or unimportant
blocks from pages can facilitate web mining, search and accessibility. However,
no uniform approach and model has been presented to measure the importance of
different segments in web pages. Through a user study, we found that people do
have a consistent view about the importance of blocks in web pages. In this
paper, we investigate how to find a model to automatically assign importance
values to blocks in a web page. We define the block importance estimation as a
learning problem. First, we use a vision-based page segmentation algorithm to
partition a web page into semantic blocks with a hierarchical structure. Then
spatial features (such as position and size) and content features (such as the
number of images and links) are extracted to construct a feature vector for
each block. Based on these features, learning algorithms are used to train a
model to assign importance to different segments in the web page. In our
experiments, the best model achieves a Micro-F1 of 79% and a
Micro-Accuracy of 85.9%, which is quite close to a person's view. Keywords: block importance model, classification, page segmentation, web mining | |||
| Staging transformations for multimodal web interaction management | | BIBAK | Full-Text | 212-223 | |
| Michael Narayan; Christopher Williams; Saverio Perugini; Naren Ramakrishnan | |||
| Multimodal interfaces are becoming increasingly ubiquitous with the advent
of mobile devices, accessibility considerations, and novel software
technologies that combine diverse interaction media. In addition to improving
access and delivery capabilities, such interfaces enable flexible and
personalized dialogs with websites, much like a conversation between humans. In
this paper, we present a software framework for multimodal web interaction
management that supports mixed-initiative dialogs between users and websites. A
mixed-initiative dialog is one where the user and the website take turns
changing the flow of interaction. The framework supports the functional
specification and realization of such dialogs using staging transformations --
a theory for representing and reasoning about dialogs based on partial input.
It supports multiple interaction interfaces, and offers sessioning, caching,
and co-ordination functions through the use of an interaction manager. Two case
studies are presented to illustrate the promise of this approach. Keywords: mixed-initiative interaction, out-of-turn interaction, partial evaluation,
program transformations, web dialogs | |||
| Enforcing strict model-view separation in template engines | | BIBAK | Full-Text | 224-233 | |
| Terence John Parr | |||
| The mantra of every experienced web application developer is the same: thou
shalt separate business logic from display. Ironically, almost all template
engines allow violation of this separation principle, which is the very impetus
for HTML template engine development. This situation is due mostly to a lack of
formal definition of separation and fear that enforcing separation emasculates
a template's power. I show that not only is strict separation a worthy design
principle, but that we can enforce separation while providing a potent template
engine. I demonstrate my StringTemplate engine, used to build jGuru.com and
other commercial sites, at work solving some nontrivial generational tasks.
My goal is to formalize the study of template engines, thus providing a common nomenclature, a means of classifying template generational power, and a way to leverage interesting results from formal language theory. I classify three types of restricted templates analogous to Chomsky's type 1..3 grammar classes and formally define separation, including the rules that embody separation. Because this paper provides a clear definition of model-view separation, template engine designers may no longer blindly claim enforcement of separation. Moreover, given theoretical arguments and empirical evidence, programmers no longer have an excuse to entangle model and view. Keywords: model-view-controller, template engine, web application | |||
| A flexible framework for engineering "my" portals | | BIBAK | Full-Text | 234-243 | |
| Fernando Bellas; Daniel Fernández; Abel Muiño | |||
| There exist many portal servers that support the construction of "My"
portals, that is, portals that allow the user to have one or more personal pages
composed of a number of personalizable services. The main drawback of current
portal servers is their lack of generality and adaptability. This paper
presents the design of MyPersonalizer, a J2EE-based framework for engineering My
portals. The framework is structured according to the Model-View-Controller and
Layers architectural patterns, providing generic, adaptable model and controller
layers that implement the typical use cases of a My portal. MyPersonalizer
allows for a good separation of roles in the development team: graphical
designers (without programming skills) develop the portal view by writing JSP
pages, while software engineers implement service plugins and specify framework
configuration. Keywords: design patterns, j2ee, portal technology, web application frameworks and
architectures, web engineering | |||
| Semantic email | | BIBAK | Full-Text | 244-254 | |
| Luke McDowell; Oren Etzioni; Alon Halevy; Henry Levy | |||
| This paper investigates how the vision of the Semantic Web can be carried
over to the realm of email. We introduce a general notion of semantic email, in
which an email message consists of an RDF query or update coupled with
corresponding explanatory text. Semantic email opens the door to a wide range
of automated, email-mediated applications with formally guaranteed properties.
In particular, this paper introduces a broad class of semantic email processes.
For example, consider the process of sending an email to a program committee
asking who will attend the PC dinner, automatically collecting the responses, and
tallying them up. We define both logical and decision-theoretic models where an
email process is modeled as a set of updates to a data set on which we specify
goals via certain constraints or utilities. We then describe a set of inference
problems that arise while trying to satisfy these goals and analyze their
computational tractability. In particular, we show that, for the logical model, it
is possible to automatically infer which email responses are acceptable w.r.t.
a set of constraints in polynomial time, and that, for the decision-theoretic model, it
is possible to compute the optimal message-handling policy in polynomial time.
Finally, we discuss our publicly available implementation of semantic email and
outline research challenges in this realm. Keywords: decision-theoretic, formal model, satisfiability, semantic web | |||
| How to make a semantic web browser | | BIBAK | Full-Text | 255-265 | |
| D. A. Quan; R. Karger | |||
| Two important architectural choices underlie the success of the Web:
numerous, independently operated servers speak a common protocol, and a single
type of client, the Web browser, provides point-and-click access to the content
and services on these decentralized servers. However, because HTML marries
content and presentation into a single representation, end users are often
stuck with inappropriate choices made by the Web site designer of how to work
with and view the content. RDF metadata on the Semantic Web does not have this
limitation: users can gain direct access to information and control over how it
is presented. This principle forms the basis for our Semantic Web browser, an
end user application that automatically locates metadata and assembles
point-and-click interfaces from a combination of relevant information,
ontological specifications, and presentation knowledge, all described in RDF
and retrieved dynamically from the Semantic Web. Because data and services are
accessed directly through a standalone client and not through a central point
of access (e.g., a portal), new content and services can be consumed as soon as
they become available. In this way we take advantage of an important
sociological force that encourages the production of new Semantic Web content
while remaining faithful to the decentralized nature of the Web. Keywords: bioinformatics, rdf, semantic web, user interface, web services | |||
| Parsing owl dl: trees or triples? | | BIBAK | Full-Text | 266-275 | |
| Sean K. Bechhofer; Jeremy J. Carroll | |||
| The Web Ontology Language (OWL) defines three classes of documents: Lite,
DL, and Full. All RDF/XML documents are OWL Full documents, some OWL Full
documents are also OWL DL documents, and some OWL DL documents are also OWL
Lite documents. This paper discusses parsing and species recognition -- that is,
the process of determining whether a given document falls into the OWL Lite, DL
or Full class. We describe two alternative approaches to this task, one based
on abstract syntax trees, the other on RDF triples, and compare their key
characteristics. Keywords: owl, parsing, rdf, semantic web | |||
| A method for transparent admission control and request scheduling in e-commerce web sites | | BIBAK | Full-Text | 276-286 | |
| Sameh Elnikety; Erich Nahum; John Tracey; Willy Zwaenepoel | |||
| This paper presents a method for admission control and request scheduling
for multiply-tiered e-commerce Web sites, achieving both stable behavior during
overload and improved response times. Our method externally observes execution
costs of requests online, distinguishing different request types, and performs
overload protection and preferential scheduling using relatively simple
measurements and a straightforward control mechanism. Unlike previous
proposals, which require extensive changes to the server or operating system,
our method requires no modifications to the host O.S., Web server, application
server or database. Since our method is external, it can be implemented in a
proxy. We present such an implementation, called Gatekeeper, using it with
standard software components on the Linux operating system. We evaluate the
proxy using the industry standard TPC-W workload generator in a typical
three-tiered e-commerce environment. We show consistent performance during
overload and throughput increases of up to 10 percent. Response time improves
by up to a factor of 14, with only a 15 percent penalty to large jobs. Keywords: admission control, dynamic web content, load control, request scheduling,
web servers | |||
| A smart hill-climbing algorithm for application server configuration | | BIBAK | Full-Text | 287-296 | |
| Bowei Xi; Zhen Liu; Mukund Raghavachari; Cathy H. Xia; Li Zhang | |||
| The overwhelming success of the Web as a mechanism for facilitating
information retrieval and for conducting business transactions has led to an
increase in the deployment of complex enterprise applications. These
applications typically run on Web Application Servers, which assume the burden
of managing many tasks, such as concurrency, memory management, database
access, etc., required by these applications. The performance of an Application
Server depends heavily on appropriate configuration. Configuration is a
difficult and error-prone task due to the large number of configuration
parameters and complex interactions between them. We formulate the problem of
finding an optimal configuration for a given application as a black-box
optimization problem. We propose a smart hill-climbing algorithm using ideas of
importance sampling and Latin Hypercube Sampling (LHS). The algorithm is
efficient in both searching and random sampling. It consists of estimating a
local function and then hill-climbing in the steepest descent direction. The
algorithm also learns from past searches and restarts in a smart and selective
fashion using the idea of importance sampling. We have carried out extensive
experiments with an on-line brokerage application running in a WebSphere
environment. Empirical results demonstrate that our algorithm is more efficient
than and superior to traditional heuristic methods. Keywords: automatic tuning, gradient method, importance sampling, simulated annealing,
system configuration | |||
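The configuration search described here seeds its hill-climbing with Latin Hypercube Sampling (LHS) of the parameter space. The snippet below sketches only the LHS step; the parameter names, ranges, and sample count are illustrative assumptions rather than the paper's setup:

```python
import random

def latin_hypercube(bounds, n_samples, rng=random):
    """Draw n_samples points from the box `bounds` = {param: (low, high)} so that each
    parameter's range is split into n_samples strata and each stratum is hit exactly once."""
    samples = [{} for _ in range(n_samples)]
    for param, (low, high) in bounds.items():
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across parameters
        width = (high - low) / n_samples
        for i, stratum in enumerate(strata):
            samples[i][param] = low + (stratum + rng.random()) * width
    return samples

# Hypothetical application-server knobs (names and ranges are made up).
bounds = {"thread_pool_size": (10, 200), "db_connections": (5, 100), "heap_mb": (256, 2048)}
for point in latin_hypercube(bounds, n_samples=5):
    print({k: round(v, 1) for k, v in point.items()})
```

Each returned point lands in a distinct stratum in every dimension, so even a handful of samples covers each parameter's range more evenly than plain uniform sampling, which is what makes the restarts cheap.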
| Challenges and practices in deploying web acceleration solutions for distributed enterprise systems | | BIBAK | Full-Text | 297-308 | |
| Wen-Syan Li; Wang-Pin Hsiung; Oliver Po; Koji Hino; Kasim Selcuk Candan; Divyakant Agrawal | |||
| For most Web-based applications, contents are created dynamically based on
the current state of a business, such as product prices and inventory, stored
in database systems. These applications demand personalized content and track
user behavior while maintaining application integrity. Many such practices
are not compatible with Web acceleration solutions. Consequently, although many
web acceleration solutions have shown promising performance improvement and
scalability, architecting and engineering distributed enterprise Web
applications to utilize available content delivery networks remains a
challenge. In this paper, we examine the challenge of accelerating J2EE-based
enterprise web applications. We list obstacles and recommend practices for
transforming typical database-driven J2EE applications into cache-friendly Web
applications where Web acceleration solutions can be applied. Furthermore, such
transformation should be done without modification to the underlying
application business logic and without sacrificing functions that are essential
to e-commerce. We take the J2EE reference software, the Java PetStore, as a
case study. By using the proposed guideline, we are able to cache more than 90%
of the content in the PetStore and scale up the Web site more than 20 times. Keywords: application server, dynamic content, edge server, fragment, j2ee,
reliability, scalability, web acceleration | |||
| Ranking the web frontier | | BIBAK | Full-Text | 309-318 | |
| Nadav Eiron; Kevin S. McCurley; John A. Tomlin | |||
| The celebrated PageRank algorithm has proved to be a very effective paradigm
for ranking results of web search algorithms. In this paper we refine this
basic paradigm to take into account several evolving prominent features of the
web, and propose several algorithmic innovations. First, we analyze features of
the rapidly growing "frontier" of the web, namely the part of the web that
crawlers are unable to cover for one reason or another. We analyze the effect
of these pages and find it to be significant. We suggest ways to improve the
quality of ranking by modeling the growing presence of "link rot" on the web as
more sites and pages fall out of maintenance. Finally we suggest new methods of
ranking that are motivated by the hierarchical structure of the web, are more
efficient than PageRank, and may be more resistant to direct manipulation. Keywords: hypertext, PageRank, ranking | |||
| Link fusion: a unified link analysis framework for multi-type interrelated data objects | | BIBAK | Full-Text | 319-327 | |
| Wensi Xi; Benyu Zhang; Zheng Chen; Yizhou Lu; Shuicheng Yan; Wei-Ying Ma; Edward Allan Fox | |||
| Web link analysis has proven to be a significant enhancement for quality
based web search. Most existing links can be classified into two categories:
intra-type links (e.g., web hyperlinks), which represent the relationship of
data objects within a homogeneous data type (web pages), and inter-type links
(e.g., user browsing log) which represent the relationship of data objects
across different data types (users and web pages). Unfortunately, most link
analysis research only considers one type of link. In this paper, we propose a
unified link analysis framework, called "link fusion", which considers both the
inter- and intra-type link structure among multi-type interrelated data
objects and brings order to objects in each data type at the same time. The
PageRank and HITS algorithms are shown to be special cases of our unified link
analysis framework. Experiments on an instantiation of the framework that makes
use of the user data and web pages extracted from a proxy log show that our
proposed algorithm could improve the search effectiveness over the HITS and
DirectHit algorithms by 24.6% and 38.2% respectively. Keywords: data fusion, information retrieval, link analysis algorithms, link fusion | |||
| Sic transit gloria telae: towards an understanding of the web's decay | | BIBAK | Full-Text | 328-337 | |
| Ziv Bar-Yossef; Andrei Z. Broder; Ravi Kumar; Andrew Tomkins | |||
| The rapid growth of the web has been noted and tracked extensively. Recent
studies have, however, documented the dual phenomenon: web pages have small
half-lives, and thus the web exhibits rapid death as well. Consequently, page
creators are faced with an increasingly burdensome task of keeping links
up-to-date, and many are falling behind. In addition to just individual pages,
collections of pages or even entire neighborhoods of the web exhibit
significant decay, rendering them less effective as information resources. Such
neighborhoods are identified only by frustrated searchers, seeking a way out of
these stale neighborhoods, back to more up-to-date sections of the web;
measuring the decay of a page purely on the basis of dead links on the page is
too naive to reflect this frustration. In this paper we formalize a strong
notion of a decay measure and present algorithms for computing it efficiently.
We explore this measure by presenting a number of validations, and use it to
identify interesting artifacts on today's web. We then describe a number of
applications of such a measure to search engines, web page maintainers,
ontologists, and individual users. Keywords: 404 return code, dead links, link analysis, web decay, web information
retrieval | |||
| Using link analysis to improve layout on mobile devices | | BIBAK | Full-Text | 338-344 | |
| Xinyi Yin; Wee Sun Lee | |||
| Delivering web pages to mobile phones or personal digital assistants has
become possible with the latest wireless technology. However, mobile devices
have very small screen sizes and memory capacities. Converting web pages for
delivery to a mobile device is an exciting new problem. In this paper, we
propose to use a ranking algorithm similar to Google's PageRank algorithm to
rank the content objects within a web page. This allows the extraction of only
important parts of web pages for delivery to mobile devices. Experiments show
that the new method is effective. In experiments on pages from randomly
selected websites, the system needed to extract and deliver only 39% of the
objects in a web page in order to provide 85% of a viewer's desired viewing
content. This provides significant savings in the wireless traffic and
downloading time while providing a satisfactory reading experience on the
mobile device. Keywords: html, link analysis, pda (personal digital assistant), www (world wide web) | |||
| An evaluation of binary XML encoding optimizations for fast stream based XML processing | | BIBAK | Full-Text | 345-354 | |
| R. J. Bayardo; D. Gruhl; V. Josifovski; J. Myllymaki | |||
| This paper provides an objective evaluation of the performance impacts of
binary XML encodings, using a fast stream-based XQuery processor as our
representative application. Instead of proposing one binary format and
comparing it against standard XML parsers, we investigate the individual
effects of several binary encoding techniques that are shared by many
proposals. Our goal is to provide a deeper understanding of the performance
impacts of binary XML encodings in order to clarify the ongoing and often
contentious debate over their merits, particularly in the domain of high
performance XML stream processing. Keywords: XML binary formats, XPath processing | |||
| Optimization of html automatically generated by wysiwyg programs | | BIBAK | Full-Text | 355-364 | |
| Jacqueline Spiesser; Les Kitchen | |||
| Automatically generated HTML, as produced by WYSIWYG programs, typically
contains much repetitive and unnecessary markup. This paper identifies aspects
of such HTML that may be altered while leaving a semantically equivalent
document, and proposes techniques for making such optimizing modifications. These
techniques include attribute re-arrangement via dynamic programming, the use of
style classes, and dead-code removal. These techniques produce documents as
small as 33% of the original size. The size decreases obtained are still
significant when the techniques are used in combination with conventional
text-based compression. Keywords: dynamic programming, haskell, html optimization, wysiwyg | |||
| Building a companion website in the semantic web | | BIBAK | Full-Text | 365-373 | |
| Timothy J. Miles-Board; Christopher P. Bailey; Wendy Hall; Leslie A. Carr | |||
| A problem facing many textbook authors (including one of the authors of this
paper) is the inevitable delay between new advances in the subject area and
their incorporation in a new (paper) edition of the textbook. This means that
some textbooks are quickly considered out of date, particularly in active
technological areas such as the Web, even though the ideas presented in the
textbook are still valid and important to the community. This paper describes
our approach to building a companion website for the textbook Hypermedia and
the Web: An Engineering Approach. We use Bloom's taxonomy of educational
objectives to critically evaluate a number of authoring and presentation
techniques used in existing companion websites, and adapt these techniques to
create our own companion website using Semantic Web technologies in order to
overcome the identified weaknesses. Finally, we discuss a potential model of
future companion websites, in the context of an e-publishing, e-commerce
Semantic Web services scenario. Keywords: bloom's taxonomy, companion website, electronic publishing, semantic web,
textbook | |||
| A hybrid approach for searching in the semantic web | | BIBAK | Full-Text | 374-383 | |
| Cristiano Rocha; Daniel Schwabe; Marcus Poggi Aragao | |||
| This paper presents a search architecture that combines classical search
techniques with spread activation techniques applied to a semantic model of a
given domain. Given an ontology, weights are assigned to links based on certain
properties of the ontology, so that they measure the strength of the relation.
Spread activation techniques are used to find related concepts in the ontology
given an initial set of concepts and corresponding initial activation values.
These initial values are obtained from the results of classical search applied
to the data associated with the concepts in the ontology. Two test cases were
implemented, with very positive results. It was also observed that the proposed
hybrid spread activation, combining the symbolic and the sub-symbolic
approaches, achieved better results when compared to each of the approaches
alone. Keywords: network analysis, ontologies, semantic associations, semantic search,
semantic web, spread activation algorithms | |||
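A generic sketch of the spread-activation step this hybrid search builds on: seed concepts receive initial activation from a keyword search, and activation then flows along weighted ontology links. The toy graph, decay factor, and threshold are assumptions for illustration only, not the paper's weighting scheme:

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.01, max_hops=3):
    """graph: {concept: [(neighbor, link_weight)]}; seeds: {concept: initial activation}.
    Returns accumulated activation per concept after propagating through the ontology."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):
        next_frontier = {}
        for node, act in frontier.items():
            for neighbor, weight in graph.get(node, []):
                spread = act * weight * decay
                if spread < threshold:  # prune negligible contributions
                    continue
                activation[neighbor] = activation.get(neighbor, 0.0) + spread
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + spread
        if not next_frontier:
            break
        frontier = next_frontier
    return activation

# Toy ontology fragment with hypothetical link strengths.
ontology = {
    "semantic web": [("ontology", 0.9), ("rdf", 0.8)],
    "ontology": [("owl", 0.7)],
    "rdf": [("owl", 0.6), ("triple store", 0.5)],
}
print(spread_activation(ontology, {"semantic web": 1.0}))
```

Concepts reached over several strong links accumulate more activation than those reached over a single weak link, which is how related concepts get surfaced beyond the initial keyword matches.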
| CS AKTive space: representing computer science in the semantic web | | BIBAK | Full-Text | 384-392 | |
| m. c. schraefel; Nigel R. Shadbolt; Nicholas Gibbins; Stephen Harris; Hugh Glaser | |||
| We present a Semantic Web application that we call CS AKTive Space. The
application exploits a wide range of semantically heterogeneous and distributed
content relating to Computer Science research in the UK. This content is
gathered on a continuous basis using a variety of methods including harvesting
and scraping, as well as adopting a range of models for content acquisition. The
content currently comprises around ten million RDF triples and we have
developed storage, retrieval and maintenance methods to support its management.
The content is mediated through an ontology constructed for the application
domain and incorporates components from other published ontologies. CS AKTive
Space supports the exploration of patterns and implications inherent in the
content and exploits a variety of visualisations and multi-dimensional
representations. Knowledge services supported in the application include
investigating communities of practice: who is working, researching or
publishing with whom. This work illustrates a number of substantial challenges
for the Semantic Web. These include problems of referential integrity,
tractable inference and interaction support. We review our approaches to these
issues and discuss relevant related work. Keywords: ontologies, semantic web, semantic web challenge groups | |||
| Shilling recommender systems for fun and profit | | BIBAK | Full-Text | 393-402 | |
| Shyong K. Lam; John Riedl | |||
| Recommender systems have emerged in the past several years as an effective
way to help people cope with the problem of information overload. One
application in which they have become particularly common is in e-commerce,
where recommendation of items can often help a customer find what she is
interested in and, therefore, can help drive sales. Unscrupulous producers in
the never-ending quest for market penetration may find it profitable to shill
recommender systems by lying to the systems in order to have their products
recommended more often than those of their competitors. This paper explores
four open questions that may affect the effectiveness of such shilling attacks:
which recommender algorithm is being used, whether the application is producing
recommendations or predictions, how detectable the attacks are by the operator
of the system, and what the properties are of the items being attacked. The
questions are explored experimentally on a large data set of movie ratings.
Taken together, the results of the paper suggest that new ways must be used to
evaluate and detect shilling attacks on recommender systems. Keywords: collaborative filtering, recommender systems, shilling | |||
| Propagation of trust and distrust | | BIBAK | Full-Text | 403-412 | |
| R. Guha; Ravi Kumar; Prabhakar Raghavan; Andrew Tomkins | |||
| A (directed) network of people connected by ratings or trust scores, and a
model for propagating those trust scores, is a fundamental building block in
many of today's most successful e-commerce and recommendation systems. We
develop a framework of trust propagation schemes, each of which may be
appropriate in certain circumstances, and evaluate the schemes on a large trust
network consisting of 800K trust scores expressed among 130K people. We show
that a small number of expressed trusts/distrusts per individual allows us to
predict trust between any two people in the system with high accuracy. Our work
appears to be the first to incorporate distrust in a computational trust
propagation setting. Keywords: distrust, trust propagation, web of trust | |||
| A community-aware search engine | | BIBAK | Full-Text | 413-421 | |
| Rodrigo B. Almeida; Virgilio A. F. Almeida | |||
| Current search technologies work in a "one size fits all" fashion.
Therefore, the answer to a query is independent of the specific user's information
need. In this paper we describe a novel ranking technique for personalized
search services that combines content-based and community-based evidences. The
community-based information is used in order to provide context for queries and
is influenced by the current interaction of the user with the service. Our
algorithm is evaluated using data derived from an actual service available on
the Web, an online bookstore. We show that the quality of content-based ranking
strategies can be improved by the use of community information as another
evidential source of relevance. In our experiments the improvements reach up to
48% in terms of average precision. Keywords: data mining, searching and ranking | |||
| Managing versions of web documents in a transaction-time web server | | BIBAK | Full-Text | 422-432 | |
| Curtis E. Dyreson; Hui-ling Lin; Yingxia Wang | |||
| This paper presents a transaction-time HTTP server, called TTApache, that
supports document versioning. A document often consists of a main file
formatted in HTML or XML and several included files such as images and
stylesheets. A change to any of the files associated with a document creates a
new version of that document. To construct a document version history,
snapshots of the document's files are obtained over time. Transaction times are
associated with each file version to record the version's lifetime. The
transaction time is the system time of the edit that created the version.
Accounting for transaction time is essential to supporting audit queries that
delve into past document versions and differential queries that pinpoint
differences between two versions. TTApache performs automatic versioning when a
document is read, thereby removing the burden of versioning from document
authors. Since some versions may be created but never read, TTApache
distinguishes between known and assumed versions of a document. TTApache has a
simple query language to retrieve desired versions. A browser can request a
specific version, or the entire history of a document. Queries can also rewrite
links and references to point to current or past versions. Over time, the
version history of a document continually grows. To free space, some versions
can be vacuumed. Vacuuming a version, however, changes the semantics of requests
for that version. This paper presents several policies for vacuuming versions
and strategies for accounting for vacuumed versions in queries. Keywords: observant system, transaction time, versioning | |||
| Fine-grained, structured configuration management for web projects | | BIBAK | Full-Text | 433-442 | |
| Tien Nhut Nguyen; Ethan Vincent Munson; Cheng Thao | |||
| Researchers in Web engineering have regularly noted that existing Web
application development environments provide little support for managing the
evolution of Web applications. Key limitations of Web development environments
include line-oriented change models that inadequately represent Web document
semantics and an inability to model changes to link structure or the set of
objects making up the Web application. Developers may find it difficult to
grasp how the overall structure of the Web application has changed over time
and may respond by using ad hoc solutions that lead to problems of
maintainability, quality, and reliability. Web applications are software artifacts, and
as such, can benefit from advanced version control and software configuration
management (SCM) technologies from software engineering. We have modified an
integrated development environment to manage the evolution and maintenance of
Web applications. The resulting environment is distinguished by its
fine-grained version control framework, fine-grained Web content change
management, and product versioning configuration management, in which a Web
project can be organized at the logical level and its structure and components
are versioned in a fine-grained manner as well. This paper describes the
motivation for this environment as well as its user interfaces, features, and
implementation. Keywords: software configuration management, version control, web engineering | |||
| Automatic detection of fragments in dynamically generated web pages | | BIBAK | Full-Text | 443-454 | |
| Lakshmish Ramaswamy; Arun Iyengar; Ling Liu; Fred Douglis | |||
| Dividing web pages into fragments has been shown to provide significant
benefits for both content generation and caching. In order for a web site to
use fragment-based content generation, however, good methods are needed for
dividing web pages into fragments. Manual fragmentation of web pages is
expensive, error prone, and unscalable. This paper proposes a novel scheme to
automatically detect and flag fragments that are cost-effective cache units in
web sites serving dynamic content. We consider fragments to be interesting
if they are shared among multiple documents or if they have different lifetime or
personalization characteristics. Our approach has three unique features. First,
we propose a hierarchical and fragment-aware model of the dynamic web pages and
a data structure that is compact and effective for fragment detection. Second,
we present an efficient algorithm to detect maximal fragments that are shared
among multiple documents. Third, we develop a practical algorithm that
effectively detects fragments based on their lifetime and personalization
characteristics. We evaluate the proposed scheme through a series of
experiments, showing the benefits and costs of the algorithms. We also study
the impact of adopting the fragments detected by our system on disk space
utilization and network bandwidth consumption. Keywords: L-P fragments, dynamic content caching, fragment detection, fragment-based
caching, shared fragments | |||
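A rough sketch of the "shared fragment" part of the detection problem: hash every subtree of a set of parsed pages and keep the maximal subtrees that occur in several documents. The tuple-based page model, the hashing, and the maximality test are simplifications invented for illustration; the paper's algorithms also account for lifetime and personalization characteristics.

```python
import hashlib
from collections import defaultdict

# A page is modeled as a nested tuple: (tag, text, [children]).
def subtree_hash(node):
    tag, text, children = node
    h = hashlib.sha1()
    h.update(tag.encode())
    h.update(text.encode())
    for child in children:
        h.update(subtree_hash(child).encode())
    return h.hexdigest()

def shared_fragments(pages, min_docs=2):
    """Hashes of maximal subtrees appearing in at least `min_docs` pages.

    'Maximal' here means a subtree is dropped when its parent is shared by
    exactly the same set of documents. Hashes are recomputed naively, so this
    sketch is quadratic; a real detector would memoize them bottom-up.
    """
    where = defaultdict(set)      # subtree hash -> ids of pages containing it
    parent_of = {}                # subtree hash -> hash of (one of) its parents

    def walk(node, page_id, parent=None):
        h = subtree_hash(node)
        where[h].add(page_id)
        if parent is not None:
            parent_of[h] = parent
        for child in node[2]:
            walk(child, page_id, h)

    for pid, page in enumerate(pages):
        walk(page, pid)

    shared = {h for h, docs in where.items() if len(docs) >= min_docs}
    return {h for h in shared
            if parent_of.get(h) not in shared or where[parent_of[h]] != where[h]}
```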
| Incremental formalization of document annotations through ontology-based paraphrasing | | BIBAK | Full-Text | 455-461 | |
| Jim Blythe; Yolanda Gil | |||
| For the manual semantic markup of documents to become widespread, users
must be able to express annotations that conform to ontologies (or schemas)
that have shared meaning. However, a typical user is unlikely to be familiar
with the details of the terms as defined by the ontology authors. In addition,
the idea to be expressed may not fit perfectly within a pre-defined ontology.
The ideal tool should help users find a partial formalization that closely
follows the ontology where possible but deviates from the formal representation
where needed. We describe an implemented approach to help users create
semi-structured semantic annotations for a document according to an extensible
OWL ontology. In our approach, users enter a short sentence in free text to
describe all or part of a document, and the system presents a set of potential
paraphrases of the sentence that are generated from valid expressions in the
ontology, from which the user chooses the closest match. We use a combination
of off-the-shelf parsing tools and breadth-first search of expressions in the
ontology to help users create valid annotations starting from free text. The
user can also define new terms to augment the ontology, so the potential
matches can improve over time. Keywords: document annotation, knowledge acquisition, semantic markup | |||
| Towards the self-annotating web | | BIBAK | Full-Text | 462-471 | |
| Philipp Cimiano; Siegfried Handschuh; Steffen Staab | |||
| The success of the Semantic Web depends on the availability of ontologies as
well as on the proliferation of web pages annotated with metadata conforming to
these ontologies. Thus, a crucial question is where to acquire these metadata
from. In this paper we propose PANKOW (Pattern-based Annotation through
Knowledge on the Web), a method which employs an unsupervised, pattern-based
approach to categorize instances with regard to an ontology. The approach is
evaluated against the manual annotations of two human subjects, and is
implemented in OntoMat, an annotation tool for the Semantic Web, showing very
promising results. Keywords: information extraction, metadata, semantic annotation, semantic web | |||
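The unsupervised, pattern-based idea can be illustrated in a few lines of Python: for a candidate instance, count how often Hearst-style patterns pairing it with each candidate concept occur on the Web, and pick the concept with the highest aggregate count. The pattern list is a small sample and `hit_count` stands in for whatever search backend supplies the counts; neither is PANKOW's exact configuration.

```python
def categorize(instance, concepts, hit_count):
    """Pattern-based instance categorization (illustrative sketch).

    `hit_count` is a caller-supplied function phrase -> number of web hits
    (e.g. backed by a search API). The patterns below are a small sample of
    Hearst-style patterns, not the paper's exact list.
    """
    patterns = [
        "{c}s such as {i}",
        "{i} is a {c}",
        "{i} and other {c}s",
        "{c}s like {i}",
    ]
    scores = {
        c: sum(hit_count(p.format(i=instance, c=c)) for p in patterns)
        for c in concepts
    }
    return max(scores, key=scores.get), scores

# Usage (with a stubbed hit counter):
# best, scores = categorize("Nile", ["river", "hotel"], hit_count=my_counter)
```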
| Web taxonomy integration using support vector machines | | BIBAK | Full-Text | 472-481 | |
| Dell Zhang; Wee Sun Lee | |||
| We address the problem of integrating objects from a source taxonomy into a
master taxonomy. This problem is not only currently pervasive on the web, but
also important to the emerging semantic web. A straightforward approach to
automating this process would be to train a classifier for each category in the
master taxonomy, and then classify objects from the source taxonomy into these
categories. In this paper we attempt to use a powerful classification method,
Support Vector Machine (SVM), to attack this problem. Our key insight is that
the availability of the source taxonomy data could be helpful to build better
classifiers in this scenario; therefore it would be beneficial to do
transductive learning rather than inductive learning, i.e., learning to
optimize classification performance on a particular set of test examples.
Noticing that the categorizations of the master and source taxonomies often
have some semantic overlap, we propose a method, Cluster Shrinkage (CS), to
further enhance the classification by exploiting such implicit knowledge. Our
experiments with real-world web data show substantial improvements in the
performance of taxonomy integration. Keywords: classification, ontology mapping, semantic web, support vector machines,
taxonomy integration, transductive learning | |||
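A rough rendering of the Cluster Shrinkage idea in Python: before (transductive) classification, each object from the source taxonomy is pulled toward the centroid of its source category, so objects that the source taxonomy groups together stay close in feature space. The blending weight `eta` and the centroid formulation are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def cluster_shrinkage(X, categories, eta=0.5):
    """Shrink each vector toward the centroid of its source-taxonomy category.

    X: (n_objects, n_features) feature matrix of source-taxonomy objects.
    categories: list of length n_objects giving each object's source category.
    eta: assumed blending weight between the original vector and the centroid.
    """
    X = np.asarray(X, dtype=float)
    shrunk = X.copy()
    for c in set(categories):
        idx = [i for i, ci in enumerate(categories) if ci == c]
        centroid = X[idx].mean(axis=0)
        shrunk[idx] = (1 - eta) * X[idx] + eta * centroid
    return shrunk  # feed the shrunk vectors to the (transductive) SVM
```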
| Newsjunkie: providing personalized newsfeeds via analysis of information novelty | | BIBAK | Full-Text | 482-490 | |
| Evgeniy Gabrilovich; Susan Dumais; Eric Horvitz | |||
| We present a principled methodology for filtering news stories by formal
measures of information novelty, and show how the techniques can be used to
custom-tailor news feeds based on information that a user has already reviewed.
We review methods for analyzing novelty and then describe Newsjunkie, a system
that personalizes news for users by identifying the novelty of stories in the
context of stories they have already reviewed. Newsjunkie employs
novelty-analysis algorithms that represent articles as words and named
entities. The algorithms analyze inter- and intra-document dynamics by
considering how information evolves over time from article to article, as well
as within individual articles. We review the results of a user study undertaken
to gauge the value of the approach over legacy time-based review of newsfeeds,
and also to compare the performance of alternate distance metrics that are used
to estimate the dissimilarity between candidate new articles and sets of
previously reviewed articles. Keywords: news, novelty detection, personalization | |||
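A minimal sketch of novelty scoring in the spirit described above, using scikit-learn: candidate articles are scored by their TF-IDF cosine distance to the closest already-reviewed article. The bag-of-words representation and the min-over-reviewed distance are only one of several options; the actual system also uses named entities and alternate distance metrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def novelty_scores(reviewed, candidates):
    """Score candidate articles by dissimilarity to already-reviewed ones.

    Returns one novelty score per candidate: 1 minus the cosine similarity
    to its closest reviewed article (higher = more novel information).
    """
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(reviewed + candidates)
    R, C = X[: len(reviewed)], X[len(reviewed):]
    sims = cosine_similarity(C, R)     # candidates x reviewed
    return 1.0 - sims.max(axis=1)      # distance to the closest reviewed article
```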
| Information diffusion through blogspace | | BIBAK | Full-Text | 491-501 | |
| Daniel Gruhl; R. Guha; David Liben-Nowell; Andrew Tomkins | |||
| We study the dynamics of information propagation in environments of
low-overhead personal publishing, using a large collection of weblogs over time
as our example domain. We characterize and model this collection at two levels.
First, we present a macroscopic characterization of topic propagation through
our corpus, formalizing the notion of long-running "chatter" topics consisting
recursively of "spike" topics generated by outside world events, or more
rarely, by resonances within the community. Second, we present a microscopic
characterization of propagation from individual to individual, drawing on the
theory of infectious diseases to model the flow. We propose, validate, and
employ an algorithm to induce the underlying propagation network from a
sequence of posts, and report on the results. Keywords: blogs, information propagation, memes, topic characterization, topic
structure, viral propagation, viruses | |||
| Automatic web news extraction using tree edit distance | | BIBAK | Full-Text | 502-511 | |
| D. C. Reis; P. B. Golgher; A. S. Silva; A. F. Laender | |||
| The Web poses itself as the largest data repository ever available in the
history of humankind. Major efforts have been made in order to provide
efficient access to relevant information within this huge repository of data.
Although several techniques have been developed to address the problem of Web
data extraction, their use is still not widespread, mostly because of the need
for heavy human intervention and the low quality of the extraction results.
In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites. Keywords: data extraction, edit distance, schema inference, web | |||
| Accurate, scalable in-network identification of p2p traffic using application signatures | | BIBAK | Full-Text | 512-521 | |
| Subhabrata Sen; Oliver Spatscheck; Dongmei Wang | |||
| The ability to accurately identify the network traffic associated with
different P2P applications is important to a broad range of network operations
including application-specific traffic engineering, capacity planning,
provisioning, service differentiation, etc. However, traditional techniques for
mapping traffic to higher-level applications, such as disambiguation based on
default server TCP or UDP network ports, are highly inaccurate for some P2P
applications.
In this paper, we provide an efficient approach for identifying the P2P application traffic through application level signatures. We first identify the application level signatures by examining available documentation and packet-level traces. We then utilize the identified signatures to develop online filters that can efficiently and accurately track the P2P traffic even on high-speed network links. We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than 5% false positive and false negative ratios in most cases. We also show that our approach only requires the examination of the very first few packets (less than 10 packets) to identify a P2P connection, which makes our approach highly scalable. Our technique can significantly improve the P2P traffic volume estimates over what pure network port based approaches provide. For instance, we were able to identify 3 times as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach. Keywords: application-level signatures, online application classification, p2p,
traffic analysis | |||
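The filtering idea can be sketched as matching the payloads of a flow's first few packets against known application-level byte signatures. The signature strings below are illustrative examples only; real deployments derive them from protocol documentation and packet traces, as the paper describes.

```python
SIGNATURES = {
    # Illustrative byte patterns only, not a vetted signature set.
    "gnutella":   [b"GNUTELLA CONNECT", b"GNUTELLA/"],
    "kazaa":      [b"GET /.hash=", b"X-Kazaa-"],
    "bittorrent": [b"\x13BitTorrent protocol"],
}

def classify_flow(first_packets, max_packets=10):
    """Label a flow from the payloads of its first few packets.

    Mirrors the observation that only the first ~10 packets need inspection;
    returns the first protocol whose signature matches, else None.
    """
    for payload in first_packets[:max_packets]:
        for proto, sigs in SIGNATURES.items():
            if any(sig in payload for sig in sigs):
                return proto
    return None
```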
| Characterization of a large web site population with implications for content delivery | | BIBAK | Full-Text | 522-533 | |
| L. Bent; M. Rabinovich; G. M. Voelker; Z. Xiao | |||
| This paper presents a systematic study of the properties of a large number
of Web sites hosted by a major ISP. To our knowledge, ours is the first
comprehensive study of a large server farm that contains thousands of
commercial Web sites. We also perform a simulation analysis to estimate
potential performance benefits of content delivery networks (CDNs) for these
Web sites. We make several interesting observations about the current usage of
Web technologies and Web site performance characteristics. First, compared with
previous client workload studies, the Web server farm workload contains a much
higher degree of uncacheable responses and responses that require mandatory
cache validations. A significant reason for this is that cookie use is
prevalent among our population, especially among more popular sites. However,
we found an indication of widespread indiscriminate usage of cookies, which
unnecessarily impedes the use of many content delivery optimizations. We also
found that most Web sites do not utilize the cache-control features of the HTTP
1.1 protocol, resulting in suboptimal performance. Moreover, the implicit
expiration time in client caches for responses is constrained by the maximum
values allowed in the Squid proxy. Finally, our simulation results indicate
that most Web sites benefit from the use of a CDN. The amount of the benefit
depends on site popularity, and, somewhat surprisingly, a CDN may increase the
peak to average request ratio at the origin server because the CDN can decrease
the average request rate more than the peak request rate. Keywords: content distribution, cookie, http, measurement, performance, web caching,
workload characterization | |||
| Analyzing client interactivity in streaming media | | BIBAK | Full-Text | 534-543 | |
| Cristiano P. Costa; Italo S. Cunha; Alex Borges; Claudiney V. Ramos; Marcus M. Rocha; Jussara M. Almeida; Berthier Ribeiro-Neto | |||
| This paper provides an extensive analysis of pre-stored streaming media
workloads, focusing on the client interactive behavior. We analyze four
workloads that fall into three different domains, namely, education,
entertainment video and entertainment audio. Our main goals are: (a) to
identify qualitative similarities and differences in the typical client
behavior for the three workload classes and (b) to provide data for generating
realistic synthetic workloads. Keywords: streaming media, workload characterization | |||
| Augmenting semantic web service descriptions with compositional specification | | BIBAK | Full-Text | 544-552 | |
| Monika Solanki; Antonio Cau; Hussein Zedan | |||
| Current ontological specifications for semantically describing properties of
Web services are limited to their static interface description. Normally for
proving properties of service compositions, mapping input/output parameters and
specifying the pre/post conditions are found to be sufficient. However these
properties are assertions only on the initial and final states of the service
respectively. They do not help in specifying/verifying ongoing behaviour of an
individual service or a composed system. We propose a framework for enriching
semantic service descriptions with two compositional assertions: assumption and
commitment that facilitate reasoning about service composition and verification
of their integration. The technique is based on Interval Temporal Logic (ITL):
a sound formalism for specifying and proving temporal properties of systems.
Our approach utilizes the recently proposed Semantic Web Rule Language. Keywords: assumption, commitment, interval temporal logics, owl, owl-s, semantic web
services, swrl, web services | |||
| Meteor-s web service annotation framework | | BIBAK | Full-Text | 553-562 | |
| Abhijit A. Patil; Swapna A. Oundhakar; Amit P. Sheth; Kunal Verma | |||
| The World Wide Web is emerging not only as an infrastructure for data, but
also for a broader variety of resources that are increasingly being made
available as Web services. Relevant current standards like UDDI, WSDL, and SOAP
are in their fledgling years and form the basis of making Web services a
workable and broadly adopted technology. However, realizing the fuller scope of
the promise of Web services and associated service oriented architecture will
require further technological advances in the areas of service interoperation,
service discovery, service composition, and process orchestration. Semantics,
especially as supported by the use of ontologies, and related Semantic Web
technologies, are likely to provide better qualitative and scalable solutions
to these requirements. Just as semantic annotation of data in the Semantic Web
is the first critical step to better search, integration and analytics over
heterogeneous data, semantic annotation of Web services is an equally critical
first step to achieving the above promise. Our approach is to work with
existing Web services technologies and combine them with ideas from the
Semantic Web to create a better framework for Web service discovery and
composition. In this paper we present MWSAF (METEOR-S Web Service Annotation
Framework), a framework for semi-automatically marking up Web service
descriptions with ontologies. We have developed algorithms to match and
annotate WSDL files with relevant ontologies. We use domain ontologies to
categorize Web services into domains. An empirical study of our approach is
presented to help evaluate its performance. Keywords: ontology, semantic annotation of web services, semantic web services, web
services discovery, wsdl | |||
| Foundations for service ontologies: aligning OWL-S to dolce | | BIBAK | Full-Text | 563-572 | |
| Peter Mika; Daniel Oberle; Aldo Gangemi; Marta Sabou | |||
| Clarity in semantics and a rich formalization of this semantics are
important requirements for ontologies designed to be deployed in large-scale,
open, distributed systems such as the envisioned Semantic Web. This is
especially important for the description of Web Services, which should enable
complex tasks involving multiple agents. As one of the first initiatives of the
Semantic Web community for describing Web Services, OWL-S attracts a lot of
interest even though it is still under development. We identify problematic
aspects of OWL-S and suggest enhancements through alignment to a foundational
ontology. Another contribution of our work is the Core Ontology of Services
that tries to fill the epistemological gap between the foundational ontology
and OWL-S. It can be reused to align other Web Service description languages as
well. Finally, we demonstrate the applicability of our work by aligning OWL-S'
standard example called CongoBuy. Keywords: core ontology of services, daml-s, descriptions and situations, dolce,
owl-s, semantic web, web services | |||
| Mining models of human activities from the web | | BIBAK | Full-Text | 573-582 | |
| Mike Perkowitz; Matthai Philipose; Kenneth Fishkin; Donald J. Patterson | |||
| The ability to determine what day-to-day activity (such as cooking pasta,
taking a pill, or watching a video) a person is performing is of interest in
many application domains. A system that can do this requires models of the
activities of interest, but model construction does not scale well: humans must
specify low-level details, such as segmentation and feature selection of sensor
data, and high-level structure, such as spatio-temporal relations between
states of the model, for each and every activity. As a result, previous
practical activity recognition systems have been content to model a tiny
fraction of the thousands of human activities that are potentially useful to
detect. In this paper, we present an approach to sensing and modeling
activities that scales to a much larger class of activities than before. We
show how a new class of sensors, based on Radio Frequency Identification (RFID)
tags, can directly yield semantic terms that describe the state of the physical
world. These sensors allow us to formulate activity models by translating
labeled activities, such as 'cooking pasta', into probabilistic collections of
object terms, such as 'pot'. Given this view of activity models as text
translations, we show how to mine definitions of activities in an unsupervised
manner from the web. We have used our technique to mine definitions for over
20,000 activities. We experimentally validate our approach using data gathered
from actual human activity as well as simulated data. Keywords: activity inference, activity models, rfid, web mining | |||
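A toy version of the "activities as probabilistic collections of object terms" view: each activity is a distribution over objects (mined, in the paper, from web how-to pages), and an observed sequence of RFID-tagged objects is scored by log-likelihood. The models and the floor probability below are invented placeholders.

```python
from math import log

# Activity models: P(object | activity); the probabilities are made-up placeholders.
MODELS = {
    "make tea":   {"kettle": 0.4, "cup": 0.3, "teabag": 0.3},
    "make pasta": {"pot": 0.4, "pasta": 0.3, "strainer": 0.2, "cup": 0.1},
}

def most_likely_activity(observed_objects, models=MODELS, floor=1e-3):
    """Score activities by the log-likelihood of the observed object terms.

    RFID-tagged objects directly yield the terms; unseen objects get a small
    floor probability so one missing term does not zero out an activity.
    """
    def score(dist):
        return sum(log(dist.get(obj, floor)) for obj in observed_objects)
    return max(models, key=lambda a: score(models[a]))

print(most_likely_activity(["kettle", "cup"]))   # -> 'make tea'
```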
| TeXQuery: a full-text search extension to XQuery | | BIBAK | Full-Text | 583-594 | |
| S. Amer-Yahia; C. Botev; J. Shanmugasundaram | |||
| One of the key benefits of XML is its ability to represent a mix of
structured and unstructured (text) data. Although current XML query languages
such as XPath and XQuery can express rich queries over structured data, they
can only express very rudimentary queries over text data. We thus propose
TeXQuery, which is a powerful full-text search extension to XQuery. TeXQuery
provides a rich set of fully composable full-text search primitives, such as
Boolean connectives, phrase matching, proximity distance, stemming and
thesauri. TeXQuery also enables users to seamlessly query over both structured
and text data by embedding TeXQuery primitives in XQuery, and vice versa.
Finally, TeXQuery supports a flexible scoring construct that can be used to
score query results based on full-text predicates. TeXQuery is the precursor of
the full-text language extensions to XPath 2.0 and XQuery 1.0 currently being
developed by the W3C. Keywords: full-text search, XQuery | |||
| The WebGraph framework I: compression techniques | | BIBAK | Full-Text | 595-602 | |
| P. Boldi; S. Vigna | |||
| Studying web graphs is often difficult due to their large size.
Recently, several proposals have been published about various techniques that
allow a web graph to be stored in memory in limited space, exploiting the inner
redundancies of the web. The WebGraph framework is a suite of codes, algorithms
and tools that aims at making it easy to manipulate large web graphs. This
paper presents the compression techniques used in WebGraph, which are centred
around referentiation and intervalisation (which in turn are dual to each
other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as
little as 3.08 bits per link, and its transposed version in as little as 2.89
bits per link. Keywords: compression, web graph | |||
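The two techniques named above can be sketched as gap encoding of sorted successor lists and expressing one node's list as a copy-mask over a similar reference list plus gap-encoded extras. This shows only the structural idea; WebGraph's actual codes (intervalisation of consecutive runs, instantaneous codes, reference-window selection) are omitted.

```python
def gap_encode(successors):
    """Encode a sorted successor list as its first id followed by gaps.

    (Intervalisation proper would also collapse runs of consecutive ids;
    that part is omitted from this sketch.)
    """
    gaps, prev = [], None
    for s in successors:
        gaps.append(s if prev is None else s - prev - 1)
        prev = s
    return gaps

def reference_encode(successors, reference):
    """Referentiation sketch: describe a successor list as a copy-mask over a
    reference node's list plus gap-encoded 'extra' successors."""
    succ_set, ref_set = set(successors), set(reference)
    copy_mask = [1 if r in succ_set else 0 for r in reference]
    extras = [s for s in successors if s not in ref_set]
    return copy_mask, gap_encode(extras)

# Node 16 reuses most of node 15's adjacency list and adds two links of its own:
print(reference_encode([13, 15, 16, 17, 20], reference=[13, 15, 16, 22]))
# -> ([1, 1, 1, 0], [17, 2])
```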
| XQuery at your web service | | BIBAK | Full-Text | 603-611 | |
| Nicola Onose; Jerome Simeon | |||
| XML messaging is at the heart of Web services, providing the flexibility
required for their deployment, composition, and maintenance. Yet, current
approaches to Web services development hide the messaging layer behind Java or
C# APIs, preventing the application from getting direct access to the underlying XML
information. To address this problem, we advocate the use of a native XML
language, namely XQuery, as an integral part of the Web services development
infrastructure. The main contribution of the paper is a binding between WSDL,
the Web Services Description Language, and XQuery. The approach enables the use
of XQuery for both Web services deployment and composition. We present a simple
command-line tool that can be used to automatically deploy a Web service from a
given XQuery module, and extend the XQuery language itself with a statement for
accessing one or more Web services. The binding provides tight-coupling between
WSDL and XQuery, yielding additional benefits, notably: the ability to use WSDL
as an interface language for XQuery, and the ability to perform static typing
on XQuery programs that include Web service calls. Last but not least, the
proposal requires only minimal changes to the existing infrastructure. We
report on our experience implementing this approach in the Galax XQuery
processor. Keywords: XML, XQuery, interface, modules, web services, wsdl | |||
| Adapting databases and WebDAV protocol | | BIBA | Full-Text | 612-620 | |
| Bita Shadgar; Ian Holyer | |||
| The ability of the Web to share data regardless of geographical location raises a new issue called remote authoring. With the Internet and Web browsers being independent of hardware, it becomes possible to build Web-enabled database applications. Many approaches are provided to integrate databases into the Web environment, which use the Web's protocol, i.e. HTTP, to transfer the data between clients and servers. However, those methods are affected by HTTP's shortfalls with regard to remote authoring. This paper introduces and discusses a new methodology for remote authoring of databases, which is based on the WebDAV protocol. It is a seamless and effective methodology for accessing and authoring databases, particularly in that it naturally benefits from WebDAV advantages such as metadata and access control. These features establish a standard way of accessing database metadata, and increase database security, while speeding up the database connection. | |||
| Analysis of interacting BPEL web services | | BIBAK | Full-Text | 621-630 | |
| Xiang Fu; Tevfik Bultan; Jianwen Su | |||
| This paper presents a set of tools and techniques for analyzing interactions
of composite web services which are specified in BPEL and communicate through
asynchronous XML messages. We model the interactions of composite web services
as conversations, the global sequence of messages exchanged by the web
services. As opposed to earlier work, our tool-set handles rich data
manipulation via XPath expressions. This allows us to verify designs at a more
detailed level and check properties about message content. We present a
framework where BPEL specifications of web services are translated to an
intermediate representation, followed by the translation of the intermediate
representation to a verification language. As an intermediate representation we
use guarded automata augmented with unbounded queues for incoming messages,
where the guards are expressed as XPath expressions. As the target verification
language we use Promela, the input language of the model checker SPIN. Since the
SPIN model checker is a finite-state verification tool, we can only achieve partial
verification by fixing the sizes of the input queues in the translation. We
propose the concept of synchronizability to address this problem. We show that
if a composite web service is synchronizable, then its conversation set remains
the same when asynchronous communication is replaced with synchronous
communication. We give a set of sufficient conditions that guarantee
synchronizability and that can be checked statically. Based on our
synchronizability results, we show that a large class of composite web services
with unbounded input queues can be completely verified using a finite state
model checker such as SPIN. Keywords: BPEL, asynchronous communication, conversation, model checking, spin,
synchronizability, web service, XPath | |||
| Index structures and algorithms for querying distributed RDF repositories | | BIBAK | Full-Text | 631-639 | |
| Heiner Stuckenschmidt; Richard Vdovjak; Geert-Jan Houben; Jeen Broekstra | |||
| A technical infrastructure for storing, querying and managing RDF data is a
key element in the current semantic web development. Systems like Jena, Sesame
or the ICS-FORTH RDF Suite are widely used for building semantic web
applications. Currently, none of these systems supports the integrated querying
of distributed RDF repositories. We consider this a major shortcoming since the
semantic web is distributed by nature. In this paper we present an architecture
for querying distributed RDF repositories by extending the existing Sesame
system. We discuss the implications of our architecture and propose an index
structure as well as algorithms for query processing and optimization in such a
distributed context. Keywords: RDF querying, index structures, optimization | |||
| REMINDIN': semantic query routing in peer-to-peer networks based on social metaphors | | BIBAK | Full-Text | 640-649 | |
| Christoph Tempich; Steffen Staab; Adrian Wranik | |||
| In peer-to-peer networks, finding the appropriate answer for an information
request, such as the answer to a query for RDF(S) data, depends on selecting
the right peer in the network. We here investigate how social metaphors can be
exploited effectively and efficiently to solve this task. To this end, we
define a method for query routing, REMINDIN', that lets peers (i) observe which
queries are successfully answered by other peers, (ii) memorize this
observation, and (iii) subsequently use this information in order to select
peers to forward requests to.
REMINDIN' has been implemented for the SWAP peer-to-peer platform as well as for a simulation environment. We have used the simulation environment in order to investigate how successful variations of REMINDIN' are and how they compare to baseline strategies in terms of number of messages forwarded in the network and statements appropriately retrieved. Keywords: ontologies, peer selection, peer-to-peer, query routing | |||
| RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network | | BIBAK | Full-Text | 650-657 | |
| Min Cai; Martin Frank | |||
| Centralized Resource Description Framework (RDF) repositories have
limitations both in their failure tolerance and in their scalability. Existing
Peer-to-Peer (P2P) RDF repositories either cannot guarantee to find query
results, even if these results exist in the network, or require up-front
definition of RDF schemas and designation of super peers. We present a scalable
distributed RDF repository (RDFPeers) that stores each triple at three places
in a multi-attribute addressable network by applying globally known hash
functions to its subject, predicate, and object. Thus all nodes know which node
is responsible for storing triple values they are looking for and both
exact-match and range queries can be efficiently routed to those nodes.
RDFPeers has no single point of failure nor elevated peers and does not require
the prior definition of RDF schemas. Queries are guaranteed to find matched
triples in the network if the triples exist. In RDFPeers both the number of
neighbors per node and the number of routing hops for inserting RDF triples and
for resolving most queries are logarithmic to the number of nodes in the
network. We further performed experiments that show that the triple-storing
load in RDFPeers differs by less than an order of magnitude between the most
and the least loaded nodes for real-world RDF data. Keywords: distributed RDF repositories, peer-to-peer, semantic web | |||
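An in-process toy of the storage scheme described above: each triple is stored under the hash of its subject, its predicate, and its object, so an exact-match query with any single bound term knows which node to contact. The ring size, hashing, and class names are illustrative; the real system routes over a multi-attribute addressable network and also supports range queries.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 64   # illustrative ring size

def node_for(value):
    """Hash an RDF term to a position on the identifier ring."""
    digest = hashlib.sha1(str(value).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

class ToyRDFPeers:
    """In-process stand-in for the network: node id -> list of stored triples."""
    def __init__(self):
        self.nodes = defaultdict(list)

    def insert(self, s, p, o):
        # Store the triple three times, keyed by subject, predicate, and object,
        # so a query with any single constant knows exactly which node to ask.
        for term in (s, p, o):
            self.nodes[node_for(term)].append((s, p, o))

    def lookup(self, term, position):
        """Exact-match query; position is 0 (subject), 1 (predicate) or 2 (object)."""
        return [t for t in self.nodes[node_for(term)] if t[position] == term]

store = ToyRDFPeers()
store.insert("ex:alice", "foaf:knows", "ex:bob")
print(store.lookup("ex:bob", position=2))
# -> [('ex:alice', 'foaf:knows', 'ex:bob')]
```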
| A hierarchical monothetic document clustering algorithm for summarization and browsing search results | | BIBAK | Full-Text | 658-665 | |
| Krishna Kummamuru; Rohit Lotlikar; Shourya Roy; Karan Singal; Raghu Krishnapuram | |||
| Organizing Web search results into a hierarchy of topics and sub-topics
facilitates browsing the collection and locating results of interest. In this
paper, we propose a new hierarchical monothetic clustering algorithm to build a
topic hierarchy for a collection of search results retrieved in response to a
query. At every level of the hierarchy, the new algorithm progressively
identifies topics in a way that maximizes the coverage while maintaining
distinctiveness of the topics. We refer to the proposed algorithm as DisCover.
Evaluating the quality of a topic hierarchy is a non-trivial task, the ultimate
test being user judgment. We use several objective measures such as coverage
and reach time for an empirical comparison of the proposed algorithm with two
other monothetic clustering algorithms to demonstrate its superiority. Even
though our algorithm is slightly more computationally intensive than one of the
algorithms, it generates better hierarchies. Our user studies also show that
the proposed algorithm is superior to the other algorithms as a summarizing and
browsing tool. Keywords: automatic taxonomy generation, clustering, data mining, search,
summarization | |||
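The core of a monothetic, coverage-driven topic selection step can be sketched greedily: at each level, repeatedly pick the single term that covers the most not-yet-covered results. This ignores the paper's distinctiveness criterion and the recursion that builds the full hierarchy, and the example data are made up.

```python
def pick_topics(docs, k=5):
    """Greedy monothetic topic selection (coverage-driven sketch).

    Each doc is a set of terms; each topic is a single term ('monothetic').
    At every step pick the term covering the most uncovered documents.
    """
    uncovered = set(range(len(docs)))
    topics = []
    while uncovered and len(topics) < k:
        vocab = {t for i in uncovered for t in docs[i]}
        best = max(vocab, key=lambda t: sum(1 for i in uncovered if t in docs[i]))
        topics.append(best)
        uncovered -= {i for i in uncovered if best in docs[i]}
    return topics

docs = [{"jaguar", "car", "speed"}, {"jaguar", "cat", "wildlife"},
        {"car", "dealer"}, {"wildlife", "safari"}]
print(pick_topics(docs, k=2))
```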
| Mining anchor text for query refinement | | BIBAK | Full-Text | 666-674 | |
| Reiner Kraft; Jason Zien | |||
| When searching large hypertext document collections, ambiguous queries often
return too many results. Query
refinement is an interactive process of query modification that can be used to
narrow down the scope of search results. We propose a new method for
automatically generating refinements or related terms to queries by mining
anchor text for a large hypertext document collection. We show that the usage
of anchor text as a basis for query refinement produces high quality refinement
suggestions that are significantly better in terms of perceived usefulness
compared to refinements that are derived using the document content.
Furthermore, our study suggests that anchor text refinements can also be used
to augment traditional query refinement algorithms based on query logs, since
they typically differ in coverage and produce different refinements. Our
results are based on experiments on an anchor text collection of a large
corporate intranet. Keywords: anchor text, query refinement, rank, web search | |||
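A simple sketch of mining refinements from anchor text: anchors that contain all query terms plus something more are treated as candidate refinements and ranked by how often they occur. The filtering and ranking here are deliberately naive stand-ins for the paper's method.

```python
from collections import Counter

def refine(query, anchor_texts, top_n=5):
    """Suggest query refinements mined from a collection of anchor texts.

    Anchors containing every query term (and at least one extra word) become
    candidates, ranked by how many anchors suggest them.
    """
    q = set(query.lower().split())
    candidates = Counter()
    for anchor in anchor_texts:
        words = anchor.lower().split()
        if q <= set(words) and len(words) > len(q):
            candidates[" ".join(words)] += 1
    return [phrase for phrase, _ in candidates.most_common(top_n)]

anchors = ["java tutorial", "java 2 tutorial", "java tutorial for beginners",
           "java api", "python tutorial"]
print(refine("java tutorial", anchors))
```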
| Adaptive web search based on user profile constructed without any effort from users | | BIBAK | Full-Text | 675-684 | |
| Kazunari Sugiyama; Kenji Hatano; Masatoshi Yoshikawa | |||
| Web search engines help users find useful information on the World Wide Web
(WWW). However, when the same query is submitted by different users, typical
search engines return the same result regardless of who submitted the query.
Generally, each user has different information needs for his/her query.
Therefore, the search result should be adapted to users with different
information needs. In this paper, we first propose several approaches to
adapting search results according to each user's need for relevant information
without any user effort, and then verify the effectiveness of our proposed
approaches. Experimental results show that search systems that adapt to each
user's preferences can be achieved by constructing user profiles based on
modified collaborative filtering with detailed analysis of user's browsing
history in one day. Keywords: WWW, information retrieval, user modeling | |||
| Practical semantic analysis of web sites and documents | | BIBAK | Full-Text | 685-693 | |
| Thierry Despeyroux | |||
| As Web sites are now ordinary products, it is necessary to make explicit the
notion of the quality of a Web site. The quality of a site may be linked to its
ease of access and also to other criteria, such as the fact that the
site is up to date and coherent. This last quality is difficult to ensure
because sites may be updated very frequently, may have many authors, and may be
partially generated, and in this context proof-reading is very difficult. The
same piece of information may be found in different occurrences, but also in
data or meta-data, leading to the need for consistency checking. In this paper
we make a parallel between programs and Web sites. We present some examples of
semantic constraints that one would like to specify (constraints between the
meaning of categories and sub-categories in a thematic directory, consistency
between the organization chart and the rest of the site in an academic site).
We briefly present Natural Semantics, a way to specify the semantics of
programming languages that inspires our work. Natural Semantics itself comes
from both operational semantics and logic programming, and its
implementation uses Prolog. Then we propose a specification language for
semantic constraints in Web sites that, in conjunction with the well-known
"make" program, makes it possible to generate site verification tools by compiling
the specification into Prolog code. We apply our method to a large XML document
which is the scientific part of our institute activity report, tracking errors
or inconsistencies and also constructing some indicators that can be used by
the management of the institute. Keywords: XML, consistency, content management, formal semantics, information system,
knowledge management, logic programming, quality, web engineering, web site
evolution, web sites | |||
| Web customization using behavior-based remote executing agents | | BIBAK | Full-Text | 694-703 | |
| Eugene Hung; Joseph Pasquale | |||
| ReAgents are remotely executing agents that customize Web browsing for
non-standard clients. A reAgent is essentially a "one-shot" mobile agent that
acts as an extension of a client, dynamically launched by the client to run on
its behalf at a remote, more advantageous location. ReAgents simplify the use of
mobile agent technology by transparently handling data migration and run-time
network communications, and provide a general interface for programmers to more
easily implement their application-specific customizing logic. This is made
possible by the identification of useful remote behaviors, i.e., common patterns
of actions that exploit the ability to process and communicate remotely.
Examples of such behaviors are transformers, monitors, cachers, and collators. In
this paper we identify a set of useful reAgent behaviors for interacting with
Web services via a standard browser, describe how to program and use reAgents,
and show that the overhead of using reAgents is low and outweighed by its
benefits. Keywords: dynamic deployment, remote agents, web customization | |||
| A possible simplification of the semantic web architecture | | BIBAK | Full-Text | 704-713 | |
| Bernardo Cuenca Grau | |||
| In the semantic Web architecture, Web ontology languages are built on top of
RDF(S). However, serious difficulties have arisen when trying to layer
expressive ontology languages, like OWL, on top of RDF-Schema. Although these
problems can be avoided, OWL (and the whole semantic Web architecture) becomes
much more complex than it should be. In this paper, a possible simplification
of the semantic Web architecture is suggested, which has several important
advantages with respect to the layering currently accepted by the W3C Ontology
Working Group. Keywords: description logics, ontology web language (OWL), resource description
framework (RDF), resource description framework schema (RDF-schema), semantic
web | |||
| A combined approach to checking web ontologies | | BIBAK | Full-Text | 714-722 | |
| J. S. Dong; C. H. Lee; H. B. Lee; Y. F. Li; H. Wang | |||
| The understanding of Semantic Web documents is built upon ontologies that
define concepts and relationships of data. Hence, the correctness of ontologies
is vital. Ontology reasoners such as RACER and FaCT have been developed to
reason about ontologies with a high degree of automation. However, complex
ontology-related properties may not be expressible within the current web
ontology languages, and consequently they may not be checkable by RACER and FaCT.
We propose to use the software engineering techniques and tools, i.e., Z/EVES
and Alloy Analyzer, to complement the ontology tools for checking Semantic Web
documents.
In this approach, Z/EVES is first applied to remove trivial syntax and type errors of the ontologies. Next, RACER is used to identify any ontological inconsistencies, whose origins can be traced by Alloy Analyzer. Finally Z/EVES is used again to express complex ontology-related properties and reveal errors beyond the modeling capabilities of the current web ontology languages. We have successfully applied this approach to checking a set of military plan ontologies. Keywords: alloy, daml+oil, ontologies, racer, semantic web, z | |||
| A proposal for an owl rules language | | BIBAK | Full-Text | 723-731 | |
| Ian Horrocks; Peter F. Patel-Schneider | |||
| Although the OWL Web Ontology Language adds considerable expressive power to
the Semantic Web it does have expressive limitations, particularly with respect
to what can be said about properties. We present ORL (OWL Rules Language), a
Horn clause rules extension to OWL that overcomes many of these limitations.
ORL extends OWL in a syntactically and semantically coherent manner: the basic
syntax for ORL rules is an extension of the abstract syntax for OWL DL and
OWL Lite; ORL rules are given formal meaning via an extension of the OWL DL
model-theoretic semantics; ORL rules are given an XML syntax based on the OWL
XML presentation syntax; and a mapping from ORL rules to RDF graphs is given
based on the OWL RDF/XML exchange syntax. We discuss the expressive power of
ORL, showing that the ontology consistency problem is undecidable, provide
several examples of ORL usage, and discuss how reasoning support for ORL might
be provided. Keywords: model-theoretic semantics, representation, semantic web | |||
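As a flavour of the added expressive power, the kind of Horn rule such an extension admits (a composition of properties, which OWL alone cannot express) looks like the following; the rendering below is an informal, human-readable notation rather than the paper's exact abstract or XML syntax.

```latex
% informal rendering of a Horn rule over OWL properties (illustrative)
\mathit{hasParent}(?x, ?y) \wedge \mathit{hasBrother}(?y, ?z)
  \;\Rightarrow\; \mathit{hasUncle}(?x, ?z)
```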