| Query-free news search | | BIBAK | Full-Text | 1-10 | |
| Monika Henzinger; Bay-Wei Chang; Brian Milch; Sergey Brin | |||
| Many daily activities present information in the form of a stream of text,
and often people can benefit from additional information on the topic
discussed. TV broadcast news can be treated as one such stream of text; in this
paper we discuss finding news articles on the web that are relevant to news
currently being broadcast.
We evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. We also evaluated several postprocessing techniques for improving the precision, including reranking using additional terms, reranking by document similarity, and filtering on document similarity. For the best algorithm, 84%-91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. In addition, a relevant article was found for at least 70% of the topics. Keywords: query-free search, web information retrieval | |||
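The query-construction step this abstract alludes to (scoring terms in the current caption window by term frequency and inverse document frequency, then issuing a short query) can be illustrated with a minimal sketch. The toy idf table, the two-term query length, and the function names below are assumptions for illustration, not the authors' algorithm; stemming, compounds, and history handling are omitted.

```python
import math
import re
from collections import Counter

# Toy document-frequency table standing in for statistics from a large
# news corpus; in the paper, idf comes from a real document collection.
DOC_FREQ = {"federal": 200, "reserve": 150, "interest": 400,
            "rates": 300, "today": 5000, "said": 8000}
NUM_DOCS = 10000

def idf(term):
    df = DOC_FREQ.get(term)
    if df is None:            # ignore terms we have no statistics for
        return 0.0
    return math.log(NUM_DOCS / df)

def build_query(caption_window, query_len=2):
    """Score terms in the current caption window by tf*idf and keep the
    top `query_len` terms as the query sent to a news search engine."""
    terms = re.findall(r"[a-z]+", caption_window.lower())
    tf = Counter(terms)
    scored = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return " ".join(scored[:query_len])

print(build_query("The Federal Reserve said today it will leave interest rates unchanged"))
# -> "reserve federal" with this toy idf table
```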
| Improving pseudo-relevance feedback in web information retrieval using web page segmentation | | BIBAK | Full-Text | 11-18 | |
| Shipeng Yu; Deng Cai; Ji-Rong Wen; Wei-Ying Ma | |||
| In contrast to traditional document retrieval, a web page as a whole is not
a good information unit to search because it often contains multiple topics and
a lot of irrelevant information from navigation, decoration, and interaction
parts of the page. In this paper, we propose a VIsion-based Page Segmentation
(VIPS) algorithm to detect the semantic content structure in a web page.
Compared with a simple DOM-based segmentation method, our page segmentation
scheme utilizes useful visual cues to obtain a better partition of a page at
the semantic level. By using our VIPS algorithm to assist the selection of
query expansion terms in pseudo-relevance feedback in web information
retrieval, we achieve a 27% performance improvement on the Web Track dataset. Keywords: page segmentation, query expansion, relevance feedback, web information
retrieval | |||
| Predictive caching and prefetching of query results in search engines | | BIBAK | Full-Text | 19-28 | |
| Ronny Lempel; Shlomo Moran | |||
| We study the caching of query result pages in Web search engines. Popular
search engines receive millions of queries per day, and efficient policies for
caching query results may enable them to lower their response time and reduce
their hardware requirements. We present PDC (probability driven cache), a novel
scheme tailored for caching search results that is based on a probabilistic
model of search engine users. We then use a trace of over seven million queries
submitted to the search engine AltaVista to evaluate PDC, as well as
traditional LRU and SLRU based caching schemes. The trace driven simulations
show that PDC outperforms the other policies. We also examine the prefetching
of search results, and demonstrate that prefetching can increase cache hit
ratios by 50% for large caches, and can double the hit ratios of small caches.
When integrating prefetching into PDC, we attain hit ratios of over 0.53. Keywords: caching, query processing and optimization | |||
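The prefetching result can be made concrete with a small sketch: a query-result cache that, on a request for page i of a query's results, speculatively fetches and caches page i+1, since follow-up requests frequently ask for the next result page. The class name and backend interface below are illustrative assumptions; PDC's probabilistic user model and eviction policy are not reproduced here.

```python
from collections import OrderedDict

class ResultPageCache:
    """LRU cache of (query, page_number) -> result page with simple
    next-page prefetching; only an illustration of the prefetching idea."""

    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend            # callable (query, page) -> results
        self.cache = OrderedDict()

    def _put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def get(self, query, page):
        key = (query, page)
        if key in self.cache:
            self.cache.move_to_end(key)      # cache hit
            return self.cache[key]
        value = self.backend(query, page)    # miss: compute the result page
        self._put(key, value)
        # Users often ask for the next page, so fetch it speculatively.
        self._put((query, page + 1), self.backend(query, page + 1))
        return value

cache = ResultPageCache(capacity=1000,
                        backend=lambda q, p: f"results for {q!r}, page {p}")
cache.get("world cup", 1)   # miss, plus prefetch of page 2
cache.get("world cup", 2)   # hit thanks to prefetching
```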
| Model-theoretic semantics for the web | | BIBAK | Full-Text | 29-38 | |
| James Farrugia | |||
| Model-theoretic semantics is a formal account of the interpretations of
legitimate expressions of a language. It is increasingly being used to provide
Web markup languages with well-defined semantics. But a discussion of its roles
and limitations for the Semantic Web has not yet received a coherent and
detailed treatment. This paper takes the first steps towards such a treatment.
The major result is an introductory explication of key ideas that are usually
only implicit in existing accounts of semantics for the Web. References to more
detailed accounts of these ideas are also provided. The benefit of this
explication is increased awareness among Web users of some important issues
inherent in using model-theoretic semantics for Web markup languages. Keywords: model-theoretic semantics, semantics, web markup languages | |||
| Three theses of representation in the semantic web | | BIBAK | Full-Text | 39-47 | |
| Ian Horrocks; Peter F. Patel-Schneider | |||
| The Semantic Web is vitally dependent on a formal meaning for the constructs
of its languages. For Semantic Web languages to work well together their formal
meanings must employ a common view (or thesis) of representation, otherwise it
will not be possible to reconcile documents written in different languages. The
thesis of representation underlying RDF and RDFS is particularly troublesome in
this regard, as it has several unusual aspects, both semantic and syntactic. A
more-standard thesis of representation would result in the ability to reuse
existing results and tools in the Semantic Web. Keywords: model-theoretic semantics, representation, semantic web | |||
| Description logic programs: combining logic programs with description logic | | BIBAK | Full-Text | 48-57 | |
| Benjamin N. Grosof; Ian Horrocks; Raphael Volz; Stefan Decker | |||
| We show how to interoperate, semantically and inferentially, between the
leading Semantic Web approaches to rules (RuleML Logic Programs) and ontologies
(OWL/DAML+OIL Description Logic) via analyzing their expressive intersection.
To do so, we define a new intermediate knowledge representation (KR) contained
within this intersection: Description Logic Programs (DLP), and the closely
related Description Horn Logic (DHL) which is an expressive fragment of
first-order logic (FOL). DLP provides a significant degree of expressiveness,
substantially greater than the RDF-Schema fragment of Description Logic. We
show how to perform DLP-fusion: the bidirectional translation of premises and
inferences (including typical kinds of queries) from the DLP fragment of DL to
LP, and vice versa from the DLP fragment of LP to DL. In particular, this
translation enables one to "build rules on top of ontologies": it enables the
rule KR to have access to DL ontological definitions for vocabulary primitives
(e.g., predicates and individual constants) used by the rules. Conversely, the
DLP-fusion technique likewise enables one to "build ontologies on top of
rules": it enables ontological definitions to be supplemented by rules, or
imported into DL from rules. It also enables available efficient LP inferencing
algorithms/implementations to be exploited for reasoning over large-scale DL
ontologies. Keywords: RDF, XML, description logic, inferencing, information integration,
interoperability, knowledge representation, logic programs, model-theoretic
semantics, ontologies, rules, semantic web, translation | |||
| Dynamic service reconfiguration for wireless web access | | BIBAK | Full-Text | 58-67 | |
| Siu-Nam Chuang; Alvin T. S. Chan; Jiannong Cao; Ronnie Cheung | |||
| This paper describes a dynamic service reconfiguration model where the proxy
is composed of a chain of service objects called mobilets (pronounced as
mo-be-lets), which can be actively deployed onto the network. This model offers
flexibility because the chain of mobilets can be dynamically reconfigured to
adapt to the vigorous changes in the characteristics of the wireless
environment, without interrupting the service provision for other mobile nodes.
Furthermore, mobilets can also be migrated to a new proxy server when the
mobile node moves to a different network domain. We have realized the dynamic
service reconfiguration model by crafting its design into a programmable
infrastructure that forms the baseline architecture of the WebPADS (short for
Web Proxy for Actively Deployable Services) system. Keywords: active services, dynamic service reconfiguration, wireless environment
adaptation, wireless web access | |||
| Web browsing performance of wireless thin-client computing | | BIBAK | Full-Text | 68-79 | |
| S. Jae Yang; Jason Nieh; Shilpa Krishnappa; Aparna Mohla; Mahdi Sajjadpour | |||
| Web applications are becoming increasingly popular for mobile wireless
systems. However, wireless networks can have high packet loss rates, which can
degrade web browsing performance on wireless systems. An alternative approach
is wireless thin-client computing, in which the web browser runs on a remote
thin server with a more reliable wired connection to the Internet. A mobile
client then maintains a connection to the thin server to receive display
updates over the lossy wireless network. To assess the viability of this
thin-client approach, we compare the web browsing performance of thin clients
against fat clients that run the web browser locally in lossy wireless
networks. Our results show that thin clients can operate quite effectively over
lossy networks. Compared to fat clients running web browsers locally, our
results show, surprisingly, that thin clients can be faster and more resilient on
web applications over lossy wireless LANs despite having to send more data over
the network. We characterize and analyze different design choices in various
thin-client systems and explain why these approaches can yield superior web
browsing performance in lossy wireless networks. Keywords: thin-client computing, web performance, wireless and mobility | |||
| Sensor-enhanced mobile web clients: an XForms approach | | BIBAK | Full-Text | 80-89 | |
| John Barton; Tim Kindberg; Hui Dai; Nissanka B. Priyantha; Fahd Al-bin-ali | |||
| This paper describes methods for service selection and service access for
mobile, sensor-enhanced web clients such as wireless cameras or wireless PDAs
with sensor devices attached. The clients announce their data-creating
capabilities in "Produce" headers sent to servers; servers respond with forms
that match these capabilities. Clients fill in these forms with sensor data as
well as text or file data. The resultant system enables clients to access
dynamically discovered services spontaneously, as their users engage in
everyday nomadic activities. Keywords: MIME types, browsers, forms, mobile computing, sensors, ubiquitous computing | |||
| Text joins in an RDBMS for web data integration | | BIBAK | Full-Text | 90-101 | |
| Luis Gravano; Panagiotis G. Ipeirotis; Nick Koudas; Divesh Srivastava | |||
| The integration of data produced and collected across autonomous,
heterogeneous web services is an increasingly important and challenging
problem. Due to the lack of global identifiers, the same entity (e.g., a
product) might have different textual representations across databases. Textual
data is also often noisy because of transcription errors, incomplete
information, and lack of standard formats. A fundamental task during data
integration is matching of strings that refer to the same entity. In this
paper, we adopt the widely used and established cosine similarity metric from
the information retrieval field in order to identify potential string matches
across web sources. We then use this similarity metric to characterize this key
aspect of data integration as a join between relations on textual attributes,
where the similarity of matches exceeds a specified threshold. Computing an
exact answer to the text join can be expensive. For query processing
efficiency, we propose a sampling-based join approximation strategy for
execution in a standard, unmodified relational database management system
(RDBMS), since more and more web sites are powered by RDBMSs with a web-based
front end. We implement the join inside an RDBMS, using SQL queries, for
scalability and robustness reasons. Finally, we present a detailed performance
evaluation of an implementation of our algorithm within a commercial RDBMS,
using real-life data sets. Our experimental results demonstrate the efficiency
and accuracy of our techniques. Keywords: approximate text matching, data cleaning, text indexing | |||
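A minimal sketch of the underlying idea, expressing a cosine-similarity text join as plain SQL over pre-computed, length-normalized tf-idf weight relations, is shown below using SQLite from Python. The table names, toy weights, and threshold are illustrative assumptions; the paper's sampling-based approximation step is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE r1_weights (tid INTEGER, token TEXT, weight REAL);
CREATE TABLE r2_weights (tid INTEGER, token TEXT, weight REAL);
""")

# Pre-computed, length-normalized tf-idf weights for two tiny relations.
conn.executemany("INSERT INTO r1_weights VALUES (?, ?, ?)",
                 [(1, "acme", 0.9), (1, "inc", 0.44),
                  (2, "widget", 0.8), (2, "corp", 0.6)])
conn.executemany("INSERT INTO r2_weights VALUES (?, ?, ?)",
                 [(10, "acme", 0.85), (10, "incorporated", 0.53),
                  (20, "widget", 0.7), (20, "company", 0.71)])

# Cosine-similarity join as ordinary SQL: sum the products of the weights
# of shared tokens and keep the string pairs above the threshold.
threshold = 0.5
rows = conn.execute("""
    SELECT a.tid, b.tid, SUM(a.weight * b.weight) AS sim
    FROM   r1_weights a JOIN r2_weights b ON a.token = b.token
    GROUP  BY a.tid, b.tid
    HAVING SUM(a.weight * b.weight) >= ?
    ORDER  BY sim DESC
""", (threshold,)).fetchall()

for r1_id, r2_id, sim in rows:
    print(r1_id, r2_id, round(sim, 3))   # e.g. 1 10 0.765 and 2 20 0.56
```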
| Dynamic maintenance of web indexes using landmarks | | BIBAK | Full-Text | 102-111 | |
| Lipyeow Lim; Min Wang; Sriram Padmanabhan; Jeffrey Scott Vitter; Ramesh Agarwal | |||
| Recent work on incremental crawling has enabled the indexed document
collection of a search engine to be more synchronized with the changing World
Wide Web. However, this synchronized collection is not immediately searchable,
because the keyword index is rebuilt from scratch less frequently than the
collection can be refreshed. An inverted index is usually used to index
documents crawled from the web. Complete index rebuild at high frequency is
expensive. Previous work on incremental inverted index updates has been
restricted to adding and removing documents. Updating the inverted index for
previously indexed documents that have changed has not been addressed.
In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index. Keywords: inverted files, update processing | |||
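A rough sketch of the landmark idea, under our own simplifications: postings name a (document, landmark) pair instead of an absolute offset, so when a document changes, a diff over its landmark blocks determines which postings to drop and which blocks to re-index, while postings in unchanged blocks are left untouched. The one-block-per-landmark encoding and the absence of in-block positions are simplifications, not the paper's exact scheme.

```python
import difflib
from collections import defaultdict
from uuid import uuid4

# Inverted index: term -> set of (doc_id, landmark_id). Because postings
# name a landmark rather than an absolute offset, blocks that merely shift
# position need no update.
index = defaultdict(set)
doc_blocks = {}      # doc_id -> list of (landmark_id, block_text)

def _index_block(doc_id, block_text):
    landmark_id = uuid4().hex
    for term in block_text.lower().split():
        index[term].add((doc_id, landmark_id))
    return landmark_id, block_text

def add_document(doc_id, blocks):
    doc_blocks[doc_id] = [_index_block(doc_id, b) for b in blocks]

def update_document(doc_id, new_blocks):
    """Diff old and new block lists; only postings of deleted or changed
    blocks are removed, and only new blocks are re-indexed."""
    old = doc_blocks[doc_id]
    matcher = difflib.SequenceMatcher(a=[b for _, b in old], b=new_blocks)
    rebuilt = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":                        # untouched landmarks survive
            rebuilt.extend(old[i1:i2])
            continue
        for landmark_id, block in old[i1:i2]:    # drop postings of old blocks
            for term in block.lower().split():
                index[term].discard((doc_id, landmark_id))
        rebuilt.extend(_index_block(doc_id, b) for b in new_blocks[j1:j2])
    doc_blocks[doc_id] = rebuilt

add_document("d1", ["breaking news about storms", "sports results"])
update_document("d1", ["breaking news about storms", "updated sports results"])
```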
| High-performance spatial indexing for location-based services | | BIBAK | Full-Text | 112-117 | |
| Jussi Myllymaki; James Kaufman | |||
| Much attention has been accorded to Location-Based Services and location
tracking, a necessary component in active, trigger-based LBS applications.
Tracking the location of a large population of moving objects requires very
high update and query performance of the underlying spatial index. In this
paper we investigate the performance and scalability of three main-memory based
spatial indexing methods under dynamic update and query loads: an R-tree, a
ZB-tree, and an array/hashtable method. By leveraging the LOCUS performance
evaluation testbed and the City Simulator dynamic spatial data generator, we
are able to demonstrate the scalability of these methods and determine the
maximum population size supported by each method, a useful parameter for
capacity planning by wireless carriers. Keywords: location-based services, mobile computing, spatial indexing | |||
| Efficient and robust streaming provisioning in VPNs | | BIBAK | Full-Text | 118-127 | |
| Z. Morley Mao; David Johnson; Oliver Spatscheck; Jacobus E. van der Merwe; Jia Wang | |||
| Today, most large companies maintain virtual private networks (VPNs) to
connect their remote locations into a single secure network. VPNs can be quite
large, covering more than 1000 locations, and in most cases use standard Internet
protocols and services. Such VPNs are implemented using a diverse set of
technologies such as Frame Relay, MPLS, or IPSEC to achieve the goal of privacy
and performance isolation from the public Internet.
Using VPNs to distribute live content has recently received tremendous interest. For example, a VPN could be used to broadcast a CEO-employee town hall meeting. To distribute this type of content economically without overloading the network, the deployment of streaming caches or splitters is most likely required. In this paper, we address the problem of optimally placing such streaming splitters or caches to broadcast to a given set of VPN endpoints under the constraints typically found within a VPN. In particular, we introduce an efficient algorithm with complexity O(V), V being the number of routers in the VPN. This guarantees the optimal cache placement if interception is used for redirection. We prove that the general problem is NP-hard and introduce multiple heuristics for efficient and robust cache placement suitable under different constraints. At the expense of increased implementation complexity, each heuristic solution provides additional saving in the number of caches required. We evaluate proposed solutions using extensive simulations. In particular, we show our flow-based solution is very close to the optimal. Keywords: VPNs, streaming server placement | |||
| On admission control for profit maximization of networked service providers | | BIBAK | Full-Text | 128-137 | |
| Akshat Verma; Sugata Ghosal | |||
| Variability and diverseness among incoming requests to a service hosted on a
finite-capacity resource necessitate sophisticated request admission control
techniques for providing guaranteed quality of service (QoS). We propose in
this paper a service time based online admission control methodology for
maximizing profits of a service provider. The proposed methodology chooses a
subset of incoming requests such that the revenue of the provider is maximized.
The admission control decision in our proposed system is based upon an estimate of
the service time of the request, QoS bounds, prediction of arrivals and service
times of requests to come in the short-term future, and rewards associated with
servicing a request within its QoS bounds. Effectiveness of the proposed
admission control methodology is demonstrated using experiments with a
content-based messaging middleware service. Keywords: admission control, profit maximization, quality of service (QoS), service
level agreement (SLA), service time estimation, short-term prediction, shortest
remaining job first (SRJF), web service | |||
| Design, implementation, and evaluation of a client characterization driven web server | | BIBAK | Full-Text | 138-147 | |
| Balachander Krishnamurthy; Yin Zhang; Craig E. Wills; Kashi Vishwanath | |||
| In earlier work we proposed a way for a Web server to detect connectivity
information about clients accessing it in order to take tailored actions for a
client request. This paper describes the design, implementation, and evaluation
of such a working system. A Web site has a strong incentive to reduce the
'time-to-glass' to retain users who may otherwise lose interest and leave the
site. We have performed a measurement study from multiple client sites around
the world with various levels of connectivity to the Internet communicating
with modified Apache Web servers under our control. The results show that
clients can be classified in a correct and stable manner and that
user-perceived latency can be reduced via tailored actions. Our measurements
show that classification and determination of server actions are done without
significant overhead on the Web server. We explore a variety of modified
actions ranging from selecting a lower quality version of the resource to
altering the manner of content delivery. By studying numerous performance
related factors in a single unified framework and examining both individual
actions as well as combinations of actions, our modified Web server
implementation shows the efficacy of various server actions. Keywords: apache server, client classification, content delivery, httperf, server
adaptation, web performance | |||
| Web application security assessment by fault injection and behavior monitoring | | BIBAK | Full-Text | 148-159 | |
| Yao-Wen Huang; Shih-Kun Huang; Tsung-Po Lin; Chung-Hung Tsai | |||
| As a large and complex application platform, the World Wide Web is capable
of delivering a broad range of sophisticated applications. However, many Web
applications go through rapid development phases with extremely short
turnaround time, making it difficult to eliminate vulnerabilities. Here we
analyze the design of Web application security assessment mechanisms in order
to identify poor coding practices that render Web applications vulnerable to
attacks such as SQL injection and cross-site scripting. We describe the use of
a number of software-testing techniques (including dynamic analysis, black-box
testing, fault injection, and behavior monitoring), and suggest mechanisms for
applying these techniques to Web applications. Real-world situations are used
to test a tool we named the Web Application Vulnerability and Error Scanner
(WAVES, an open-source project available at http://waves.sourceforge.net) and
to compare it with other tools. Our results show that WAVES is a feasible
platform for assessing Web application security. Keywords: black-box testing, complete crawling, fault injection, security assessment,
web application testing | |||
| The HP time vault service: exploiting IBE for timed release of confidential information | | BIBAK | Full-Text | 160-169 | |
| Marco Casassa Mont; Keith Harrison; Martin Sadler | |||
| Digital information is increasingly important to enable
interactions and transactions on the Internet. On the other hand, leakages of
sensitive information can have harmful effects for people, enterprises and
governments.
This paper focuses on the problems of dealing with timed release of confidential information and simplifying its access once it becomes public: this is a common issue in industry, government and day-to-day life. We introduce the "HP Time Vault Service", based on the emerging Identifier-based Encryption (IBE) cryptography scheme. IBE (public) encryption keys specify the disclosure time. These keys are used to encrypt confidential information. An independent time server generates and publishes the IBE decryption keys corresponding to the current time at predefined intervals. We discuss the advantages of this approach over current approaches based on traditional cryptography. A web-service-based prototype is described as a proof of concept. Keywords: disclosure policies, identifier-based encryption, privacy, security,
timed-release, web service | |||
| Content extraction signatures using XML digital signatures and custom transforms on-demand | | BIBAK | Full-Text | 170-177 | |
| Laurence Bull; Peter Stanski; David McG. Squire | |||
| Content Extraction Signatures (CES) enable selective disclosure of
verifiable content, provide privacy for blinded content, and enable the signer
to specify the content the document owner is allowed to extract or blind.
Combined, these properties give what we call CES functionality. In this paper
we describe our work in developing custom transform algorithms to expand the
functionality of an XML Signature to include CES functionality in XML Signature
Core Validation.
We also describe a custom revocation mechanism and our implementation for non-XML content, where the custom transforms are dynamically loaded, demonstrating that custom signing and verification are not constrained to a 'closed system'. Through the use of dynamic loading we show that a verifier can still verify an XML Signature-compliant signature even though a custom signature was produced. Keywords: .Net framework XML signature API, XML signature custom transforms, XML
signatures, content extraction signatures, dynamic signature verification | |||
| SemTag and seeker: bootstrapping the semantic web via automated semantic annotation | | BIBAK | Full-Text | 178-186 | |
| Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; R. Guha; Anant Jhingran; Tapas Kanungo; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien | |||
| This paper describes Seeker, a platform for large-scale text analytics, and
SemTag, an application written on the platform to perform automated semantic
tagging of large corpora. We apply SemTag to a collection of approximately 264
million web pages, and generate approximately 434 million automatically
disambiguated semantic tags, published to the web as a label bureau providing
metadata regarding the 434 million annotations. To our knowledge, this is the
largest scale semantic tagging effort to date.
We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web. Keywords: automated semantic tagging, data mining, information retrieval, large text
datasets, text analytics | |||
| Data extraction and label assignment for web databases | | BIBAK | Full-Text | 187-196 | |
| Jiying Wang; Fred H. Lochovsky | |||
| Many tools have been developed to help users query, extract and integrate
data from web pages generated dynamically from databases, i.e., from the Hidden
Web. A key prerequisite for such tools is to obtain the schema of the
attributes of the retrieved data. In this paper, we describe a system, called
DeLa, which reconstructs (part of) a "hidden" back-end web database. It does
this by sending queries through HTML forms, automatically generating regular
expression wrappers to extract data objects from the result pages and restoring
the retrieved data into an annotated (labelled) table. The whole process needs
no human involvement and proves to be fast (less than one minute for wrapper
induction for each site) and accurate (over 90% correctness for data extraction
and around 80% correctness for label assignment). Keywords: HTML forms, automatic wrapper induction, data annotation, hidden web,
information integration, web information extraction | |||
| The chatty web: emergent semantics through gossiping | | BIBAK | Full-Text | 197-206 | |
| Karl Aberer; Philippe Cudré-Mauroux; Manfred Hauswirth | |||
| This paper describes a novel approach for obtaining semantic
interoperability among data sources in a bottom-up, semi-automatic manner
without relying on pre-existing, global semantic models. We assume that large
amounts of data exist that have been organized and annotated according to local
schemas. Seeing semantics as a form of agreement, our approach enables the
participating data sources to incrementally develop global agreement in an
evolutionary and completely decentralized process that solely relies on
pair-wise, local interactions: Participants provide translations between
schemas they are interested in and can learn about other translations by
routing queries (gossiping). To support the participants in assessing the
semantic quality of the achieved agreements we develop a formal framework that
takes into account both syntactic and semantic criteria. The assessment process
is incremental and the quality ratings are adjusted along with the operation of
the system. Ultimately, this process results in global agreement, i.e., the
semantics that all participants understand. We discuss strategies to
efficiently find translations and provide results from a case study to justify
our claims. Our approach applies to any system which provides a communication
infrastructure (existing websites or databases, decentralized systems, P2P
systems) and offers the opportunity to study semantic interoperability as a
global phenomenon in a network of information sharing parties. Keywords: self-organization, semantic agreements, semantic integration | |||
| DOM-based content extraction of HTML documents | | BIBAK | Full-Text | 207-214 | |
| Suhit Gupta; Gail Kaiser; David Neistadt; Peter Grimm | |||
| Web pages often contain clutter (such as pop-up ads, unnecessary images and
extraneous links) around the body of an article that distracts a user from
actual content. Extraction of "useful and relevant" content from web pages has
many applications, including cell phone and PDA browsing, speech rendering for
the visually impaired, and text summarization. Most approaches to removing
clutter or making content more readable involve changing font size or removing
HTML and data components such as images, which takes away from a webpage's
inherent look and feel. Unlike "Content Reformatting", which aims to reproduce
the entire webpage in a more convenient form, our solution directly addresses
"Content Extraction". We have developed a framework that employs easily
extensible set of techniques that incorporate advantages of previous work on
content extraction. Our key insight is to work with the DOM trees, rather than
with raw HTML markup. We have implemented our approach in a publicly available
Web proxy to extract content from HTML web pages. Keywords: DOM trees, HTML documents, accessibility, content extraction, reformatting,
speech rendering | |||
| Fractal summarization for mobile devices to access large documents on the web | | BIBAK | Full-Text | 215-224 | |
| Christopher C. Yang; Fu Lee Wang | |||
| Wireless access with mobile (or handheld) devices is a promising addition to
the WWW and traditional electronic business. Mobile devices provide convenience
and portable access to the huge information space on the Internet without
requiring users to be stationary with network connection. However, the limited
screen size, narrow network bandwidth, small memory capacity and low computing
power are the shortcomings of handheld devices. Loading and visualizing large
documents on handheld devices become impossible. The limited resolution
restricts the amount of information to be displayed. The download time is
intolerably long. In this paper, we introduce the fractal summarization model
for document summarization on handheld devices. Fractal summarization is
developed based on the fractal theory. It generates a brief skeleton of summary
at the first stage, and the details of the summary on different levels of the
document are generated on demand. Such interactive summarization
reduces the computation load compared with generating the entire
summary in one batch, as traditional automatic summarization does, making the approach ideal
for wireless access. A three-tier architecture with the middle tier conducting
the major computation is also discussed. Visualization of the summary on handheld
devices is also investigated. Keywords: document summarization, fractal summarization, handheld devices, mobile
commerce | |||
| Detecting web page structure for adaptive viewing on small form factor devices | | BIBAK | Full-Text | 225-233 | |
| Yu Chen; Wei-Ying Ma; Hong-Jiang Zhang | |||
| Mobile devices have already been widely used to access the Web. However,
because most available web pages are designed with desktop PCs in mind, it is
inconvenient to browse these large web pages on a mobile device with a small
screen. In this paper, we propose a new browsing convention to facilitate
navigation and reading on a small-form-factor device. A web page is organized
into a two level hierarchy with a thumbnail representation at the top level for
providing a global view and index to a set of sub-pages at the bottom level for
detailed information. A page adaptation technique is also developed to analyze
the structure of an existing web page and split it into small and logically
related units that fit into the screen of a mobile device. For a web page not
suitable for splitting, auto-positioning or scrolling-by-block is used to
assist the browsing as an alternative. Our experimental results show that our
proposed browsing convention and developed page adaptation scheme greatly
improve the user's browsing experience on a device with a small display. Keywords: adaptive hypermedia, content adaptation, mobile browser | |||
| Supporting management reporting: a writable web case study | | BIBAK | Full-Text | 234-243 | |
| Timothy Miles-Board; Leslie Carr; Simon Kampa; Wendy Hall | |||
| The World-Wide Web was originally developed as a shared, writable, hypertext
medium, a facility that is still widely needed.
We have recently developed a Web-based management reporting system for a legal firm in an attempt to improve the efficiency and management of their overall business process. This paper shares our experiences in relating the firm's specific writing and issue tracking tasks to existing Web, open hypermedia, and Semantic Web research, and describes why we chose to develop a new solution -- a set of open hypermedia components collectively called the Management Reporting System -- rather than employ an existing system. Keywords: hypertext writing, management reporting, open hypermedia, structural
computing | |||
| Scholarly publishing and argument in hyperspace | | BIBAK | Full-Text | 244-250 | |
| Victoria Uren; Simon Buckingham Shum; Gangmin Li; John Domingue; Enrico Motta | |||
| The World Wide Web is opening up access to documents and data for scholars.
However, it has not yet had an impact on one of the primary activities in research:
assessing new findings in the light of current knowledge and debating them with
colleagues. The ClaiMaker system uses a directed graph model with similarities
to hypertext, in which new ideas are published as nodes, which other
contributors can build on or challenge in a variety of ways by linking to them.
Nodes and links have semantic structure to facilitate the provision of
specialist services for interrogating and visualizing the emerging network. By
way of example, this paper is grounded in a ClaiMaker model to illustrate how
new claims can be described in this structured way. Keywords: collaborative web, modeling debate, scholarly interpretation, scientific
publishing | |||
| Mining topic-specific concepts and definitions on the web | | BIBAK | Full-Text | 251-260 | |
| Bing Liu; Chee Wee Chin; Hwee Tou Ng | |||
| Traditionally, when one wants to learn about a particular topic, one reads a
book or a survey paper. With the rapid expansion of the Web, learning in-depth
knowledge about a topic from the Web is becoming increasingly important and
popular. This is also due to the Web's convenience and its richness of
information. In many cases, learning from the Web may even be essential because
in our fast changing world, emerging topics appear constantly and rapidly.
There is often not enough time for someone to write a book on such topics. To
learn such emerging topics, one can resort to research papers. However,
research papers are often hard for non-researchers to understand, and few
research papers cover every aspect of the topic. In contrast, many Web pages
often contain intuitive descriptions of the topic. To find such Web pages, one
typically uses a search engine. However, current search techniques are not
designed for in-depth learning. Top ranking pages from a search engine may not
contain any description of the topic. Even if they do, the description is
usually incomplete since it is unlikely that the owner of the page has good
knowledge of every aspect of the topic. In this paper, we attempt a novel and
challenging task, mining topic-specific knowledge on the Web. Our goal is to
help people learn in-depth knowledge of a topic systematically on the Web. The
proposed techniques first identify those sub-topics or salient concepts of the
topic, and then find and organize those informative pages, containing
definitions and descriptions of the topic and sub-topics, just like those in a
book. Experimental results using 28 topics show that the proposed techniques
are highly effective. Keywords: definition mining, domain concept mining, information integration, knowledge
compilation, web content mining | |||
| Extrapolation methods for accelerating PageRank computations | | BIBAK | Full-Text | 261-270 | |
| Sepandar D. Kamvar; Taher H. Haveliwala; Christopher D. Manning; Gene H. Golub | |||
| We present a novel algorithm for the fast computation of PageRank, a
hyperlink-based estimate of the "importance" of Web pages. The original
PageRank algorithm uses the Power Method to compute successive iterates that
converge to the principal eigenvector of the Markov matrix representing the Web
link graph. The algorithm presented here, called Quadratic Extrapolation,
accelerates the convergence of the Power Method by periodically subtracting off
estimates of the nonprincipal eigenvectors from the current iterate of the
Power Method. In Quadratic Extrapolation, we take advantage of the fact that
the first eigenvalue of a Markov matrix is known to be 1 to compute the
nonprincipal eigenvectors using successive iterates of the Power Method.
Empirically, we show that using Quadratic Extrapolation speeds up PageRank
computation by 25-300% on a Web graph of 80 million nodes, with minimal
overhead. Our contribution is useful to the PageRank community and the
numerical linear algebra community in general, as it is a fast method for
determining the dominant eigenvector of a matrix that is too large for standard
fast methods to be practical. Keywords: PageRank, eigenvector computation, link analysis | |||
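A compact sketch of the scheme as we read it from the abstract: iterate the power method on the Google matrix and, every few steps, combine the last four iterates with coefficients from a small least-squares fit so that estimated second and third eigenvector components cancel. The 4-node graph, the damping factor 0.85, and the extrapolation schedule below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def google_matrix(links, alpha=0.85):
    """Column-stochastic matrix for a tiny link graph (node -> out-neighbours)
    with uniform teleportation."""
    n = len(links)
    A = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            A[j, i] = 1.0 / len(outs)
    return alpha * A + (1 - alpha) / n * np.ones((n, n))

def quadratic_extrapolation(x3, x2, x1, x0):
    """x3 is the oldest iterate x^(k-3), x0 the newest x^(k).  Fit the
    differences by least squares and recombine the iterates so that the
    estimated non-principal eigenvector components are subtracted off."""
    y2, y1, y0 = x2 - x3, x1 - x3, x0 - x3
    Y = np.column_stack([y2, y1])
    (g1, g2), *_ = np.linalg.lstsq(Y, -y0, rcond=None)
    g3 = 1.0
    b0, b1, b2 = g1 + g2 + g3, g2 + g3, g3
    x = b0 * x2 + b1 * x1 + b2 * x0
    return x / x.sum()                 # re-normalize to a distribution

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
M = google_matrix(links)
x = np.ones(4) / 4
history = [x]
for k in range(1, 50):
    x = M @ history[-1]
    history.append(x)
    if k % 10 == 0 and len(history) >= 4:   # periodic extrapolation step
        x = quadratic_extrapolation(*history[-4:])
        history.append(x)
print(np.round(x, 4))                       # approximate PageRank vector
```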
| Scaling personalized web search | | BIBAK | Full-Text | 271-279 | |
| Glen Jeh; Jennifer Widom | |||
| Recent web search techniques augment traditional text matching with a global
notion of "importance" based on the linkage structure of the web, such as in
Google's PageRank algorithm. For more refined searches, this global notion of
importance can be specialized to create personalized views of importance -- for
example, importance scores can be biased according to a user-specified set of
initially-interesting pages. Computing and storing all possible personalized
views in advance is impractical, as is computing personalized views at query
time, since the computation of each view requires an iterative computation over
the web graph. We present new graph-theoretical results, and a new technique
based on these results, that encode personalized views as partial vectors.
Partial vectors are shared across multiple personalized views, and their
computation and storage costs scale well with the number of views. Our approach
enables incremental computation, so that the construction of personalized views
from partial vectors is practical at query time. We present efficient dynamic
programming algorithms for computing partial vectors, an algorithm for
constructing personalized views from partial vectors, and experimental results
demonstrating the effectiveness and scalability of our techniques. Keywords: PageRank, web search | |||
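The structural property that makes shared partial vectors possible can be stated in one line: the personalized PageRank vector is linear in the teleportation (preference) vector, so the view for any mixture of preferred pages is the same mixture of per-page basis vectors. The notation below is ours, not the paper's.

```latex
% v_u: personalized PageRank for preference vector u; c: teleportation probability.
v_u = (1-c)\,A\,v_u + c\,u ,
\qquad
u = \sum_i \alpha_i e_i \;\Longrightarrow\; v_u = \sum_i \alpha_i\, v_{e_i}.
```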
| Adaptive on-line page importance computation | | BIBAK | Full-Text | 280-290 | |
| Serge Abiteboul; Mihai Preda; Gregory Cobena | |||
| The computation of page importance in a huge dynamic graph has recently
attracted a lot of attention because of the web. Page importance, or page rank,
is defined as the fixpoint of a matrix equation. Previous algorithms compute it
off-line and require the use of a lot of extra CPU as well as disk resources
(e.g. to store, maintain and read the link matrix). We introduce a new
algorithm OPIC that works on-line, and uses much less resources. In particular,
it does not require storing the link matrix. It is on-line in that it
continuously refines its estimate of page importance while the web/graph is
visited. Thus it can be used to focus crawling on the most interesting pages.
We prove the correctness of OPIC. We present Adaptive OPIC that also works
on-line but adapts dynamically to changes of the web. A variant of this
algorithm is now used by Xyleme.
We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web. Keywords: hyperlink, page importance, web graph | |||
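The on-line computation is often explained with a "cash and history" formulation, sketched below: every page holds some cash; when a page is visited, its cash is added to its history and redistributed equally over its out-links, and the importance estimate is the page's share of all accumulated history. The round-robin visiting policy, the tiny graph, and the dangling-page rule are illustrative choices; the adaptive time window of Adaptive OPIC is omitted.

```python
from collections import defaultdict
import itertools

def opic(links, visits=10000):
    """On-line page importance: no link matrix is stored, only a cash
    value and a history counter per page."""
    n = len(links)
    cash = {p: 1.0 / n for p in links}
    history = defaultdict(float)
    # Any fair visiting policy works; round-robin is the simplest.
    for page in itertools.islice(itertools.cycle(links), visits):
        c = cash[page]
        history[page] += c
        cash[page] = 0.0
        outs = links[page] or list(links)   # dangling page: spread everywhere
        for q in outs:
            cash[q] += c / len(outs)
    total = sum(history.values()) or 1.0
    return {p: history[p] / total for p in links}   # importance estimates

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
print({p: round(v, 3) for p, v in opic(links).items()})
```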
| SHOCK: communicating with computational messages and automatic private profiles | | BIBAK | Full-Text | 291-300 | |
| Rajan M. Lukose; Eytan Adar; Joshua R. Tyler; Caesar Sengupta | |||
| A computationally enhanced message contains some embedded programmatic
components that are interpreted and executed automatically upon receipt. Unlike
ordinary text email or instant messages, they make possible a number of useful
applications. In this paper, we describe a general and flexible messaging
system called SHOCK that extends the functionality of prior computational email
systems by allowing XML-encoded SHOCK messages to interact with an
automatically created profile of a user. These profiles consist of information
about the most common tasks users perform, such as their Web browsing behavior,
their conventional email usage, etc. Since users are sensitive about such data,
the system is designed with privacy as a central design goal, and employs a
distributed peer-to-peer architecture to achieve it. The system is largely
implemented with commodity Web technologies and provides both a Web interface
as well as one that is tightly integrated with users' ordinary email clients.
With SHOCK, users can send highly targeted messages without violating others'
privacy, and engage in structured conversation appropriate to the context
without disrupting their existing work practices. We describe our
implementation in detail, the most useful novel applications of the system, and
our experiences with the system in a pilot field test. Keywords: collaborative systems, networking and distributed web applications, privacy
and preferences | |||
| P2Cast: peer-to-peer patching scheme for VoD service | | BIBAK | Full-Text | 301-309 | |
| Yang Guo; Kyoungwon Suh; Jim Kurose; Don Towsley | |||
| Providing video on demand (VoD) service over the Internet in a scalable way
is a challenging problem. In this paper, we propose P2Cast -- an architecture
that uses a peer-to-peer approach to cooperatively stream video using patching
techniques, while only relying on unicast connections among peers. We address
the following two key technical issues in P2Cast: (1) constructing an
application overlay appropriate for streaming; and (2) providing continuous
stream playback (without glitches) in the face of disruption from an early
departing client. Our simulation experiments show that P2Cast can serve many
more clients than traditional client-server unicast service, and that it
generally outperforms multicast-based patching if clients can cache more than
of a stream's initial portion. We handle disruptions by delaying the start of
playback and applying the shifted forwarding technique. A threshold on the
length of time during which arriving clients are served in a single session in
P2Cast serves as a knob to adjust the balance between the scalability and the
clients' viewing quality in P2Cast. Keywords: patching, peer-to-peer networks, performance evaluation, video on-demand
service | |||
| DEW: DNS-enhanced web for faster content delivery | | BIBA | Full-Text | 310-320 | |
| Balachander Krishnamurthy; Richard Liston; Michael Rabinovich | |||
| With a key component of latency on the Web being connection set up between clients and Web servers, several ways to avoid connections have been explored. While the work in recent years on Content Distribution Networks (CDNs) have moved some content 'closer' to users at the cost of increasing DNS traffic, they have not fully exploited the available unused potential of existing protocols. We explore ways by which a variety of Web responses can be piggybacked on DNS messages. While we evaluated our idea in the Web context, the approach is generic and not restricted to Web responses. We propose an architecture for HTTP piggybacking in DNS messages and carry out a detailed performance analysis based on a trace-driven simulation study. Our architecture requires minimal extensions to existing protocols, utilizing only the allowed optional fields for these extensions. It is fully compatible and can coexist with the current Web. | |||
| A system for principled matchmaking in an electronic marketplace | | BIBAK | Full-Text | 321-330 | |
| Tommaso Di Noia; Eugenio Di Sciascio; Francesco M. Donini; Marina Mongiello | |||
| More and more resources are becoming available on the Web, and there is a
growing need for infrastructures that, based on advertised descriptions, are
able to semantically match demands with supplies.
We formalize general properties a matchmaker should have, then we present a matchmaking facilitator, compliant with desired properties. The system embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories. Experiments carried out show the good correspondence between users and system rankings. Keywords: description logics, e-commerce, knowledge representation, matchmaking | |||
| A software framework for matchmaking based on semantic web technology | | BIBAK | Full-Text | 331-339 | |
| Lei Li; Ian Horrocks | |||
| An important objective of the Semantic Web is to make Electronic Commerce
interactions more flexible and automated. To achieve this, standardization of
ontologies, message content and message protocols will be necessary.
In this paper we investigate how Semantic and Web Services technologies can be used to support service advertisement and discovery in e-commerce. In particular, we describe the design and implementation of a service matchmaking prototype which uses a DAML-S based ontology and a Description Logic reasoner to compare ontology based service descriptions. We also present the results of initial experiments testing the performance of this prototype implementation in a realistic agent based e-commerce scenario. Keywords: ontologies, semantic web, web services | |||
| SweetDeal: representing agent contracts with exceptions using XML rules, ontologies, and process descriptions | | BIBAK | Full-Text | 340-349 | |
| Benjamin N. Grosof; Terrence C. Poon | |||
| SweetDeal is a rule-based approach to representation of business contracts
that enables software agents to create, evaluate, negotiate, and execute
contracts with substantial automation and modularity. It builds upon the
situated courteous logic programs knowledge representation in RuleML, the
emerging standard for Semantic Web XML rules. Here, we newly extend the
SweetDeal approach by also incorporating process knowledge descriptions whose
ontologies are represented in DAML+OIL (the emerging standard for Semantic Web
ontologies), thereby enabling more complex contracts with behavioral provisions,
especially for handling exception conditions (e.g., late delivery or
non-payment) that might arise during the execution of the contract. This
provides a foundation for representing and automating deals about services --
in particular, about Web Services, so as to help search, select, and compose
them. Our system is also the first to combine emerging Semantic Web standards
for knowledge representation of rules (RuleML) with ontologies (DAML+OIL) for a
practical e-business application domain, and further to do so with process
knowledge. This also newly fleshes out the evolving concept of Semantic Web
Services. A prototype (soon public) is running. Keywords: DAML+OIL, OWL, RDF, XML, business process automation, declarative,
description logic, electronic commerce, electronic contracts, intelligent
software agents, knowledge representation, knowledge-based, logic programs,
ontologies, process descriptions, process knowledge, rules, semantic web,
semantic web services, web services | |||
| A new paradigm for ranking pages on the world wide web | | BIBAK | Full-Text | 350-355 | |
| John A. Tomlin | |||
| This paper describes a new paradigm for modeling traffic levels on the world
wide web (WWW) using a method of entropy maximization. This traffic is subject
to the conservation conditions of a circulation flow in the entire WWW, an
aggregation of the WWW, or a subgraph of the WWW (such as an intranet or
extranet). We specifically apply the primal and dual solutions of this model to
the (static) ranking of web sites. The first of these uses an imputed measure
of total traffic through a web page; the second provides an analogy of local
"temperature", allowing us to quantify the "HOTness" of a page. Keywords: entropy, optimization, search engines, static ranking | |||
| Adaptive ranking of web pages | | BIBAK | Full-Text | 356-365 | |
| Ah Chung Tsoi; Gianni Morini; Franco Scarselli; Markus Hagenbuchner; Marco Maggini | |||
| In this paper, we consider the possibility of altering the PageRank of web
pages, from an administrator's point of view, through the modification of the
PageRank equation. It is shown that this problem can be solved using
traditional quadratic programming techniques. In addition, it is shown that the
number of parameters can be reduced by clustering web pages together through
simple clustering techniques. It is demonstrated experimentally on a
relatively large web data set, viz., the WT10G, that it is possible to modify
the PageRanks of the web pages through the proposed method using a set of
linear constraints. It is also shown that the PageRank of other pages may be
affected; and that the quality of the result depends on the clustering
technique used. It is shown that our results compare well with those obtained
by a HITS-based method. Keywords: PageRank, adaptive PageRank determinations, learning PageRank, quadratic
programming applications, search engine | |||
| Searching the workplace web | | BIBA | Full-Text | 366-375 | |
| Ronald Fagin; Ravi Kumar; Kevin S. McCurley; Jasmine Novak; D. Sivakumar; John A. Tomlin; David P. Williamson | |||
| The social impact of the World Wide Web cannot be overstated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite different, and the determination of a "good answer" for intranet search is quite different from that on the Internet. In this paper we study the problem of intranet search. Our approach focuses on the use of rank aggregation, and allows us to examine the effects of different heuristics on ranking of search results. | |||
| Peer-to-peer architecture for content-based music retrieval on acoustic data | | BIBAK | Full-Text | 376-383 | |
| Cheng Yang | |||
| In traditional peer-to-peer search networks, operations focus on properly
labeled files such as music or video, and the actual search is often limited to
text tags. The explosive growth of available multimedia documents in recent
years calls for more flexible search capabilities, namely search by content.
Most content-based search algorithms are computationally intensive, making them
inappropriate for a peer-to-peer environment. In this paper, we discuss a
content-based music retrieval algorithm that can be decomposed and parallelized
efficiently. We present a peer-to-peer architecture for such a system that
makes use of spare resources among subscribers, with protocols that dynamically
redistribute load in order to maximize throughput and minimize inconvenience to
subscribers. Our framework can be extended beyond the music retrieval domain
and adapted to other scenarios where resource pooling is desired, as long as
the underlying algorithm satisfies certain conditions. Keywords: acoustic data, content-based music retrieval, distributed, load balancing,
peer-to-peer, resource pooling | |||
| Towards a multimedia formatting vocabulary | | BIBAK | Full-Text | 384-393 | |
| Jacco van Ossenbruggen; Lynda Hardman; Joost Geurts; Lloyd Rutledge | |||
| Time-based, media-centric Web presentations can be described declaratively
in the XML world through the development of languages such as SMIL. It is
difficult, however, to fully integrate them in a complete document
transformation processing chain. In order to achieve the desired processing of
data-driven, time-based, media-centric presentations, the text-flow based
formatting vocabularies used by style languages such as XSL, CSS and DSSSL need
to be extended. The paper presents a selection of use cases which are used to
derive a list of requirements for a multimedia style and transformation
formatting vocabulary. The boundaries of applicability of existing text-based
formatting models for media-centric transformations are analyzed. The paper
then discusses the advantages and disadvantages of a fully-fledged time-based
multimedia formatting model. Finally, the discussion is illustrated by
describing the key properties of the example multimedia formatting vocabulary
currently implemented in the back-end of our Cuypers multimedia transformation
engine. Keywords: Cuypers, document transformation, formatting objects, hyper-media,
multimedia | |||
| Architecture of a quality based intelligent proxy (QBIX) for MPEG-4 videos | | BIBAK | Full-Text | 394-402 | |
| Peter Schojer; Laszlo Böszörmenyi; Hermann Hellwagner; Bernhard Penz; Stefan Podlipnig | |||
| Due to the increasing availability and use of digital video data on the Web,
video caching will be an important performance factor in the future WWW. We
propose an architecture of a video proxy cache that integrates modern
multimedia and communication standards. In particular, we describe features of the
MPEG-4 and MPEG-7 multimedia standards that can be helpful for a video proxy
cache. QBIX supports real-time adaptation in the compressed and in the
decompressed domain. It uses adaptation to improve the cache replacement
strategies in the proxy, but also to realize media gateway functionality driven
by the clients' terminal capabilities. Keywords: LRU, MPEG-4, MPEG-7, RTP, RTSP, media adaptation, media gateway,
replacement, video caching, video proxy | |||
| Conversation specification: a new approach to design and analysis of e-service composition | | BIBAK | Full-Text | 403-410 | |
| Tevfik Bultan; Xiang Fu; Richard Hull; Jianwen Su | |||
| This paper introduces a framework for modeling and specifying the global
behavior of e-service compositions. Under this framework, peers (individual
e-services) communicate through asynchronous messages and each peer maintains a
queue for incoming messages. A global "watcher" keeps track of messages as they
occur. We propose and study a central notion of a "conversation", which is a
sequence of (classes of) messages observed by the watcher. We consider the case
where the peers are represented by Mealy machines (finite state machines with
input and output). The sets of conversations exhibit unexpected behaviors. For
example, there exists a composite e-service based on Mealy peers whose set of
conversations is not context free (and not regular). (The set of conversations
is always context sensitive.) One cause for this is the queuing of messages; we
introduce an operator "prepone" that simulates queue delays from a global
perspective and show that the set of conversations of each Mealy e-service is
closed under prepone. We illustrate that the global prepone fails to completely
capture the queue delay effects and refine prepone to a "local" version on
conversations seen by individual peers. On the other hand, Mealy
implementations of a composite e-service will always generate conversations
whose "projections" are consistent with individual e-services. We use
projection-join to reflect such situations. However, there are still Mealy
peers whose set of conversations is not the local prepone and projection-join
closure of any regular language. Therefore, we propose conversation
specifications as a formalism to define the conversations allowed by an
e-service composition. We give two technical results concerning the interplay
between the local behaviors of Mealy peers and the global behaviors of their
compositions. One result shows that for each regular language, its local
prepone and projection-join closure corresponds to the set of conversations by
some Mealy peers effectively constructed from it. The second result gives a
condition on the shape of a composition which guarantees that the set of
conversations that can be realized is the local prepone and projection-join
closure of a regular language. Keywords: communicating finite state automata, conversation specification, e-service
composition | |||
| Quality driven web services composition | | BIBAK | Full-Text | 411-421 | |
| Liangzhao Zeng; Boualem Benatallah; Marlon Dumas; Jayant Kalagnanam; Quan Z. Sheng | |||
| The process-driven composition of Web services is emerging as a promising
approach to integrate business applications within and across organizational
boundaries. In this approach, individual Web services are federated into
composite Web services whose business logic is expressed as a process model.
The tasks of this process model are essentially invocations to functionalities
offered by the underlying component services. Usually, several component
services are able to execute a given task, although with different levels of
pricing and quality. In this paper, we advocate that the selection of component
services should be carried out during the execution of a composite service,
rather than at design-time. In addition, this selection should consider
multiple criteria (e.g., price, duration, reliability), and it should take into
account global constraints and preferences set by the user (e.g., budget
constraints). Accordingly, the paper proposes a global planning approach to
optimally select component services during the execution of a composite
service. Service selection is formulated as an optimization problem which can
be solved using efficient linear programming methods. Experimental results show
that this global planning approach outperforms approaches in which the
component services are selected individually for each task in a composite
service. Keywords: QoS, service composition, web services | |||
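The selection problem itself can be illustrated independently of the paper's linear-programming formulation: pick one candidate service per task so that a weighted quality score is maximized subject to a global budget. The brute-force sketch below uses invented candidates, weights, and a made-up budget; it only shows the shape of the problem, not the authors' method.

```python
from itertools import product

# Invented candidate services per task: (name, price, duration, reliability).
candidates = {
    "flight":  [("F1", 300, 2.0, 0.98), ("F2", 220, 3.5, 0.90)],
    "hotel":   [("H1", 150, 1.0, 0.95), ("H2", 90, 1.5, 0.85)],
    "payment": [("P1", 5, 0.2, 0.99), ("P2", 1, 0.6, 0.92)],
}
BUDGET = 450  # global user constraint on total price

def score(plan):
    # Weighted multi-criteria utility: cheaper, faster, more reliable is better.
    price = sum(s[1] for s in plan)
    duration = sum(s[2] for s in plan)
    reliability = 1.0
    for s in plan:
        reliability *= s[3]
    return -0.4 * price / 100 - 0.3 * duration + 10 * reliability

best = None
for plan in product(*candidates.values()):      # one candidate per task
    if sum(s[1] for s in plan) <= BUDGET:       # respect the global budget
        if best is None or score(plan) > score(best):
            best = plan
print([s[0] for s in best])                     # selected candidate per task
```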
| A foundation for tool based mobility support for visually impaired web users | | BIBAK | Full-Text | 422-430 | |
| Yeliz Yesilada; Robert Stevens; Carole Goble | |||
| Users make journeys through the Web. Web travel encompasses the tasks of
orientation and navigation, the environment and the purpose of the journey. The
ease of travel, its mobility, varies from page to page and site to site. For
visually impaired users, in particular, mobility is reduced; the objects that
support travel are inaccessible or missing altogether. Web development tools
need to include support to increase mobility. We present a framework for
finding and classifying travel objects within Web pages. The evaluation carried
out has shown that this framework supports a systematic and consistent method
for assessing travel upon the Web. We propose that such a framework can provide
the foundation for a semi-automated tool for the support of travel upon the
Web. Keywords: mobility, mobility support tool, travel, travel objects, visual impairment | |||
| On deep annotation | | BIBAK | Full-Text | 431-438 | |
| Siegfried Handschuh; Steffen Staab; Raphael Volz | |||
| The success of the Semantic Web crucially depends on the easy creation,
integration and use of semantic data. For this purpose, we consider an
integration scenario that defies core assumptions of current metadata
construction methods. We describe a framework of metadata creation when web
pages are generated from a database and the database owner is cooperatively
participating in the Semantic Web. This leads us to the definition of ontology
mapping rules by manual semantic annotation and the usage of the mapping rules
and of web services for semantic queries. In order to create metadata, the
framework combines the presentation layer with the data description layer -- in
contrast to "conventional" annotation, which remains at the presentation layer.
Therefore, we refer to the framework as deep annotation. We consider deep
annotation as particularly valid because (i) web pages generated from
databases outnumber static web pages, (ii) annotation of web pages may be a
very intuitive way to create semantic data from a database, and (iii) data
from databases should not be materialized as RDF files; it should remain where
it can be handled most efficiently -- in the database. Keywords: annotation, information integration, mapping and merging, metadata, semantic
web, wrapping | |||
| An infrastructure for searching, reusing and evolving distributed ontologies | | BIBAK | Full-Text | 439-448 | |
| A. Maedche; B. Motik; L. Stojanovic; R. Studer; R. Volz | |||
| The vision of the Semantic Web can only be realized through proliferation of
well-known ontologies describing different domains. To enable interoperability
in the Semantic Web, it will be necessary to break these ontologies down into
smaller, well-focused units that may be reused. Currently, three problems arise
in that scenario. Firstly, it is difficult to locate ontologies to be reused,
thus leading to many ontologies modeling the same thing. Secondly, current
tools do not provide means for reusing existing ontologies while building new
ontologies. Finally, ontologies are rarely static, but are being adapted to
changing requirements. Hence, an infrastructure for management of ontology
changes, taking into account dependencies between ontologies is needed. In this
paper we present such an infrastructure addressing the aforementioned problems. Keywords: ontology evolution, ontology registry, ontology reuse | |||
| Application specific data replication for edge services | | BIBAK | Full-Text | 449-460 | |
| Lei Gao; Mike Dahlin; Amol Nayate; Jiandan Zheng; Arun Iyengar | |||
| The emerging edge services architecture promises to improve the availability
and performance of web services by replicating servers at geographically
distributed sites. A key challenge in such systems is data replication and
consistency so that edge server code can manipulate shared data without
incurring the availability and performance penalties that would be incurred by
accessing a traditional centralized database. This paper explores using a
distributed object architecture to build an edge service system for an
e-commerce application, an online bookstore represented by the TPC-W benchmark.
We take advantage of application specific semantics to design distributed
objects to manage a specific subset of shared information using simple and
effective consistency models. Our experimental results show that by slightly
relaxing consistency within individual distributed objects, we can build an
edge service system that is highly available and efficient. For example, in one
experiment we find that our object-based edge server system provides a factor
of five improvement in response time over a traditional centralized cluster
architecture and a factor of nine improvement over an edge service system that
distributes code but retains a centralized database. Keywords: availability, data replication, distributed objects, edge services,
performance, wide area networks (WAN) | |||
| Evaluation of edge caching/offloading for dynamic content delivery | | BIBAK | Full-Text | 461-471 | |
| Chun Yuan; Yu Chen; Zheng Zhang | |||
| As dynamic content becomes increasingly dominant, an important research
topic is how edge resources such as client-side proxies, which are otherwise
underutilized for such content, can be put to use. However, it is unclear what
the best strategy is and what design/deployment tradeoffs lie therein. In this
paper, using one representative e-commerce benchmark, we
report our experience of an extensive investigation of different offloading and
caching options. Our results point out that, while great benefits can be
reached in general, advanced offloading strategies can be overly complex and
even counter-productive. In contrast, simple augmentation at proxies to enable
fragment caching and page composition achieves most of the benefit without
compromising important considerations such as security. Keywords: dynamic content, edge caching, offloading | |||
| Modeling redirection in geographically diverse server sets | | BIBAK | Full-Text | 472-481 | |
| Lisa Amini; Anees Shaikh; Henning Schulzrinne | |||
| Internet server selection mechanisms attempt to optimize, subject to a
variety of constraints, the distribution of client requests to a geographically
and topologically diverse pool of servers. Research on server selection has
thus far focused primarily on techniques for choosing a server from a group
administered by a single entity, such as a content distribution network provider. In
a federated, multi-provider computing system, however, selection must occur
over distributed server sets deployed by the participating providers, without
the benefit of the full information available in the single-provider case.
Intelligent server set selection algorithms will require a model of the
expected performance clients would receive from a candidate server set.
In this paper, we study whether the complex policies and dynamics of intelligent server selection can be effectively modeled in order to predict client performance for server sets. We introduce a novel server set distance metric, and use it in a measurement study of several million server selection transactions to develop simple models of existing server selection schemes. We then evaluate these models in terms of their ability to accurately predict performance for a second, larger set of distributed clients. We show that our models are able to predict performance within 20ms for over 90% of the observed samples. Our analysis demonstrates that although existing deployments use a variety of complex and dynamic server selection criteria, most of which are proprietary, these schemes can be modeled with surprising accuracy. Keywords: content distribution network (CDN), performance, server selection, web
traffic redirection | |||
| Offering open hypermedia services to the WWW: a step-by-step approach for developers | | BIBAK | Full-Text | 482-489 | |
| Nikos Karousos; Ippokratis Pandis; Siegfried Reich; Manolis Tzagarakis | |||
| Hypermedia systems and more specifically open hypermedia systems (OHS)
provide a rich set of implementations of different hypertext flavors such as
navigational hypertext, spatial hypertext or taxonomic hypertext. Additionally,
these systems offer component-based modular architectures and address
interoperability between hypertext domains. Despite multiple efforts to
integrate Web clients, widespread adoption of OHS technology by Web
developers has not taken place. In this paper it is argued that Web Services --
which offer a component model for Web applications -- can be integrated in
OHSs. An architectural integration is proposed, a step-by-step process is
outlined and an example of integration is provided. This approach aims to
benefit both worlds: the Web community with new rich hypermedia
functionality that extends the current navigational hypermedia, and the OHS
community by opening its tools and platforms to the many developer groups of
the Web community. Keywords: babylon system, hypermedia services, open hypermedia systems, web services | |||
| Xspect: bridging open hypermedia and XLink | | BIBAK | Full-Text | 490-499 | |
| Bent G. Christensen; Frank Allan Hansen; Niels Olof Bouvin | |||
| This paper evaluates the XLink format in comparison with other linking
formats. The comparison is based on Xspect, an implementation of XLink. Xspect
handles transformation between an open hypermedia format (OHIF) and XLink, and
the paper discusses this isomorphic transformation and generalises it to
include another open hypermedia format, FOHM. The Xspect system, based on XSLT
and JavaScript, provides users with an interface to browse and merge linkbases.
Xspect supports navigational hypermedia in the form of links inserted on the
fly into Web pages, as well as guided tours presented as SVG. Xspect has two
implementations: one server-side and one running on the client. Both
implementations provide the user with an interface for the creation of
annotations. The main result of the paper is a critique of XLink. XLink is
shown to be a format well suited for navigational hypermedia, but lacking in
more advanced constructs. More problematic are the issues regarding large-scale
use, such as evaluating validity and credibility of linkbases, and ensuring
general support for a format as flexible as XLink. Keywords: FOHM, OHIF, SVG, XLink, XPointer, Xspect, annotations, open hypermedia | |||
| The XML web: a first study | | BIBAK | Full-Text | 500-510 | |
| Laurent Mignet; Denilson Barbosa; Pierangelo Veltri | |||
| Although originally designed for large-scale electronic publishing, XML
plays an increasingly important role in the exchange of data on the Web. In
fact, it is expected that XML will become the lingua franca of the Web,
eventually replacing HTML. Not surprisingly, there has been a great deal of
interest in XML both in industry and in academia. Nevertheless, to date no
comprehensive study of the XML Web (i.e., the subset of the Web made of XML
documents only) or of its contents has been made. This paper is the first
attempt at describing the XML Web and the documents contained in it. Our
results are drawn from a sample of a repository of the publicly available XML
documents on the Web, consisting of about 200,000 documents. Our results show
that, despite its short history, XML already permeates the Web, both in terms
of generic domains and geographically. Also, our results about the contents of
the XML Web provide valuable input for the design of algorithms, tools and
systems that use XML in one form or another. Keywords: XML documents, XML web, statistical analysis, structural properties | |||
| A matrix density based algorithm to hierarchically co-cluster documents and words | | BIBA | Full-Text | 511-518 | |
| Bhushan Mandhani; Sachindra Joshi; Krishna Kummamuru | |||
| This paper proposes an algorithm to hierarchically cluster documents. Each
cluster is actually a cluster of documents and an associated cluster of words,
thus a document-word co-cluster. Note that, the vector model for documents
creates the document-word matrix, of which every co-cluster is a submatrix. One
would intuitively expect a submatrix made up of high values to be a good
document cluster, with the corresponding word cluster containing its most
distinctive features. Our algorithm exploits this intuition: we define a
notion of matrix density, and the algorithm is driven by matrix density
considerations throughout.
The algorithm is a partitional-agglomerative algorithm. The partitioning step involves the identification of dense submatrices so that the respective row sets partition the row set of the complete matrix. The hierarchical agglomerative step involves merging the most "similar" submatrices until we are down to the required number of clusters (if we want a flat clustering) or until we have just the single complete matrix left (if we are interested in a hierarchical arrangement of documents). It also generates apt labels for each cluster or hierarchy node. The similarity measure between clusters that we use here for the merging cleverly uses the fact that the clusters here are co-clusters, and is a key point of difference from existing agglomerative algorithms. We will refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning. We have also evaluated some hierarchies generated by the algorithm. | |||
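As a rough illustration of the density idea (the paper's precise definition of matrix density is not reproduced here), the sketch below takes the density of a co-cluster to be the average entry of the submatrix induced by its document rows and word columns, over an invented document-word matrix.

```python
import numpy as np

def codensity(doc_word, rows, cols):
    """Average value of the submatrix induced by a document cluster (rows)
    and its associated word cluster (cols) of a document-word matrix."""
    return doc_word[np.ix_(rows, cols)].mean()

# Tiny invented document-word matrix (e.g., term-frequency weights).
M = np.array([[3, 2, 0, 0],
              [2, 4, 1, 0],
              [0, 0, 5, 3],
              [0, 1, 4, 2]], dtype=float)

# Co-cluster {docs 0,1} x {words 0,1} is dense; adding word 2 dilutes it.
print(codensity(M, [0, 1], [0, 1]))     # 2.75
print(codensity(M, [0, 1], [0, 1, 2]))  # 2.0
```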
| Mining the peanut gallery: opinion extraction and semantic classification of product reviews | | BIBAK | Full-Text | 519-528 | |
| Kushal Dave; Steve Lawrence; David M. Pennock | |||
| The web contains a wealth of product reviews, but sifting through them is a
daunting task. Ideally, an opinion mining tool would process a set of search
results for a given item, generating a list of product attributes (quality,
features, etc.) and aggregating opinions about each of them (poor, mixed,
good). We begin by identifying the unique properties of this problem and
develop a method for automatically distinguishing between positive and negative
reviews. Our classifier draws on information retrieval techniques for feature
extraction and scoring, and the results for various metrics and heuristics vary
depending on the testing situation. The best methods work as well as or better
than traditional machine learning. When operating on individual sentences
collected from web searches, performance is limited due to noise and ambiguity.
But in the context of a complete web-based tool and aided by a simple method
for grouping sentences into attributes, the results are qualitatively quite
useful. Keywords: document classification, opinion mining | |||
| Mining newsgroups using networks arising from social behavior | | BIBAK | Full-Text | 529-535 | |
| Rakesh Agrawal; Sridhar Rajagopalan; Ramakrishnan Srikant; Yirong Xu | |||
| Recent advances in information retrieval over hyperlinked corpora have
convincingly demonstrated that links carry less noisy information than text. We
investigate the feasibility of applying link-based methods in new application
domains. The specific application we consider is to partition authors into
opposite camps within a given topic in the context of newsgroups. A typical
newsgroup posting consists of one or more quoted lines from another posting
followed by the opinion of the author. This social behavior gives rise to a
network in which the vertices are individuals and the links represent
"responded-to" relationships. An interesting characteristic of many newsgroups
is that people more frequently respond to a message when they disagree than
when they agree. This behavior is in sharp contrast to the WWW link graph,
where linkage is an indicator of agreement or common interest. By analyzing the
graph structure of the responses, we are able to effectively classify people
into opposite camps. In contrast, methods based on statistical analysis of text
yield low accuracy on such datasets because the vocabulary used by the two
sides tends to be largely identical, and many newsgroup postings consist of
relatively few words of text. Keywords: data mining, link analysis, newsgroup, social network, text mining, web
mining | |||
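Since replies tend to signal disagreement, separating authors into camps is closer to a maximum-cut problem than to ordinary community detection. The sketch below applies a generic greedy max-cut-style assignment to an invented reply graph; it is not the authors' algorithm.

```python
# Invented "responded-to" graph: an edge means one author replied to another.
edges = [("alice", "bob"), ("alice", "dave"), ("carol", "bob"),
         ("carol", "dave"), ("erin", "bob"), ("alice", "carol")]
authors = ["alice", "bob", "carol", "dave", "erin"]
neighbors = {a: set() for a in authors}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Greedy max-cut-style assignment: since replies usually signal disagreement,
# place each author in the camp opposite to most already-placed neighbors.
camp = {}
for a in authors:
    placed = [camp[n] for n in neighbors[a] if n in camp]
    camp[a] = 0 if placed.count(1) >= placed.count(0) else 1

print({c: [a for a in authors if camp[a] == c] for c in (0, 1)})
# -> {0: ['alice', 'carol', 'erin'], 1: ['bob', 'dave']}
```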
| Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks | | BIBAK | Full-Text | 536-543 | |
| Wolfgang Nejdl; Martin Wolpers; Wolf Siberski; Christoph Schmitz; Mario Schlosser; Ingo Brunkhorst; Alexander Löser | |||
| RDF-based P2P networks have a number of advantages compared with simpler P2P
networks such as Napster, Gnutella or with approaches based on distributed
indices such as CAN and CHORD. RDF-based P2P networks allow complex and
extendable descriptions of resources instead of fixed and limited ones, and
they provide complex query facilities against these metadata instead of simple
keyword-based searches.
In previous papers, we have described the Edutella infrastructure and different kinds of Edutella peers implementing such an RDF-based P2P network. In this paper we will discuss these RDF-based P2P networks as a specific example of a new type of P2P networks, schema-based P2P networks, and describe the use of super-peer based topologies for these networks. Super-peer based networks can provide better scalability than broadcast based networks, and do provide perfect support for inhomogeneous schema-based networks, which support different metadata schemas and ontologies (crucial for the Semantic Web). Furthermore, as we will show in this paper, they are able to support sophisticated routing and clustering strategies based on the metadata schemas, attributes and ontologies used. Especially helpful in this context is the RDF functionality to uniquely identify schemas, attributes and ontologies. The resulting routing indices can be built using dynamic frequency counting algorithms and support local mediation and transformation rules, and we will sketch some first ideas for implementing these advanced functionalities as well. Keywords: distributed RDF repositories, peer-to-peer, schema-based routing, semantic
web | |||
| On labeling schemes for the semantic web | | BIBA | Full-Text | 544-555 | |
| Vassilis Christophides; Dimitris Plexousakis; Michel Scholl; Sotirios Tourtounis | |||
| This paper focuses on the optimization of the navigation through voluminous subsumption hierarchies of topics employed by Portal Catalogs like Netscape Open Directory (ODP). We advocate for the use of labeling schemes for modeling these hierarchies in order to efficiently answer queries such as subsumption check, descendants, ancestors or nearest common ancestor, which usually require costly transitive closure computations. We first give a qualitative comparison of three main families of schemes, namely bit vector, prefix and interval based schemes. We then show that two labeling schemes are good candidates for an efficient implementation of label querying using standard relational DBMS, namely, the Dewey Prefix scheme [6] and an Interval scheme by Agrawal, Borgida and Jagadish [1]. We compare their storage and query evaluation performance for the 16 ODP hierarchies using the PostgreSQL engine. | |||
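To see why label-based schemes avoid transitive closure computation, consider a generic interval labeling (related to, but not exactly, the schemes compared in the paper): a depth-first traversal assigns each topic an interval covering its subtree, and subsumption reduces to interval containment. The hierarchy below is invented.

```python
# Toy topic hierarchy, parent -> children.
tree = {
    "Top": ["Arts", "Science"],
    "Arts": ["Music", "Film"],
    "Science": ["Biology"],
    "Music": [], "Film": [], "Biology": [],
}

labels = {}
counter = 0
def assign(node):
    """Depth-first traversal assigning each node a half-open interval
    that covers its entire subtree."""
    global counter
    start = counter
    counter += 1
    for child in tree[node]:
        assign(child)
    labels[node] = (start, counter)

assign("Top")

def subsumes(ancestor, descendant):
    a, d = labels[ancestor], labels[descendant]
    return a[0] <= d[0] and d[1] <= a[1]   # containment test, no traversal

print(subsumes("Arts", "Film"))     # True
print(subsumes("Science", "Film"))  # False
```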
| Piazza: data management infrastructure for semantic web applications | | BIBAK | Full-Text | 556-567 | |
| Alon Y. Halevy; Zachary G. Ives; Peter Mork; Igor Tatarinov | |||
| The Semantic Web envisions a World Wide Web in which data is described with
rich semantics and applications can pose complex queries. To this point,
researchers have defined new languages for specifying meanings for concepts and
developed techniques for reasoning about them, using RDF as the data model. To
flourish, the Semantic Web needs to be able to accommodate the huge amounts of
existing data and the applications operating on them. To achieve this, we are
faced with two problems. First, most of the world's data is available not in
RDF but in XML; XML and the applications consuming it rely not only on the
domain structure of the data, but also on its document structure. Hence, to
provide interoperability between such sources, we must map between both their
domain structures and their document structures. Second, data management
practitioners often prefer to exchange data through local point-to-point data
translations, rather than mapping to common mediated schemas or ontologies.
This paper describes the Piazza system, which addresses these challenges. Piazza offers a language for mediating between data sources on the Semantic Web, which maps both the domain structure and document structure. Piazza also enables interoperation of XML data with RDF data that is accompanied by rich OWL ontologies. Mappings in Piazza are provided at a local scale between small sets of nodes, and our query answering algorithm is able to chain sets of mappings together to obtain relevant data from across the Piazza network. We also describe an implemented scenario in Piazza and the lessons we learned from it. Keywords: XML, peer data management systems, semantic web | |||
| On the bursty evolution of blogspace | | BIBA | Full-Text | 568-576 | |
| Ravi Kumar; Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins | |||
| We propose two new tools to address the evolution of hyperlinked corpora. First, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. Second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. We develop these tools in the context of Blogspace, the space of weblogs (or blogs). Our study involves approximately 750K links among 25K blogs. We create a time graph on these blogs by an automatic analysis of their internal time stamps. We then study the evolution of connected component structure and microscopic community structure in this time graph. We show that Blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. This expansion shows no sign of abating, although measures of connectedness must plateau within two years. By randomizing link destinations in Blogspace, but retaining sources and timestamps, we introduce a concept of randomized Blogspace. Herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. Having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. We extend recent work of Kleinberg [11] to discover dense periods of "bursty" intra-community link creation. | |||
| Make it fresh, make it quick: searching a network of personal webservers | | BIBAK | Full-Text | 577-586 | |
| Mayank Bawa; Roberto J., Jr. Bayardo; Sridhar Rajagopalan; Eugene J. Shekita | |||
| Personal webservers have proven to be a popular means of sharing files and
peer collaboration. Unfortunately, the transient availability and rapidly
evolving content on such hosts render centralized, crawl-based search indices
stale and incomplete. To address this problem, we propose YouSearch, a
distributed search application for personal webservers operating within a
shared context (e.g., a corporate intranet). With YouSearch, search results are
always fast, fresh and complete -- properties we show arise from an
architecture that exploits both the extensive distributed resources available
at the peer webservers and a centralized repository of summarized
network state. YouSearch extends the concept of a shared context within web
communities by enabling peers to aggregate into groups and users to search over
specific groups. In this paper, we describe the challenges, design,
implementation and experiences with a successful intranet deployment of
YouSearch. Keywords: P2P, decentralized systems, information communities, intranet search,
peer-to-peer networks, web search | |||
| Engineering and hosting adaptive freshness-sensitive web applications on data centers | | BIBAK | Full-Text | 587-598 | |
| Wen-Syan Li; Oliver Po; Wang-Pin Hsiung; K. Selçuk Candan; Divyakant Agrawal | |||
| Wide-area database replication technologies and the availability of content
delivery networks allow Web applications to be hosted and served from powerful
data centers. This form of application support requires a complete Web
application suite to be distributed along with the database replicas. A major
advantage of this approach is that dynamic content is served from locations
closer to users, leading to reduced network latency and fast response times.
However, this is achieved at the expense of overheads due to (a) invalidation
of cached dynamic content in the edge caches and (b) synchronization of
database replicas in the data center. These have adverse effects on the
freshness of delivered content. In this paper, we propose a freshness-driven
adaptive dynamic content caching, which monitors the system status and adjusts
caching policies to provide content freshness guarantees. The proposed
technique has been intensively evaluated to validate its effectiveness. The
experimental results show that the freshness-driven adaptive dynamic content
caching technique consistently provides good content freshness. Furthermore,
even a Web site that enables dynamic content caching can further benefit from
our solution, which improves content freshness up to 7 times, especially under
heavy user request traffic and long network latency conditions. Our approach
also provides better scalability and significantly reduced response times up to
70% in the experiments. Keywords: database-driven web applications, dynamic content, freshness, response time,
network latency, web acceleration | |||
| Evaluating a new approach to strong web cache consistency with snapshots of collected content | | BIBAK | Full-Text | 599-608 | |
| Mikhail Mikhailov; Craig E. Wills | |||
| The problem of Web cache consistency continues to be an important one.
Current Web caches use heuristic-based policies for determining the freshness
of cached objects, often forcing content providers to unnecessarily mark their
content as uncacheable simply to retain control over it. Server-driven
invalidation has been proposed as a mechanism for providing strong cache
consistency for Web objects, but it requires servers to maintain per-client
state even for infrequently changing objects. We propose an alternative
approach to strong cache consistency, called MONARCH, which does not require
servers to maintain per-client state. In this work we focus on a new approach
for evaluation of MONARCH in comparison with current practice and other cache
consistency policies. This approach uses snapshots of content collected from
real Web sites as input to a simulator. Results of the evaluation show MONARCH
generates little more request traffic than an optimal cache coherency policy. Keywords: cache consistency, change characteristics, collected content, object
composition, object relationships, server invalidation, web caching | |||
| Scalable techniques for memory-efficient CDN simulations | | BIBAK | Full-Text | 609-618 | |
| Purushottam Kulkarni; Prashant Shenoy; Weibo Gong | |||
| Since CDN simulations are known to be highly memory-intensive, in this
paper, we argue the need for reducing the memory requirements of such
simulations. We propose a novel memory-efficient data structure that stores
cache state for a small subset of popular objects accurately and uses
approximations for storing the state for the remaining objects. Since popular
objects receive a large fraction of the requests while less frequently accessed
objects consume much of the memory space, this approach yields large memory
savings and reduces errors. We use bloom filters to store approximate state and
show that careful choice of parameters can substantially reduce the probability
of errors due to approximations. We implement our techniques into a user
library for constructing proxy caches in CDN simulators. Our experimental
results show up to an order of magnitude reduction in memory requirements of
CDN simulations, while incurring a 5-10% error. Keywords: approximate data structures, content distribution networks, simulation, web
proxy cache | |||
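For reference, a Bloom filter stores set membership in a fixed number of bits, with some false positives but no false negatives; the minimal sketch below is generic and uses arbitrary parameters and an invented URL, rather than reproducing the authors' data structure.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: may report false positives, never false negatives."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(self.m // 8)

    def _positions(self, item):
        # Derive k independent bit positions from salted SHA-1 digests.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Approximate cache state for less popular objects; popular ones kept exactly.
approx_state = BloomFilter()
approx_state.add("http://example.com/obj/42")
print("http://example.com/obj/42" in approx_state)  # True
print("http://example.com/obj/43" in approx_state)  # almost certainly False
```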
| Value-based web caching | | BIBAK | Full-Text | 619-628 | |
| Sean C. Rhea; Kevin Liang; Eric Brewer | |||
| Despite traditional web caching techniques, redundant data is often
transferred over HTTP links. These redundant transfers result from both
resource modification and aliasing. Resource modification causes the data
represented by a single URI to change; often, in transferring the new data,
some old data is retransmitted. Aliasing, in contrast, occurs when the same
data is named by multiple URIs, often in the context of dynamic or advertising
content. Traditional web caching techniques index data by its name and thus
often fail to recognize and take advantage of aliasing.
Keywords: HTTP, WWW, aliasing, caching, duplicate suppression, dynamic content,
hypertext transfer protocol, privacy, proxy, redundant transfers, resource
modification, scalability, world wide web | |||
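One standard way to exploit aliasing is to index cached data by a digest of its content rather than by URI, so that identical payloads served under different names are recognized. The sketch below is a generic content-addressed cache with invented data, not the specific protocol described in the paper.

```python
import hashlib

class ValueIndexedCache:
    """Cache payloads keyed by a digest of their content, so the same bytes
    served under different URIs (aliasing) are stored and transferred once."""
    def __init__(self):
        self.blocks = {}                      # digest -> payload bytes

    def digest(self, payload: bytes) -> str:
        return hashlib.sha256(payload).hexdigest()

    def store(self, payload: bytes) -> str:
        d = self.digest(payload)
        self.blocks.setdefault(d, payload)
        return d                              # a peer could send just this digest

    def has(self, digest: str) -> bool:
        return digest in self.blocks

cache = ValueIndexedCache()
ad_banner = b"<img src='banner.gif'>"
cache.store(ad_banner)                        # first URI naming this content
print(cache.has(cache.digest(ad_banner)))     # True: a second URI carrying the
                                              # same bytes needs no retransfer
```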
| An XPath-based preference language for P3P | | BIBAK | Full-Text | 629-639 | |
| Rakesh Agrawal; Jerry Kiernan; Ramakrishnan Srikant; Yirong Xu | |||
| The Platform for Privacy Preferences (P3P) is the most significant effort
currently underway to enable web users to gain control over their private
information. The designers of P3P simultaneously designed a preference language
called APPEL to allow users to express their privacy preferences, thus enabling
automatic matching of privacy preferences against P3P policies. Unfortunately
subtle interactions between P3P and APPEL result in serious problems when using
APPEL: Users can only directly specify what is unacceptable in a policy, not
what is acceptable; simple preferences are hard to express; and writing APPEL
preferences is error prone. We show that these problems follow from a
fundamental design choice made by APPEL, and cannot be solved without
completely redesigning the language. Therefore we explore alternatives to APPEL
that can overcome these problems. In particular, we show that XPath serves
quite nicely as a preference language and solves all the above problems. We
identify the minimal subset of XPath that is needed, thus allowing matching
programs to potentially use a smaller memory footprint. We also give an APPEL
to XPath translator that shows that XPath is as expressive as APPEL. Keywords: APPEL, P3P, XPath, XPref, hippocratic databases, preference, privacy-aware
data management | |||
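As a toy illustration of XPath serving as the preference language (heavily simplified: no P3P namespaces and none of APPEL's semantics), a policy can be rejected if any listed XPath expression matches it. The policy fragment and expressions below are invented.

```python
import xml.etree.ElementTree as ET

# A drastically simplified, namespace-free stand-in for a P3P policy.
policy = ET.fromstring("""
<POLICY>
  <STATEMENT>
    <PURPOSE><admin/><develop/></PURPOSE>
    <DATA-GROUP><DATA ref="#dynamic.clickstream"/></DATA-GROUP>
  </STATEMENT>
  <STATEMENT>
    <PURPOSE><telemarketing/></PURPOSE>
    <DATA-GROUP><DATA ref="#user.home-info.telecom"/></DATA-GROUP>
  </STATEMENT>
</POLICY>
""")

# Preferences as XPath expressions over the policy; a match means "unacceptable".
unacceptable = [
    ".//STATEMENT/PURPOSE/telemarketing",      # any telemarketing purpose
    ".//DATA[@ref='#user.financial']",         # any collection of financial data
]

violations = [xp for xp in unacceptable if policy.findall(xp)]
print("reject" if violations else "accept", violations)
```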
| The Eigentrust algorithm for reputation management in P2P networks | | BIBAK | Full-Text | 640-651 | |
| Sepandar D. Kamvar; Mario T. Schlosser; Hector Garcia-Molina | |||
| Peer-to-peer file-sharing networks are currently receiving much attention as
a means of sharing and distributing information. However, as recent experience
shows, the anonymous, open nature of these networks offers an almost ideal
environment for the spread of self-replicating inauthentic files.
We describe an algorithm to decrease the number of downloads of inauthentic files in a peer-to-peer file-sharing network that assigns each peer a unique global trust value, based on the peer's history of uploads. We present a distributed and secure method to compute global trust values, based on Power iteration. By having peers use these global trust values to choose the peers from whom they download, the network effectively identifies malicious peers and isolates them from the network. In simulations, this reputation system, called EigenTrust, has been shown to significantly decrease the number of inauthentic files on the network, even under a variety of conditions where malicious peers cooperate in an attempt to deliberately subvert the system. Keywords: distributed eigenvector computation, peer-to-peer, reputation | |||
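The global trust vector is, in essence, the principal left eigenvector of the matrix of normalized local trust values, and it can be computed by power iteration. The sketch below is a centralized toy version over an invented trust matrix; the paper's contribution includes a distributed, secure way of doing this, which is not shown.

```python
import numpy as np

# Invented local trust values: C[i, j] = how much peer i trusts peer j,
# normalized so that each row sums to 1.
C = np.array([
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.5],
    [0.9, 0.1, 0.0],
])
C = C / C.sum(axis=1, keepdims=True)

t = np.full(3, 1 / 3)              # start from uniform trust
for _ in range(50):                # power iteration: t <- C^T t
    t_next = C.T @ t
    t_next /= t_next.sum()
    if np.allclose(t_next, t):
        break
    t = t_next
print(t)                           # global trust values, one per peer
```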
| Similarity measure and instance selection for collaborative filtering | | BIBAK | Full-Text | 652-658 | |
| Chun Zeng; Chun-Xiao Xing; Li-Zhu Zhou | |||
| Collaborative filtering has been very successful in both research and
applications such as information filtering and E-commerce. The k-Nearest
Neighbor (KNN) method is a popular way for its realization. Its key technique
is to find k nearest neighbors for a given user to predict his interests.
However, this method suffers from two fundamental problems: sparsity and
scalability. In this paper, we present our solutions for these two problems. We
adopt two techniques: a matrix conversion method for similarity measure and an
instance selection method. We then present an improved collaborative
filtering algorithm based on these two methods. Compared with existing
collaborative filtering algorithms, our method achieves satisfactory accuracy and
performance. Keywords: collaborative filtering, instance selection, similarity measure | |||
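For context, a plain user-based k-nearest-neighbor predictor with cosine similarity looks roughly like the sketch below (invented ratings); the paper's matrix conversion and instance selection steps are not shown.

```python
import numpy as np

# Invented user-item rating matrix (0 = unrated).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)                     # compare only co-rated items
    if not mask.any():
        return 0.0
    return float(u[mask] @ v[mask] /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

def predict(user, item, k=2):
    sims = [(cosine(R[user], R[other]), other)
            for other in range(len(R)) if other != user and R[other, item] > 0]
    top = sorted(sims, reverse=True)[:k]         # the k most similar raters
    if not top or sum(s for s, _ in top) == 0:
        return 0.0
    return sum(s * R[o, item] for s, o in top) / sum(s for s, _ in top)

print(predict(user=0, item=2))                   # predicted rating for user 0
```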
| Monitoring the dynamic web to respond to continuous queries | | BIBAK | Full-Text | 659-668 | |
| Sandeep Pandey; Krithi Ramamritham; Soumen Chakrabarti | |||
| Continuous queries are queries for which responses given to users must be
continuously updated, as the sources of interest get updated. Such queries
occur, for instance, during on-line decision making, e.g., traffic flow
control, weather monitoring, etc. The problem of keeping the responses current
reduces to the problem of deciding how often to visit a source to determine if
and how it has been modified, in order to update earlier responses accordingly.
On the surface, this seems to be similar to the crawling problem since crawlers
attempt to keep indexes up-to-date as pages change and users pose search
queries. We show that this is not the case, due both to the inherent
differences between the nature of the two problems and to the performance
metric. We propose, develop and evaluate a novel multi-phase Continuous
Adaptive Monitoring (CAM) solution to the problem of maintaining the currency
of query results. The important phases are as follows. In the tracking phase,
changes to an initially identified set of relevant pages are tracked. From the
observed change characteristics of these pages, a probabilistic model of their
change behavior is formulated and weights are assigned to pages to denote
their importance for the current queries. In the next phase, the resource
allocation phase, the resources needed to continuously monitor these pages for
changes are allocated based on these statistics. Given these
resource allocations, the scheduling phase produces an optimal achievable
schedule for the monitoring tasks. An experimental evaluation of our approach
compared to prior approaches for crawling dynamic web pages shows the
effectiveness of CAM for monitoring dynamic changes. For example, by monitoring
just 5% of the page changes, CAM is able to return 90% of the changed
information to the users. The experiments also produce some interesting
observations pertaining to the differences between the two problems of crawling
-- to build an index -- and the problem of change tracking -- to respond to
continuous queries. Keywords: allocation policies, continuous queries | |||
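A simplified illustration of the resource allocation idea (not the paper's actual policy): distribute a fixed budget of page fetches in proportion to each page's estimated change probability weighted by its importance to the active queries. All page names and statistics below are invented.

```python
# Invented page statistics: estimated probability of change per period and
# importance weight for the currently registered continuous queries.
pages = {
    "weather/city-a": {"p_change": 0.9, "weight": 0.8},
    "traffic/route-7": {"p_change": 0.6, "weight": 1.0},
    "press/agency-x": {"p_change": 0.2, "weight": 0.3},
}
BUDGET = 20  # total page fetches available per monitoring period

scores = {url: s["p_change"] * s["weight"] for url, s in pages.items()}
total = sum(scores.values())
# Proportional allocation; rounding may leave the total slightly off budget.
allocation = {url: round(BUDGET * score / total) for url, score in scores.items()}
print(allocation)
# {'weather/city-a': 10, 'traffic/route-7': 9, 'press/agency-x': 1}
```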
| A large-scale study of the evolution of web pages | | BIBAK | Full-Text | 669-678 | |
| Dennis Fetterly; Mark Manasse; Marc Najork; Janet Wiener | |||
| How fast does the web change? Does most of the content remain unchanged once
it has been authored, or are the documents continuously updated? Do pages
change a little or a lot? Is the extent of change correlated to any other
property of the page? All of these questions are of interest to those who mine
the web, including all the popular search engines, but few studies have been
performed to date to answer them.
One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages. Keywords: degree of change, rate of change, web characterization, web evolution, web
pages | |||
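A checksum only says whether a page changed; measuring how much it changed calls for word-level features. The sketch below uses word-set Jaccard distance between two downloads, which is only loosely analogous to the feature vectors recorded in the study; the page contents are invented.

```python
import re

def words(html_text):
    return set(re.findall(r"[a-z0-9]+", html_text.lower()))

def degree_of_change(old, new):
    """1.0 = completely different word content, 0.0 = identical word content."""
    w_old, w_new = words(old), words(new)
    if not w_old and not w_new:
        return 0.0
    return 1.0 - len(w_old & w_new) / len(w_old | w_new)

week1 = "<html><body>Local election results announced today</body></html>"
week2 = "<html><body>Local election results disputed, recount announced</body></html>"
print(round(degree_of_change(week1, week2), 2))   # ~0.33
```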
| Efficient URL caching for world wide web crawling | | BIBAK | Full-Text | 679-689 | |
| Andrei Z. Broder; Marc Najork; Janet L. Wiener | |||
| Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a
page (b) Parse it to extract all linked URLs (c) For all the URLs not seen
before, repeat (a)-(c). However, the size of the web (estimated at over 4
billion pages) and its rate of change (estimated at 7% per week) move this plan
from a trivial programming exercise to a serious algorithmic and system design
challenge. Indeed, these two factors alone imply that for a reasonably fresh
and complete crawl of the web, step (a) must be executed about a thousand times
per second, and thus the membership test (c) must be done well over ten
thousand times per second against a set too large to store in main memory. This
requires a distributed architecture, which further complicates the membership
test.
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective -- in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon. Keywords: URL caching, caching, crawling, distributed crawlers, web crawlers, web
graph models | |||
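The membership test benefits from keeping recently seen URLs in memory. The sketch below is a generic LRU cache of seen URLs, one of the practical policies the paper compares, with the authoritative seen-URL store stubbed out as an in-memory set for illustration.

```python
from collections import OrderedDict

class LRUSeenCache:
    """Cache the most recently seen URLs; misses fall through to the full
    (disk-based or distributed) seen-URL store, stubbed here as a Python set."""
    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.full_store = set()              # stand-in for the authoritative store

    def seen_before(self, url):
        if url in self.cache:                # hit: refresh recency
            self.cache.move_to_end(url)
            return True
        hit = url in self.full_store         # miss: consult the expensive store
        self.full_store.add(url)
        self.cache[url] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used URL
        return hit

frontier = LRUSeenCache(capacity=3)
for u in ["a", "b", "a", "c", "d", "b"]:
    print(u, frontier.seen_before(u))
```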
| ρ-Queries: enabling querying for semantic associations on the semantic web | | BIBAK | Full-Text | 690-699 | |
| Kemafor Anyanwu; Amit Sheth | |||
| This paper presents the notion of Semantic Associations as complex
relationships between resource entities. These relationships capture both a
connectivity of entities as well as similarity of entities based on a specific
notion of similarity called ρ-isomorphism. It formalizes these notions for
the RDF data model, by introducing a notion of a Property Sequence as a type.
In the context of a graph model such as that for RDF, Semantic Associations
amount to certain specific graph signatures. Specifically, they refer to
sequences (i.e., directed paths) between entities, here called Property
Sequences, to networks of Property Sequences (i.e., undirected paths), or to
subgraphs of ρ-isomorphic Property Sequences.
The ability to query about the existence of such relationships is fundamental to tasks in analytical domains such as national security and business intelligence, where tasks often focus on finding complex yet meaningful and obscured relationships between entities. However, support for such queries is lacking in contemporary query systems, including those for RDF. This paper discusses how querying for Semantic Associations might be enabled on the Semantic Web through the use of an operator ρ. It also discusses two approaches for processing ρ-queries on available persistent RDF stores and on memory-resident RDF data graphs, thereby building on current RDF query languages. Keywords: RDF, complex data relationships, graph traversals, semantic associations,
semantic web querying | |||
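One small ingredient of ρ-query evaluation, finding the directed property paths that connect two entities, can be sketched as a bounded-depth search over RDF-style triples; the triples below are toy data, and the paper's actual processing strategies are not shown.

```python
# Toy RDF-style triples: (subject, property, object).
triples = [
    ("personA", "worksFor", "orgX"),
    ("orgX", "locatedIn", "cityY"),
    ("personB", "livesIn", "cityY"),
    ("personA", "knows", "personC"),
]
out_edges = {}
for s, p, o in triples:
    out_edges.setdefault(s, []).append((p, o))

def property_sequences(start, end, max_len=3, path=()):
    """Enumerate directed paths (sequences of properties) from start to end."""
    if start == end and path:
        yield path
        return
    if len(path) >= max_len:
        return
    for prop, nxt in out_edges.get(start, []):
        yield from property_sequences(nxt, end, max_len, path + (prop,))

print(list(property_sequences("personA", "cityY")))
# [('worksFor', 'locatedIn')]
```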
| Semantic search | | BIBAK | Full-Text | 700-709 | |
| R. Guha; Rob McCool; Eric Miller | |||
| Activities such as Web Services and the Semantic Web are working to create a
web of distributed machine understandable data. In this paper we present an
application called 'Semantic Search' which is built on these supporting
technologies and is designed to improve traditional web searching. We provide
an overview of TAP, the application framework upon which the Semantic Search is
built. We describe two implemented Semantic Search systems which, based on the
denotation of the search query, augment traditional search results with
relevant data aggregated from distributed sources. We also discuss some general
issues related to searching and the Semantic Web and outline how an
understanding of the semantics of the search terms can be used to provide
better results. Keywords: search, semantic web | |||
| Agent-based semantic web services | | BIBAK | Full-Text | 710-717 | |
| Nicholas Gibbins; Stephen Harris; Nigel Shadbolt | |||
| The Web Services world consists of loosely-coupled distributed systems which
adapt to ad-hoc changes by the use of service descriptions that enable
opportunistic service discovery. At present, these service descriptions are
semantically impoverished, being concerned with describing the functional
signature of the services rather than characterising their meaning. In the
Semantic Web community, the DAML Services effort attempts to rectify this by
providing a more expressive way of describing Web services using ontologies.
However, this approach does not separate the domain-neutral communicative
intent of a message (considered in terms of speech acts) from its
domain-specific content, unlike similar developments from the multi-agent
systems community.
In this paper, we describe our experiences of designing and building an ontologically motivated Web Services system for situational awareness and information triage in a simulated humanitarian aid scenario. In particular, we discuss the merits of using techniques from the multi-agent systems community for separating the intentional force of messages from their content, and the implementation of these techniques within the DAML Services model. Keywords: DAML-S, agent communication languages, ontologies, semantic web, web
services | |||
| A framework for coordinated multi-modal browsing with multiple clients | | BIBAK | Full-Text | 718-726 | |
| Alistair Coles; Eric Deliot; Tom Melamed; Kevin Lansard | |||
| As users acquire or gain access to an increasingly diverse range of web
access clients, web applications are adapting their user interfaces to support
multiple modalities on multiple client types. User experiences can be enhanced
by clients with differing capabilities combining to provide a distributed user
interface to applications. Indeed, users will be frustrated if their
interaction with applications is limited to one client at a time.
This paper discusses the requirements for coordinating web interaction across an aggregation of clients. We present a framework for multi-device browsing that provides both coordinated navigation between web resources and coordinated interaction between variants, or representations, of those resources once instantiated in the clients. The framework protects the application from some of the complexities of client aggregation. We show how a small number of enhancements to the XForms and XML Events vocabularies can facilitate coordination between clients and provide an appropriate level of control to applications. We also describe a novel proxy which consolidates HTTP requests from aggregations of clients and reduces the burden that multi-client browsing places on the application. Keywords: multi-modal browsing, web proxy | |||
| A comparative web browser (CWB) for browsing and comparing web pages | | BIBAK | Full-Text | 727-735 | |
| Akiyo Nadamoto; Katsumi Tanaka | |||
| In this paper, we propose a new type of Web browser, called the Comparative
Web Browser (CWB), which concurrently presents multiple Web pages in a way that
enables the content of the Web pages to be automatically synchronized. The
ability to view multiple Web pages at one time is useful when we wish to make a
comparison on the Web, such as when we compare similar products or news
articles from different newspapers. The CWB is characterized by (1) automatic
content-based retrieval of passages from another Web page based on a passage of
the Web page the user is reading, and (2) automatic transformation of a user's
behavior (scrolling, clicking, or moving backward or forward) on a Web page
into a series of behaviors on the other Web pages. The CWB tries to
concurrently present "similar" passages from different Web pages, and for this
purpose our CWB automatically navigates Web pages that contain passages similar
to those of the initial Web page. Furthermore, we propose an enhancement to the
CWB, which enables it to use linkage information to find related documents
based on link structure. Keywords: comparison, content synchronization, passage retrieval, web browser | |||
| Comparing link marker visualization techniques: changes in reading behavior | | BIBAK | Full-Text | 736-745 | |
| Hartmut Obendorf; Harald Weinreich | |||
| Links are one of the most important means for navigation in the World Wide
Web. However, the visualization of and the interaction with Web links have been
scarcely explored, although links have severe implications for the appearance
and usability of Web pages and of the World Wide Web as such.
This paper presents two studies giving first insights of the effects of link visualization techniques on reading habits and performance. The first user study compares different highlighting techniques for link markers and evaluates their effect on reading performance and user acceptance. The second study examines links-on-demand, links that appear when pressing a dedicated key, and discusses their possible effects on reading and browsing habits. The findings of the conducted studies imply that the standard appearance of link markers has seriously underestimated effects on the usability of Web pages. They can significantly reduce the readability of the text, and alternatives should be carefully considered for the design of future Web browsers. Keywords: hypermedia, input interaction technologies, link marking techniques,
links-on-demand, usability survey | |||