
Proceedings of the 2003 International Conference on the World Wide Web

Fullname: Proceedings of the 12th International Conference on World Wide Web
Editors: Gusztáv Hencsey; Bebo White; Yih-Farn Robin Chen; László Kovács; Steve Lawrence
Location: Budapest, Hungary
Dates: 2003-May-20 to 2003-May-24
Publisher: ACM
Standard No: ISBN 1-58113-680-3
Papers: 77
Pages: 752
Links: Conference Home Page
  1. Information Retrieval
  2. Foundations of the semantic web
  3. Mobility & wireless access
  4. Information retrieval 2
  5. Provisioning
  6. Data integrity
  7. Establishing the semantic web 1
  8. Adapting content to mobile devices
  9. Writing the web
  10. Link-based ranking 1
  11. Applications and architecture
  12. E-commerce
  13. Link-based ranking 2
  14. Multimedia
  15. Web engineering
  16. Establishing the semantic web 2
  17. Consistency and replication
  18. Open hypermedia and the web
  19. Data mining
  20. Scaling up the semantic web
  21. Dynamic services and analysis
  22. CDNs and caching
  23. Protocols
  24. Web crawling and measurement
  25. Using the semantic web
  26. Improving the browsing experience

Information Retrieval

Query-free news search BIBAKFull-Text 1-10
  Monika Henzinger; Bay-Wei Chang; Brian Milch; Sergey Brin
Many daily activities present information in the form of a stream of text, and often people can benefit from additional information on the topic discussed. TV broadcast news can be treated as one such stream of text; in this paper we discuss finding news articles on the web that are relevant to news currently being broadcast.
   We evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. We also evaluated several postprocessing techniques for improving the precision, including reranking using additional terms, reranking by document similarity, and filtering on document similarity. For the best algorithm, 84%-91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. In addition, a relevant article was found for at least 70% of the topics.
Keywords: query-free search, web information retrieval
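A minimal Python sketch of the tf-idf style query construction this abstract evaluates; the function name, inputs and exact weighting are illustrative assumptions, not the authors' algorithm:

    import math
    import re
    from collections import Counter

    def build_query(transcript_window, doc_freq, num_docs, query_len=3):
        """Score terms in the recent broadcast text by tf*idf and keep the top few.

        transcript_window: recent broadcast transcript (a string)
        doc_freq: term -> number of web documents containing the term
        num_docs: size of the web document collection
        """
        terms = re.findall(r"[a-z]+", transcript_window.lower())
        tf = Counter(terms)

        def idf(t):
            return math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))

        ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
        return ranked[:query_len]

    # The few highest-weighted terms form the query sent to a news search engine;
    # the paper additionally studies stemming, compounds, history and postprocessing.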
Improving pseudo-relevance feedback in web information retrieval using web page segmentation BIBAKFull-Text 11-18
  Shipeng Yu; Deng Cai; Ji-Rong Wen; Wei-Ying Ma
In contrast to traditional document retrieval, a web page as a whole is not a good information unit to search because it often contains multiple topics and a lot of irrelevant information from the navigation, decoration, and interaction parts of the page. In this paper, we propose a VIsion-based Page Segmentation (VIPS) algorithm to detect the semantic content structure in a web page. Compared with a simple DOM-based segmentation method, our page segmentation scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. By using our VIPS algorithm to assist the selection of query expansion terms in pseudo-relevance feedback in web information retrieval, we achieve a 27% performance improvement on the Web Track dataset.
Keywords: page segmentation, query expansion, relevance feedback, web information retrieval
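A small sketch of pseudo-relevance feedback over content blocks rather than whole pages, in the spirit of this abstract; the block texts would come from the VIPS segmenter (here they are plain strings), and the selection heuristic is an illustrative assumption:

    from collections import Counter

    def expand_query(query_terms, top_blocks, k=10):
        """Append the k most frequent non-query terms found in the content
        blocks extracted from the top-ranked pages."""
        counts = Counter()
        query = set(query_terms)
        for block in top_blocks:
            counts.update(t for t in block.lower().split() if t not in query)
        return list(query_terms) + [t for t, _ in counts.most_common(k)]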
Predictive caching and prefetching of query results in search engines BIBAKFull-Text 19-28
  Ronny Lempel; Shlomo Moran
We study the caching of query result pages in Web search engines. Popular search engines receive millions of queries per day, and efficient policies for caching query results may enable them to lower their response time and reduce their hardware requirements. We present PDC (probability driven cache), a novel scheme tailored for caching search results, that is based on a probabilistic model of search engine users. We then use a trace of over seven million queries submitted to the search engine AltaVista to evaluate PDC, as well as traditional LRU and SLRU based caching schemes. The trace driven simulations show that PDC outperforms the other policies. We also examine the prefetching of search results, and demonstrate that prefetching can increase cache hit ratios by 50% for large caches, and can double the hit ratios of small caches. When integrating prefetching into PDC, we attain hit ratios of over 0.53.
Keywords: caching, query processing and optimization
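The prefetching idea evaluated in this paper can be illustrated with a toy query-result cache; the sketch below uses plain LRU eviction, whereas PDC itself ranks entries with a probabilistic model of user behavior (class and method names are ours):

    from collections import OrderedDict

    class ResultCache:
        """LRU cache of (query, page) -> results that prefetches the next page."""

        def __init__(self, capacity, fetch):
            self.capacity = capacity
            self.fetch = fetch            # function (query, page) -> result list
            self.store = OrderedDict()

        def _put(self, key, value):
            self.store[key] = value
            self.store.move_to_end(key)
            while len(self.store) > self.capacity:
                self.store.popitem(last=False)      # evict least recently used

        def get(self, query, page):
            key = (query, page)
            hit = key in self.store
            if hit:
                self.store.move_to_end(key)
            else:
                self._put(key, self.fetch(query, page))
            value = self.store[key]
            # Prefetch the following result page, anticipating "next page" requests.
            nxt = (query, page + 1)
            if nxt not in self.store:
                self._put(nxt, self.fetch(query, page + 1))
            return value, hit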

Foundations of the semantic web

Model-theoretic semantics for the web BIBAKFull-Text 29-38
  James Farrugia
Model-theoretic semantics is a formal account of the interpretations of legitimate expressions of a language. It is increasingly being used to provide Web markup languages with well-defined semantics. But a discussion of its roles and limitations for the Semantic Web has not yet received a coherent and detailed treatment. This paper takes the first steps towards such a treatment. The major result is an introductory explication of key ideas that are usually only implicit in existing accounts of semantics for the Web. References to more detailed accounts of these ideas are also provided. The benefit of this explication is increased awareness among Web users of some important issues inherent in using model-theoretic semantics for Web markup languages.
Keywords: model-theoretic semantics, semantics, web markup languages
Three theses of representation in the semantic web BIBAKFull-Text 39-47
  Ian Horrocks; Peter F. Patel-Schneider
The Semantic Web is vitally dependent on a formal meaning for the constructs of its languages. For Semantic Web languages to work well together their formal meanings must employ a common view (or thesis) of representation, otherwise it will not be possible to reconcile documents written in different languages. The thesis of representation underlying RDF and RDFS is particularly troublesome in this regard, as it has several unusual aspects, both semantic and syntactic. A more-standard thesis of representation would result in the ability to reuse existing results and tools in the Semantic Web.
Keywords: model-theoretic semantics, representation, semantic web
Description logic programs: combining logic programs with description logic BIBAKFull-Text 48-57
  Benjamin N. Grosof; Ian Horrocks; Raphael Volz; Stefan Decker
We show how to interoperate, semantically and inferentially, between the leading Semantic Web approaches to rules (RuleML Logic Programs) and ontologies (OWL/DAML+OIL Description Logic) via analyzing their expressive intersection. To do so, we define a new intermediate knowledge representation (KR) contained within this intersection: Description Logic Programs (DLP), and the closely related Description Horn Logic (DHL) which is an expressive fragment of first-order logic (FOL). DLP provides a significant degree of expressiveness, substantially greater than the RDF-Schema fragment of Description Logic. We show how to perform DLP-fusion: the bidirectional translation of premises and inferences (including typical kinds of queries) from the DLP fragment of DL to LP, and vice versa from the DLP fragment of LP to DL. In particular, this translation enables one to "build rules on top of ontologies": it enables the rule KR to have access to DL ontological definitions for vocabulary primitives (e.g., predicates and individual constants) used by the rules. Conversely, the DLP-fusion technique likewise enables one to "build ontologies on top of rules": it enables ontological definitions to be supplemented by rules, or imported into DL from rules. It also enables available efficient LP inferencing algorithms/implementations to be exploited for reasoning over large-scale DL ontologies.
Keywords: RDF, XML, description logic, inferencing, information integration, interoperability, knowledge representation, logic programs, model-theoretic semantics, ontologies, rules, semantic web, translation

Mobility & wireless access

Dynamic service reconfiguration for wireless web access BIBAKFull-Text 58-67
  Siu-Nam Chuang; Alvin T. S. Chan; Jiannong Cao; Ronnie Cheung
This paper describes a dynamic service reconfiguration model where the proxy is composed of a chain of service objects called mobilets (pronounced as mo-be-lets), which can be deployed onto the network actively. This model offers flexibility because the chain of mobilets can be dynamically reconfigured to adapt to the vigorous changes in the characteristics of the wireless environment, without interrupting the service provision for other mobile nodes. Furthermore, mobilets can also be migrated to a new proxy server when the mobile node moves to a different network domain. We have realized the dynamic service reconfiguration model by crafting its design into a programmable infrastructure that forms the baseline architecture of the WebPADS (short for Web Proxy for Actively Deployable Services) system.
Keywords: active services, dynamic service reconfiguration, wireless environment adaptation, wireless web access
Web browsing performance of wireless thin-client computing BIBAKFull-Text 68-79
  S. Jae Yang; Jason Nieh; Shilpa Krishnappa; Aparna Mohla; Mahdi Sajjadpour
Web applications are becoming increasingly popular for mobile wireless systems. However, wireless networks can have high packet loss rates, which can degrade web browsing performance on wireless systems. An alternative approach is wireless thin-client computing, in which the web browser runs on a remote thin server with a more reliable wired connection to the Internet. A mobile client then maintains a connection to the thin server to receive display updates over the lossy wireless network. To assess the viability of this thin-client approach, we compare the web browsing performance of thin clients against fat clients that run the web browser locally in lossy wireless networks. Our results show that thin clients can operate quite effectively over lossy networks. Compared to fat clients running web browsers locally, our results show surprisingly that thin clients can be faster and more resilient on web applications over lossy wireless LANs despite having to send more data over the network. We characterize and analyze different design choices in various thin-client systems and explain why these approaches can yield superior web browsing performance in lossy wireless networks.
Keywords: thin-client computing, web performance, wireless and mobility
Sensor-enhanced mobile web clients: an XForms approach BIBAKFull-Text 80-89
  John Barton; Tim Kindberg; Hui Dai; Nissanka B. Priyantha; Fahd Al-bin-ali
This paper describes methods for service selection and service access for mobile, sensor-enhanced web clients such as wireless cameras or wireless PDAs with sensor devices attached. The clients announce their data-creating capabilities in "Produce" headers sent to servers; servers respond with forms that match these capabilities. Clients fill in these forms with sensor data as well as text or file data. The resultant system enables clients to access dynamically discovered services spontaneously, as their users engage in everyday nomadic activities.
Keywords: MIME types, browsers, forms, mobile computing, sensors, ubiquitous computing

Information retrieval 2

Text joins in an RDBMS for web data integration BIBAKFull-Text 90-101
  Luis Gravano; Panagiotis G. Ipeirotis; Nick Koudas; Divesh Srivastava
The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
Keywords: approximate text matching, data cleaning, text indexing
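The join being approximated can be stated exactly in a few lines; the paper's contribution is computing it approximately, via sampling, inside an unmodified RDBMS with plain SQL. A naive exact version in Python (tokenization and weighting choices are illustrative):

    import math
    from collections import Counter

    def tfidf_vectors(strings):
        """Unit-length tf-idf vectors over a shared vocabulary (whitespace tokens)."""
        docs = [Counter(s.lower().split()) for s in strings]
        df = Counter(t for d in docs for t in d)
        n = len(docs)
        vecs = []
        for d in docs:
            v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in d.items()}
            norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
            vecs.append({t: w / norm for t, w in v.items()})
        return vecs

    def text_join(r1, r2, threshold):
        """Emit every pair (s1 in r1, s2 in r2) whose cosine similarity >= threshold."""
        vecs = tfidf_vectors(list(r1) + list(r2))
        v1, v2 = vecs[:len(r1)], vecs[len(r1):]
        for i, a in enumerate(v1):
            for j, b in enumerate(v2):
                sim = sum(w * b.get(t, 0.0) for t, w in a.items())
                if sim >= threshold:
                    yield r1[i], r2[j], sim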
Dynamic maintenance of web indexes using landmarks BIBAKFull-Text 102-111
  Lipyeow Lim; Min Wang; Sriram Padmanabhan; Jeffrey Scott Vitter; Ramesh Agarwal
Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.
   In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
Keywords: inverted files, update processing
High-performance spatial indexing for location-based services BIBAKFull-Text 112-117
  Jussi Myllymaki; James Kaufman
Much attention has been accorded to Location-Based Services and location tracking, a necessary component in active, trigger-based LBS applications. Tracking the location of a large population of moving objects requires very high update and query performance of the underlying spatial index. In this paper we investigate the performance and scalability of three main-memory based spatial indexing methods under dynamic update and query loads: an R-tree, a ZB-tree, and an array/hashtable method. By leveraging the LOCUS performance evaluation testbed and the City Simulator dynamic spatial data generator, we are able to demonstrate the scalability of these methods and determine the maximum population size supported by each method, a useful parameter for capacity planning by wireless carriers.
Keywords: location-based services, mobile computing, spatial indexing
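Of the three index structures compared, the array/hashtable scheme is simple enough to sketch directly; a toy grid index for moving objects (cell size and method names are our assumptions):

    from collections import defaultdict

    class GridIndex:
        """Hashtable-of-cells index supporting frequent location updates and
        rectangular range queries over moving objects."""

        def __init__(self, cell_size):
            self.cell = cell_size
            self.cells = defaultdict(set)      # (cx, cy) -> set of object ids
            self.pos = {}                      # object id -> (x, y)

        def _key(self, x, y):
            return (int(x // self.cell), int(y // self.cell))

        def update(self, oid, x, y):
            """Move an object to (x, y); expected O(1) per location update."""
            if oid in self.pos:
                self.cells[self._key(*self.pos[oid])].discard(oid)
            self.pos[oid] = (x, y)
            self.cells[self._key(x, y)].add(oid)

        def range_query(self, x1, y1, x2, y2):
            """Return ids of objects inside the axis-aligned rectangle."""
            out = []
            kx1, ky1 = self._key(x1, y1)
            kx2, ky2 = self._key(x2, y2)
            for cx in range(kx1, kx2 + 1):
                for cy in range(ky1, ky2 + 1):
                    for oid in self.cells[(cx, cy)]:
                        px, py = self.pos[oid]
                        if x1 <= px <= x2 and y1 <= py <= y2:
                            out.append(oid)
            return out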

Provisioning

Efficient and robust streaming provisioning in VPNs BIBAKFull-Text 118-127
  Z. Morley Mao; David Johnson; Oliver Spatscheck; Jacobus E. van der Merwe; Jia Wang
Today, most large companies maintain virtual private networks (VPNs) to connect their remote locations into a single secure network. VPNs can be quite large covering more than 1000 locations and in most cases use standard Internet protocols and services. Such VPNs are implemented using a diverse set of technologies such as Frame Relay, MPLS, or IPSEC to achieve the goal of privacy and performance isolation from the public Internet.
   Using VPNs to distribute live content has recently received tremendous interest. For example, a VPN could be used to broadcast a CEO-employee town hall meeting. To distribute this type of content economically without overloading the network, the deployment of streaming caches or splitters is most likely required.
   In this paper, we address the problem of optimally placing such streaming splitters or caches to broadcast to a given set of VPN endpoints under the constraints typically found within a VPN. In particular, we introduce an efficient algorithm with complexity O(V), V being the number of routers in the VPN. This guarantees the optimal cache placement if interception is used for redirection. We prove that the general problem is NP-hard and introduce multiple heuristics for efficient and robust cache placement suitable under different constraints. At the expense of increased implementation complexity, each heuristic solution provides additional savings in the number of caches required. We evaluate the proposed solutions using extensive simulations. In particular, we show that our flow-based solution is very close to the optimal.
Keywords: VPNs, streaming server placement
On admission control for profit maximization of networked service providers BIBAKFull-Text 128-137
  Akshat Verma; Sugata Ghosal
Variability and diverseness among incoming requests to a service hosted on a finite capacity resource necessitates sophisticated request admission control techniques for providing guaranteed quality of service (QoS). We propose in this paper a service time based online admission control methodology for maximizing profits of a service provider. The proposed methodology chooses a subset of incoming requests such that the revenue of the provider is maximized. Admission control decision in our proposed system is based upon an estimate of the service time of the request, QoS bounds, prediction of arrivals and service times of requests to come in the short-term future, and rewards associated with servicing a request within its QoS bounds. Effectiveness of the proposed admission control methodology is demonstrated using experiments with a content-based messaging middleware service.
Keywords: admission control, profit maximization, quality of service (QoS), service level agreement (SLA), service time estimation, short-term prediction, shortest remaining job first (SRJF), web service
Design, implementation, and evaluation of a client characterization driven web server BIBAKFull-Text 138-147
  Balachander Krishnamurthy; Yin Zhang; Craig E. Wills; Kashi Vishwanath
In earlier work we proposed a way for a Web server to detect connectivity information about clients accessing it in order to take tailored actions for a client request. This paper describes the design, implementation, and evaluation of such a working system. A Web site has a strong incentive to reduce the 'time-to-glass' to retain users who may otherwise lose interest and leave the site. We have performed a measurement study from multiple client sites around the world with various levels of connectivity to the Internet communicating with modified Apache Web servers under our control. The results show that clients can be classified in a correct and stable manner and that user-perceived latency can be reduced via tailored actions. Our measurements show that classification and determination of server actions are done without significant overhead on the Web server. We explore a variety of modified actions ranging from selecting a lower quality version of the resource to altering the manner of content delivery. By studying numerous performance related factors in a single unified framework and examining both individual actions as well as combination of actions, our modified Web server implementation shows the efficacy of various server actions.
Keywords: apache server, client classification, content delivery, httperf, server adaptation, web performance

Data integrity

Web application security assessment by fault injection and behavior monitoring BIBAKFull-Text 148-159
  Yao-Wen Huang; Shih-Kun Huang; Tsung-Po Lin; Chung-Hung Tsai
As a large and complex application platform, the World Wide Web is capable of delivering a broad range of sophisticated applications. However, many Web applications go through rapid development phases with extremely short turnaround time, making it difficult to eliminate vulnerabilities. Here we analyze the design of Web application security assessment mechanisms in order to identify poor coding practices that render Web applications vulnerable to attacks such as SQL injection and cross-site scripting. We describe the use of a number of software-testing techniques (including dynamic analysis, black-box testing, fault injection, and behavior monitoring), and suggest mechanisms for applying these techniques to Web applications. Real-world situations are used to test a tool we named the Web Application Vulnerability and Error Scanner (WAVES, an open-source project available at http://waves.sourceforge.net) and to compare it with other tools. Our results show that WAVES is a feasible platform for assessing Web application security.
Keywords: black-box testing, complete crawling, fault injection, security assessment, web application testing
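The fault-injection part of the approach can be illustrated with a very small probe that submits SQL-injection patterns to form fields and scans responses for database error signatures; WAVES adds crawling, behavior monitoring and a much larger pattern set. The URL, field names and signatures below are illustrative, and such probes should only be run against servers you are permitted to test:

    import urllib.error
    import urllib.parse
    import urllib.request

    INJECTION_PATTERNS = ["'", "' OR '1'='1", "'; --"]
    ERROR_SIGNATURES = ["sql syntax", "odbc", "unclosed quotation mark"]

    def probe_form(action_url, field_names):
        """Submit malformed values to each field in turn and report
        (field, payload) pairs whose responses look like database errors."""
        findings = []
        for field in field_names:
            for payload in INJECTION_PATTERNS:
                data = {f: ("test" if f != field else payload) for f in field_names}
                body = urllib.parse.urlencode(data).encode()
                try:
                    with urllib.request.urlopen(action_url, body) as resp:
                        text = resp.read().decode(errors="replace")
                except urllib.error.HTTPError as err:
                    text = err.read().decode(errors="replace")
                if any(sig in text.lower() for sig in ERROR_SIGNATURES):
                    findings.append((field, payload))
        return findings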
The HP time vault service: exploiting IBE for timed release of confidential information BIBAKFull-Text 160-169
  Marco Casassa Mont; Keith Harrison; Martin Sadler
Digital information is increasingly important for enabling interactions and transactions on the Internet. On the other hand, leakages of sensitive information can have harmful effects for people, enterprises and governments.
   This paper focuses on the problems of dealing with timed release of confidential information and simplifying its access once public: it is a common issue in the industry, government and day-to-day life.
   We introduce the "HP Time Vault Service", based on the emerging Identifier-based Encryption (IBE) cryptography schema. IBE (public) encryption keys specify the disclosure time. These keys are used to encrypt confidential information. An independent time server generates and publishes IBE decryption keys correspondent to the current time, at predefined intervals.
   We discuss the advantages of this approach against current approaches based on traditional cryptography. A web-service based prototype is described, as a proof of concept.
Keywords: disclosure policies, identifier-based encryption, privacy, security, timed-release, web service
Content extraction signatures using XML digital signatures and custom transforms on-demand BIBAKFull-Text 170-177
  Laurence Bull; Peter Stanski; David McG. Squire
Content Extraction Signatures (CES) enable selective disclosure of verifiable content, provide privacy for blinded content, and enable the signer to specify the content the document owner is allowed to extract or blind. Combined, these properties give what we call CES functionality. In this paper we describe our work in developing custom transform algorithms to expand the functionality of an XML Signature to include CES functionality in XML Signature Core Validation.
   We also describe a custom revocation mechanism and our implementation for non-XML content, where the custom transforms are dynamically loaded, demonstrating that custom signing and verification is not constrained to a 'closed system'. Through the use of dynamic loading we show that a verifier can still verify an XML Signature-compliant signature even though a custom signature was produced.
Keywords: .Net framework XML signature API, XML signature custom transforms, XML signatures, content extraction signatures, dynamic signature verification

Establishing the semantic web 1

SemTag and seeker: bootstrapping the semantic web via automated semantic annotation BIBAKFull-Text 178-186
  Stephen Dill; Nadav Eiron; David Gibson; Daniel Gruhl; R. Guha; Anant Jhingran; Tapas Kanungo; Sridhar Rajagopalan; Andrew Tomkins; John A. Tomlin; Jason Y. Zien
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date.
   We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
Keywords: automated semantic tagging, data mining, information retrieval, large text datasets, text analytics
Data extraction and label assignment for web databases BIBAKFull-Text 187-196
  Jiying Wang; Fred H. Lochovsky
Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
Keywords: HTML forms, automatic wrapper induction, data annotation, hidden web, information integration, web information extraction
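The regular-expression wrappers that DeLa induces automatically behave like the hand-written one in this small sketch (the page fragment and pattern are hypothetical):

    import re

    def extract_rows(result_page_html, row_pattern):
        """Apply a regular-expression wrapper to a result page and return the
        captured attribute values, one tuple per data record."""
        return re.findall(row_pattern, result_page_html, flags=re.S)

    page = "<tr><td>Faulkner</td><td>$12</td></tr><tr><td>Joyce</td><td>$15</td></tr>"
    print(extract_rows(page, r"<tr><td>(.*?)</td><td>\$(\d+)</td></tr>"))
    # -> [('Faulkner', '12'), ('Joyce', '15')]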
The chatty web: emergent semantics through gossiping BIBAKFull-Text 197-206
  Karl Aberer; Philippe Cudré-Mauroux; Manfred Hauswirth
This paper describes a novel approach for obtaining semantic interoperability among data sources in a bottom-up, semi-automatic manner without relying on pre-existing, global semantic models. We assume that large amounts of data exist that have been organized and annotated according to local schemas. Seeing semantics as a form of agreement, our approach enables the participating data sources to incrementally develop global agreement in an evolutionary and completely decentralized process that solely relies on pair-wise, local interactions: Participants provide translations between schemas they are interested in and can learn about other translations by routing queries (gossiping). To support the participants in assessing the semantic quality of the achieved agreements we develop a formal framework that takes into account both syntactic and semantic criteria. The assessment process is incremental and the quality ratings are adjusted along with the operation of the system. Ultimately, this process results in global agreement, i.e., the semantics that all participants understand. We discuss strategies to efficiently find translations and provide results from a case study to justify our claims. Our approach applies to any system which provides a communication infrastructure (existing websites or databases, decentralized systems, P2P systems) and offers the opportunity to study semantic interoperability as a global phenomenon in a network of information sharing parties.
Keywords: self-organization, semantic agreements, semantic integration

Adapting content to mobile devices

DOM-based content extraction of HTML documents BIBAKFull-Text 207-214
  Suhit Gupta; Gail Kaiser; David Neistadt; Peter Grimm
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques incorporating the advantages of previous work on content extraction. Our key insight is to work with the DOM trees, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.
Keywords: DOM trees, HTML documents, accessibility, content extraction, reformatting, speech rendering
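One of the simplest DOM-based clutter filters, removing subtrees that are mostly link text, can be sketched as follows; this is a crude stand-in for the paper's extensible filter set, and it assumes well-formed XHTML so that Python's ElementTree can parse it:

    from xml.etree import ElementTree as ET

    def link_text_ratio(node):
        """Fraction of a node's text that sits inside <a> elements."""
        total = sum(len(t) for t in node.itertext()) or 1
        linked = sum(len(t) for a in node.iter('a') for t in a.itertext())
        return linked / total

    def strip_clutter(root, threshold=0.6):
        """Drop subtrees that are mostly links (navigation bars, link lists)."""
        for parent in list(root.iter()):
            for child in list(parent):
                if child.tag in ('div', 'td', 'ul') and link_text_ratio(child) > threshold:
                    parent.remove(child)
        return root

    # root = ET.fromstring(xhtml_string)
    # print(ET.tostring(strip_clutter(root), encoding='unicode'))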
Fractal summarization for mobile devices to access large documents on the web BIBAKFull-Text 215-224
  Christopher C. Yang; Fu Lee Wang
Wireless access with mobile (or handheld) devices is a promising addition to the WWW and traditional electronic business. Mobile devices provide convenient and portable access to the huge information space on the Internet without requiring users to be stationary with a network connection. However, the limited screen size, narrow network bandwidth, small memory capacity and low computing power are the shortcomings of handheld devices. Loading and visualizing large documents on handheld devices is often impossible: the limited resolution restricts the amount of information that can be displayed, and the download time is intolerably long. In this paper, we introduce the fractal summarization model for document summarization on handheld devices. Fractal summarization is developed based on fractal theory. It generates a brief skeleton of the summary at the first stage, and the details of the summary at different levels of the document are generated on demand. Such interactive summarization reduces the computation load in comparison with generating the entire summary in one batch, as traditional automatic summarization does, which is ideal for wireless access. A three-tier architecture with the middle tier conducting the major computation is also discussed. Visualization of the summary on handheld devices is also investigated.
Keywords: document summarization, fractal summarization, handheld devices, mobile commerce
Detecting web page structure for adaptive viewing on small form factor devices BIBAKFull-Text 225-233
  Yu Chen; Wei-Ying Ma; Hong-Jiang Zhang
Mobile devices have already been widely used to access the Web. However, because most available web pages are designed with desktop PCs in mind, it is inconvenient to browse these large web pages on a mobile device with a small screen. In this paper, we propose a new browsing convention to facilitate navigation and reading on a small-form-factor device. A web page is organized into a two-level hierarchy, with a thumbnail representation at the top level providing a global view and an index to a set of sub-pages at the bottom level holding detailed information. A page adaptation technique is also developed to analyze the structure of an existing web page and split it into small and logically related units that fit into the screen of a mobile device. For a web page not suitable for splitting, auto-positioning or scrolling-by-block is used to assist the browsing as an alternative. Our experimental results show that our proposed browsing convention and developed page adaptation scheme greatly improve the user's browsing experience on a device with a small display.
Keywords: adaptive hypermedia, content adaptation, mobile browser

Writing the web

Supporting management reporting: a writable web case study BIBAKFull-Text 234-243
  Timothy Miles-Board; Leslie Carr; Simon Kampa; Wendy Hall
The World-Wide Web was originally developed as a shared, writable, hypertext medium, a facility that is still widely needed.
   We have recently developed a Web-based management reporting system for a legal firm in an attempt to improve the efficiency and management of their overall business process. This paper shares our experiences in relating the firm's specific writing and issue tracking tasks to existing Web, open hypermedia, and Semantic Web research, and describes why we chose to develop a new solution -- a set of open hypermedia components collectively called the Management Reporting System -- rather than employ an existing system.
Keywords: hypertext writing, management reporting, open hypermedia, structural computing
Scholarly publishing and argument in hyperspace BIBAKFull-Text 244-250
  Victoria Uren; Simon Buckingham Shum; Gangmin Li; John Domingue; Enrico Motta
The World Wide Web is opening up access to documents and data for scholars. However, it has not yet impacted on one of the primary activities in research: assessing new findings in the light of current knowledge and debating them with colleagues. The ClaiMaker system uses a directed graph model with similarities to hypertext, in which new ideas are published as nodes, which other contributors can build on or challenge in a variety of ways by linking to them. Nodes and links have semantic structure to facilitate the provision of specialist services for interrogating and visualizing the emerging network. By way of example, this paper is grounded in a ClaiMaker model to illustrate how new claims can be described in this structured way.
Keywords: collaborative web, modeling debate, scholarly interpretation, scientific publishing
Mining topic-specific concepts and definitions on the web BIBAKFull-Text 251-260
  Bing Liu; Chee Wee Chin; Hwee Tou Ng
Traditionally, when one wants to learn about a particular topic, one reads a book or a survey paper. With the rapid expansion of the Web, learning in-depth knowledge about a topic from the Web is becoming increasingly important and popular. This is also due to the Web's convenience and its richness of information. In many cases, learning from the Web may even be essential because in our fast changing world, emerging topics appear constantly and rapidly. There is often not enough time for someone to write a book on such topics. To learn such emerging topics, one can resort to research papers. However, research papers are often hard to understand by non-researchers, and few research papers cover every aspect of the topic. In contrast, many Web pages often contain intuitive descriptions of the topic. To find such Web pages, one typically uses a search engine. However, current search techniques are not designed for in-depth learning. Top ranking pages from a search engine may not contain any description of the topic. Even if they do, the description is usually incomplete since it is unlikely that the owner of the page has good knowledge of every aspect of the topic. In this paper, we attempt a novel and challenging task, mining topic-specific knowledge on the Web. Our goal is to help people learn in-depth knowledge of a topic systematically on the Web. The proposed techniques first identify those sub-topics or salient concepts of the topic, and then find and organize those informative pages, containing definitions and descriptions of the topic and sub-topics, just like those in a book. Experimental results using 28 topics show that the proposed techniques are highly effective.
Keywords: definition mining, domain concept mining, information integration, knowledge compilation, web content mining

Link-based ranking 1

Extrapolation methods for accelerating PageRank computations BIBAKFull-Text 261-270
  Sepandar D. Kamvar; Taher H. Haveliwala; Christopher D. Manning; Gene H. Golub
We present a novel algorithm for the fast computation of PageRank, a hyperlink-based estimate of the "importance" of Web pages. The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing the Web link graph. The algorithm presented here, called Quadratic Extrapolation, accelerates the convergence of the Power Method by periodically subtracting off estimates of the nonprincipal eigenvectors from the current iterate of the Power Method. In Quadratic Extrapolation, we take advantage of the fact that the first eigenvalue of a Markov matrix is known to be 1 to compute the nonprincipal eigenvectors using successive iterates of the Power Method. Empirically, we show that using Quadratic Extrapolation speeds up PageRank computation by 25-300% on a Web graph of 80 million nodes, with minimal overhead. Our contribution is useful to the PageRank community and the numerical linear algebra community in general, as it is a fast method for determining the dominant eigenvector of a matrix that is too large for standard fast methods to be practical.
Keywords: PageRank, eigenvector computation, link analysis
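The baseline iteration that Quadratic Extrapolation accelerates is the ordinary Power Method with teleportation; a numpy sketch of that baseline, with a marker where the periodic extrapolation step would adjust the iterate (the extrapolation itself, a small least-squares computation over recent iterates, is omitted here):

    import numpy as np

    def pagerank_power(links, alpha=0.85, tol=1e-8, max_iter=200):
        """Power Method PageRank on a dense 0/1 adjacency matrix
        (links[i, j] = 1 if page i links to page j)."""
        n = links.shape[0]
        out = links.sum(axis=1, keepdims=True)
        out[out == 0] = 1                 # avoid division by zero for dangling pages
        P = links / out                   # row-stochastic transition matrix
        x = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            x_new = alpha * (x @ P) + (1 - alpha) / n
            x_new /= x_new.sum()          # restores mass lost at dangling pages
            # <- every k-th iteration, Quadratic Extrapolation would subtract
            #    estimates of the non-principal eigenvectors from x_new here
            if np.abs(x_new - x).sum() < tol:
                return x_new
            x = x_new
        return x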
Scaling personalized web search BIBAKFull-Text 271-279
  Glen Jeh; Jennifer Widom
Recent web search techniques augment traditional text matching with a global notion of "importance" based on the linkage structure of the web, such as in Google's PageRank algorithm. For more refined searches, this global notion of importance can be specialized to create personalized views of importance -- for example, importance scores can be biased according to a user-specified set of initially-interesting pages. Computing and storing all possible personalized views in advance is impractical, as is computing personalized views at query time, since the computation of each view requires an iterative computation over the web graph. We present new graph-theoretical results, and a new technique based on these results, that encode personalized views as partial vectors. Partial vectors are shared across multiple personalized views, and their computation and storage costs scale well with the number of views. Our approach enables incremental computation, so that the construction of personalized views from partial vectors is practical at query time. We present efficient dynamic programming algorithms for computing partial vectors, an algorithm for constructing personalized views from partial vectors, and experimental results demonstrating the effectiveness and scalability of our techniques.
Keywords: PageRank, web search
Adaptive on-line page importance computation BIBAKFull-Text 280-290
  Serge Abiteboul; Mihai Preda; Gregory Cobena
The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance, or page rank, is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources (e.g. to store, maintain and read the link matrix). We introduce a new algorithm, OPIC, that works on-line and uses far fewer resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. Thus it can be used to focus crawling on the most interesting pages. We prove the correctness of OPIC. We present Adaptive OPIC, which also works on-line but adapts dynamically to changes of the web. A variant of this algorithm is now used by Xyleme.
   We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web.
Keywords: hyperlink, page importance, web graph
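A toy version of the on-line cash/history computation this abstract describes; the real algorithm adds a virtual page so that cash from pages without out-links is not lost, and comes with convergence guarantees:

    from collections import defaultdict

    def opic(graph, crawl_order, rounds=50):
        """graph: page -> list of out-links; crawl_order: function of
        (pages, cash) returning the order in which to visit pages."""
        pages = list(graph)
        cash = {p: 1.0 / len(pages) for p in pages}
        history = defaultdict(float)
        for _ in range(rounds):
            for p in crawl_order(pages, cash):
                history[p] += cash[p]                      # credit the visit
                share = cash[p] / max(len(graph[p]), 1)
                for q in graph[p]:                         # split cash among out-links
                    cash[q] += share
                cash[p] = 0.0
            # importance estimate at any time: history[p] / sum of all history
        total = sum(history.values()) or 1.0
        return {p: history[p] / total for p in pages}

    # A "greedy" crawling policy visits the pages holding the most cash first:
    # ranks = opic(g, lambda pages, cash: sorted(pages, key=cash.get, reverse=True))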

Applications and architecture

SHOCK: communicating with computational messages and automatic private profiles BIBAKFull-Text 291-300
  Rajan M. Lukose; Eytan Adar; Joshua R. Tyler; Caesar Sengupta
A computationally enhanced message contains some embedded programmatic components that are interpreted and executed automatically upon receipt. Unlike ordinary text email or instant messages, they make possible a number of useful applications. In this paper, we describe a general and flexible messaging system called SHOCK that extends the functionality of prior computational email systems by allowing XML-encoded SHOCK messages to interact with an automatically created profile of a user. These profiles consist of information about the most common tasks users perform, such as their Web browsing behavior, their conventional email usage, etc. Since users are sensitive about such data, the system is designed with privacy as a central design goal, and employs a distributed peer-to-peer architecture to achieve it. The system is largely implemented with commodity Web technologies and provides both a Web interface as well as one that is tightly integrated with users' ordinary email clients. With SHOCK, users can send highly targeted messages without violating others' privacy, and engage in structured conversation appropriate to the context without disrupting their existing work practices. We describe our implementation in detail, the most useful novel applications of the system, and our experiences with the system in a pilot field test.
Keywords: collaborative systems, networking and distributed web applications, privacy and preferences
P2Cast: peer-to-peer patching scheme for VoD service BIBAKFull-Text 301-309
  Yang Guo; Kyoungwon Suh; Jim Kurose; Don Towsley
Providing video on demand (VoD) service over the Internet in a scalable way is a challenging problem. In this paper, we propose P2Cast -- an architecture that uses a peer-to-peer approach to cooperatively stream video using patching techniques, while only relying on unicast connections among peers. We address the following two key technical issues in P2Cast: (1) constructing an application overlay appropriate for streaming; and (2) providing continuous stream playback (without glitches) in the face of disruption from an early departing client. Our simulation experiments show that P2Cast can serve many more clients than traditional client-server unicast service, and that it generally out-performs multicast-based patching if clients can cache more than of a stream's initial portion. We handle disruptions by delaying the start of playback and applying the shifted forwarding technique. A threshold on the length of time during which arriving clients are served in a single session in P2Cast serves as a knob to adjust the balance between the scalability and the clients' viewing quality in P2Cast.
Keywords: patching, peer-to-peer networks, performance evaluation, video on-demand service
DEW: DNS-enhanced web for faster content delivery BIBAFull-Text 310-320
  Balachander Krishnamurthy; Richard Liston; Michael Rabinovich
With a key component of latency on the Web being connection setup between clients and Web servers, several ways to avoid connections have been explored. While the work in recent years on Content Distribution Networks (CDNs) has moved some content 'closer' to users at the cost of increasing DNS traffic, it has not fully exploited the available unused potential of existing protocols. We explore ways by which a variety of Web responses can be piggybacked on DNS messages. While we evaluated our idea in the Web context, the approach is generic and not restricted to Web responses. We propose an architecture for HTTP piggybacking in DNS messages and carry out a detailed performance analysis based on a trace-driven simulation study. Our architecture requires minimal extensions to existing protocols, utilizing only the allowed optional fields for these extensions. It is fully compatible and can coexist with the current Web.

E-commerce

A system for principled matchmaking in an electronic marketplace BIBAKFull-Text 321-330
  Tommaso Di Noia; Eugenio Di Sciascio; Francesco M. Donini; Marina Mongiello
More and more resources are becoming available on the Web, and there is a growing need for infrastructures that, based on advertised descriptions, are able to semantically match demands with supplies.
   We formalize general properties a matchmaker should have, then we present a matchmaking facilitator, compliant with desired properties.
The system embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories. Experiments carried out show a good correspondence between user and system rankings.
Keywords: description logics, e-commerce, knowledge representation, matchmaking
A software framework for matchmaking based on semantic web technology BIBAKFull-Text 331-339
  Lei Li; Ian Horrocks
An important objective of the Semantic Web is to make Electronic Commerce interactions more flexible and automated. To achieve this, standardization of ontologies, message content and message protocols will be necessary.
   In this paper we investigate how Semantic Web and Web Services technologies can be used to support service advertisement and discovery in e-commerce. In particular, we describe the design and implementation of a service matchmaking prototype which uses a DAML-S based ontology and a Description Logic reasoner to compare ontology-based service descriptions. We also present the results of initial experiments testing the performance of this prototype implementation in a realistic agent-based e-commerce scenario.
Keywords: ontologies, semantic web, web services
SweetDeal: representing agent contracts with exceptions using XML rules, ontologies, and process descriptions BIBAKFull-Text 340-349
  Benjamin N. Grosof; Terrence C. Poon
SweetDeal is a rule-based approach to representation of business contracts that enables software agents to create, evaluate, negotiate, and execute contracts with substantial automation and modularity. It builds upon the situated courteous logic programs knowledge representation in RuleML, the emerging standard for Semantic Web XML rules. Here, we newly extend the SweetDeal approach by also incorporating process knowledge descriptions whose ontologies are represented in DAML+OIL (emerging standard for Semantic Web ontologies) thereby enabling more complex contracts with behavioral provisions, especially for handling exception conditions (e.g., late delivery or non-payment) that might arise during the execution of the contract. This provides a foundation for representing and automating deals about services -- in particular, about Web Services, so as to help search, select, and compose them. Our system is also the first to combine emerging Semantic Web standards for knowledge representation of rules (RuleML) with ontologies (DAML+OIL) for a practical e-business application domain, and further to do so with process knowledge. This also newly fleshes out the evolving concept of Semantic Web Services. A prototype (soon public) is running.
Keywords: DAML+OIL, OWL, RDF, XML, business process automation, declarative, description logic, electronic commerce, electronic contracts, intelligent software agents, knowledge representation, knowledge-based, logic programs, ontologies, process descriptions, process knowledge, rules, semantic web, semantic web services, web services

Link-based ranking 2

A new paradigm for ranking pages on the world wide web BIBAKFull-Text 350-355
  John A. Tomlin
This paper describes a new paradigm for modeling traffic levels on the world wide web (WWW) using a method of entropy maximization. This traffic is subject to the conservation conditions of a circulation flow in the entire WWW, an aggregation of the WWW, or a subgraph of the WWW (such as an intranet or extranet). We specifically apply the primal and dual solutions of this model to the (static) ranking of web sites. The first of these uses an imputed measure of total traffic through a web page, the second provides an analogy of local "temperature", allowing us to quantify the "HOTness" of a page.
Keywords: entropy, optimization, search engines, static ranking
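In our notation (not necessarily the paper's), the model this abstract outlines can be written as an entropy-maximization problem over link traffic shares p_{ij}, constrained to form a circulation:

    maximize    - \sum_{(i,j) \in E} p_{ij} \ln p_{ij}
    subject to  \sum_{i:(i,j) \in E} p_{ij} = \sum_{k:(j,k) \in E} p_{jk}   for every page j
                \sum_{(i,j) \in E} p_{ij} = 1,   p_{ij} >= 0

The primal solution ranks page j by its imputed traffic \sum_i p_{ij}; the dual variables of the conservation constraints give the local "temperature" analogy mentioned in the abstract.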
Adaptive ranking of web pages BIBAKFull-Text 356-365
  Ah Chung Tsoi; Gianni Morini; Franco Scarselli; Markus Hagenbuchner; Marco Maggini
In this paper, we consider the possibility of altering the PageRank of web pages, from an administrator's point of view, through the modification of the PageRank equation. It is shown that this problem can be formulated and solved using traditional quadratic programming techniques. In addition, it is shown that the number of parameters can be reduced by clustering web pages together through simple clustering techniques. It is demonstrated experimentally on a relatively large web data set, viz., the WT10G, that it is possible to modify the PageRanks of the web pages through the proposed method using a set of linear constraints. It is also shown that the PageRank of other pages may be affected, and that the quality of the result depends on the clustering technique used. Our results compare well with those obtained by a HITS-based method.
Keywords: PageRank, adaptive PageRank determinations, learning PageRank, quadratic programming applications, search engine
Searching the workplace web BIBAFull-Text 366-375
  Ronald Fagin; Ravi Kumar; Kevin S. McCurley; Jasmine Novak; D. Sivakumar; John A. Tomlin; David P. Williamson
The social impact of the World Wide Web cannot be overstated, but the technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite different, and the determination of a "good answer" for intranet search is quite different than on the Internet. In this paper we study the problem of intranet search. Our approach focuses on the use of rank aggregation, and allows us to examine the effects of different heuristics on the ranking of search results.

Multimedia

Peer-to-peer architecture for content-based music retrieval on acoustic data BIBAKFull-Text 376-383
  Cheng Yang
In traditional peer-to-peer search networks, operations focus on properly labeled files such as music or video, and the actual search is often limited to text tags. The explosive growth of available multimedia documents in recent years calls for more flexible search capabilities, namely search by content. Most content-based search algorithms are computationally intensive, making them inappropriate for a peer-to-peer environment. In this paper, we discuss a content-based music retrieval algorithm that can be decomposed and parallelized efficiently. We present a peer-to-peer architecture for such a system that makes use of spare resources among subscribers, with protocols that dynamically redistribute load in order to maximize throughput and minimize inconvenience to subscribers. Our framework can be extended beyond the music retrieval domain and adapted to other scenarios where resource pooling is desired, as long as the underlying algorithm satisfies certain conditions.
Keywords: acoustic data, content-based music retrieval, distributed, load balancing, peer-to-peer, resource pooling
Towards a multimedia formatting vocabulary BIBAKFull-Text 384-393
  Jacco van Ossenbruggen; Lynda Hardman; Joost Geurts; Lloyd Rutledge
Time-based, media-centric Web presentations can be described declaratively in the XML world through the development of languages such as SMIL. It is difficult, however, to fully integrate them in a complete document transformation processing chain. In order to achieve the desired processing of data-driven, time-based, media-centric presentations, the text-flow based formatting vocabularies used by style languages such as XSL, CSS and DSSSL need to be extended. The paper presents a selection of use cases which are used to derive a list of requirements for a multimedia style and transformation formatting vocabulary. The boundaries of applicability of existing text-based formatting models for media-centric transformations are analyzed. The paper then discusses the advantages and disadvantages of a fully-fledged time-based multimedia formatting model. Finally, the discussion is illustrated by describing the key properties of the example multimedia formatting vocabulary currently implemented in the back-end of our Cuypers multimedia transformation engine.
Keywords: Cuypers, document transformation, formatting objects, hyper-media, multimedia
Architecture of a quality based intelligent proxy (QBIX) for MPEG-4 videos BIBAKFull-Text 394-402
  Peter Schojer; Laszlo Böszörmenyi; Hermann Hellwagner; Bernhard Penz; Stefan Podlipnig
Due to the increasing availability and use of digital video data on the Web, video caching will be an important performance factor in the future WWW. We propose an architecture of a video proxy cache that integrates modern multimedia and communication standards. In particular, we describe features of the MPEG-4 and MPEG-7 multimedia standards that can be helpful for a video proxy cache. QBIX supports real-time adaptation in the compressed and in the decompressed domain. It uses adaptation to improve the cache replacement strategies in the proxy, but also to realize media gateway functionality driven by the clients' terminal capabilities.
Keywords: LRU, MPEG-4, MPEG-7, RTP, RTSP, media adaptation, media gateway, replacement, video caching, video proxy

Web engineering

Conversation specification: a new approach to design and analysis of e-service composition BIBAKFull-Text 403-410
  Tevfik Bultan; Xiang Fu; Richard Hull; Jianwen Su
This paper introduces a framework for modeling and specifying the global behavior of e-service compositions. Under this framework, peers (individual e-services) communicate through asynchronous messages and each peer maintains a queue for incoming messages. A global "watcher" keeps track of messages as they occur. We propose and study a central notion of a "conversation", which is a sequence of (classes of) messages observed by the watcher. We consider the case where the peers are represented by Mealy machines (finite state machines with input and output). The sets of conversations exhibit unexpected behaviors. For example, there exists a composite e-service based on Mealy peers whose set of conversations is not context free (and not regular). (The set of conversations is always context sensitive.) One cause for this is the queuing of messages; we introduce an operator "prepone" that simulates queue delays from a global perspective and show that the set of conversations of each Mealy e-service is closed under prepone. We illustrate that the global prepone fails to completely capture the queue delay effects and refine prepone to a "local" version on conversations seen by individual peers. On the other hand, Mealy implementations of a composite e-service will always generate conversations whose "projections" are consistent with individual e-services. We use projection-join to reflect such situations. However, there are still Mealy peers whose set of conversations is not the local prepone and projection-join closure of any regular language. Therefore, we propose conversation specifications as a formalism to define the conversations allowed by an e-service composition. We give two technical results concerning the interplay between the local behaviors of Mealy peers and the global behaviors of their compositions. One result shows that for each regular language, its local prepone and projection-join closure corresponds to the set of conversations by some Mealy peers effectively constructed from it. The second result gives a condition on the shape of a composition which guarantees that the set of conversations that can be realized is the local prepone and projection-join closure of a regular language.
Keywords: communicating finite state automata, conversation specification, e-service composition
Quality driven web services composition BIBAKFull-Text 411-421
  Liangzhao Zeng; Boualem Benatallah; Marlon Dumas; Jayant Kalagnanam; Quan Z. Sheng
The process-driven composition of Web services is emerging as a promising approach to integrate business applications within and across organizational boundaries. In this approach, individual Web services are federated into composite Web services whose business logic is expressed as a process model. The tasks of this process model are essentially invocations to functionalities offered by the underlying component services. Usually, several component services are able to execute a given task, although with different levels of pricing and quality. In this paper, we advocate that the selection of component services should be carried out during the execution of a composite service, rather than at design-time. In addition, this selection should consider multiple criteria (e.g., price, duration, reliability), and it should take into account global constraints and preferences set by the user (e.g., budget constraints). Accordingly, the paper proposes a global planning approach to optimally select component services during the execution of a composite service. Service selection is formulated as an optimization problem which can be solved using efficient linear programming methods. Experimental results show that this global planning approach outperforms approaches in which the component services are selected individually for each task in a composite service.
Keywords: QoS, service composition, web services
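As a rough illustration of the selection problem (not the paper's linear-programming formulation), the sketch below enumerates assignments of hypothetical candidate services to tasks and picks the assignment that maximizes a weighted quality score subject to a global budget constraint:
    from itertools import product

    # Hypothetical candidates per task: (name, price, duration, reliability).
    CANDIDATES = {
        "book_flight": [("airA", 300, 2.0, 0.95), ("airB", 250, 3.5, 0.90)],
        "book_hotel":  [("hotA", 120, 1.0, 0.99), ("hotB", 80, 1.5, 0.92)],
    }
    WEIGHTS = {"duration": -0.4, "reliability": 0.6}   # user preferences over criteria

    def plan(candidates, budget):
        tasks = list(candidates)
        best, best_score = None, float("-inf")
        for choice in product(*(candidates[t] for t in tasks)):
            if sum(c[1] for c in choice) > budget:       # global budget constraint
                continue
            score = sum(WEIGHTS["duration"] * c[2] + WEIGHTS["reliability"] * c[3]
                        for c in choice)
            if score > best_score:
                best_score = score
                best = {t: c[0] for t, c in zip(tasks, choice)}
        return best

    print(plan(CANDIDATES, budget=400))   # best feasible assignment of services to tasks
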
A foundation for tool based mobility support for visually impaired web users BIBAKFull-Text 422-430
  Yeliz Yesilada; Robert Stevens; Carole Goble
Users make journeys through the Web. Web travel encompasses the tasks of orientation and navigation, the environment and the purpose of the journey. The ease of travel, its mobility, varies from page to page and site to site. For visually impaired users, in particular, mobility is reduced; the objects that support travel are inaccessible or missing altogether. Web development tools need to include support to increase mobility. We present a framework for finding and classifying travel objects within Web pages. The evaluation carried out has shown that this framework supports a systematic and consistent method for assessing travel upon the Web. We propose that such a framework can provide the foundation for a semi-automated tool for the support of travel upon the Web.
Keywords: mobility, mobility support tool, travel, travel objects, visual impairment

Establishing the semantic web 2

On deep annotation BIBAKFull-Text 431-438
  Siegfried Handschuh; Steffen Staab; Raphael Volz
The success of the Semantic Web crucially depends on the easy creation, integration and use of semantic data. For this purpose, we consider an integration scenario that defies core assumptions of current metadata construction methods. We describe a framework of metadata creation for the case where web pages are generated from a database and the database owner is cooperatively participating in the Semantic Web. This leads us to the definition of ontology mapping rules by manual semantic annotation and the usage of the mapping rules and of web services for semantic queries. In order to create metadata, the framework combines the presentation layer with the data description layer -- in contrast to "conventional" annotation, which remains at the presentation layer. Therefore, we refer to the framework as deep annotation. We consider deep annotation particularly valid because (i) web pages generated from databases outnumber static web pages, (ii) annotation of web pages may be a very intuitive way to create semantic data from a database, and (iii) data from databases should not be materialized as RDF files; it should remain where it can be handled most efficiently -- in its databases.
Keywords: annotation, information integration, mapping and merging, metadata, semantic web, wrapping
An infrastructure for searching, reusing and evolving distributed ontologies BIBAKFull-Text 439-448
  A. Maedche; B. Motik; L. Stojanovic; R. Studer; R. Volz
The vision of the Semantic Web can only be realized through proliferation of well-known ontologies describing different domains. To enable interoperability in the Semantic Web, it will be necessary to break these ontologies down into smaller, well-focused units that may be reused. Currently, three problems arise in that scenario. Firstly, it is difficult to locate ontologies to be reused, thus leading to many ontologies modeling the same thing. Secondly, current tools do not provide means for reusing existing ontologies while building new ontologies. Finally, ontologies are rarely static, but are being adapted to changing requirements. Hence, an infrastructure for management of ontology changes, taking into account dependencies between ontologies is needed. In this paper we present such an infrastructure addressing the aforementioned problems.
Keywords: ontology evolution, ontology registry, ontology reuse

Consistency and replication

Application specific data replication for edge services BIBAKFull-Text 449-460
  Lei Gao; Mike Dahlin; Amol Nayate; Jiandan Zheng; Arun Iyengar
The emerging edge services architecture promises to improve the availability and performance of web services by replicating servers at geographically distributed sites. A key challenge in such systems is data replication and consistency, so that edge server code can manipulate shared data without the availability and performance penalties of accessing a traditional centralized database. This paper explores using a distributed object architecture to build an edge service system for an e-commerce application, an online bookstore represented by the TPC-W benchmark. We take advantage of application-specific semantics to design distributed objects that each manage a specific subset of shared information using simple and effective consistency models. Our experimental results show that by slightly relaxing consistency within individual distributed objects, we can build an edge service system that is highly available and efficient. For example, in one experiment we find that our object-based edge server system provides a factor of five improvement in response time over a traditional centralized cluster architecture and a factor of nine improvement over an edge service system that distributes code but retains a centralized database.
Keywords: availability, data replication, distributed objects, edge services, performance, wide area networks (WAN)
Evaluation of edge caching/offloading for dynamic content delivery BIBAKFull-Text 461-471
  Chun Yuan; Yu Chen; Zheng Zhang
As dynamic content becomes increasingly dominant, an important research question is how edge resources such as client-side proxies, which are otherwise underutilized for such content, can be put to use. However, it is unclear what the best strategy is and what design/deployment tradeoffs lie therein. In this paper, using one representative e-commerce benchmark, we report our experience from an extensive investigation of different offloading and caching options. Our results show that, while great benefits can be reached in general, advanced offloading strategies can be overly complex and even counter-productive. In contrast, simple augmentation at proxies to enable fragment caching and page composition achieves most of the benefit without compromising important considerations such as security.
Keywords: dynamic content, edge caching, offloading
Modeling redirection in geographically diverse server sets BIBAKFull-Text 472-481
  Lisa Amini; Anees Shaikh; Henning Schulzrinne
Internet server selection mechanisms attempt to optimize, subject to a variety of constraints, the distribution of client requests to a geographically and topologically diverse pool of servers. Research on server selection has thus far focused primarily on techniques for choosing a server from a group administered by a single entity, such as a content distribution network provider. In a federated, multi-provider computing system, however, selection must occur over distributed server sets deployed by the participating providers, without the benefit of the full information available in the single-provider case. Intelligent server set selection algorithms will require a model of the expected performance clients would receive from a candidate server set.
   In this paper, we study whether the complex policies and dynamics of intelligent server selection can be effectively modeled in order to predict client performance for server sets. We introduce a novel server set distance metric, and use it in a measurement study of several million server selection transactions to develop simple models of existing server selection schemes. We then evaluate these models in terms of their ability to accurately predict performance for a second, larger set of distributed clients. We show that our models are able to predict performance within 20ms for over 90% of the observed samples. Our analysis demonstrates that although existing deployments use a variety of complex and dynamic server selection criteria, most of which are proprietary, these schemes can be modeled with surprising accuracy.
Keywords: content distribution network (CDN), performance, server selection, web traffic redirection

Open hypermedia and the web

Offering open hypermedia services to the WWW: a step-by-step approach for developers BIBAKFull-Text 482-489
  Nikos Karousos; Ippokratis Pandis; Siegfried Reich; Manolis Tzagarakis
Hypermedia systems and more specifically open hypermedia systems (OHS) provide a rich set of implementations of different hypertext flavors such as navigational hypertext, spatial hypertext or taxonomic hypertext. Additionally, these systems offer component-based modular architectures and address interoperability between hypertext domains. Despite multiple efforts to integrate Web clients, widespread adoption of OHS technology by Web developers has not taken place. In this paper it is argued that Web Services -- which offer a component model for Web applications -- can be integrated into OHSs. An architectural integration is proposed, a step-by-step process is outlined and an example of integration is provided. This approach aims to benefit both worlds: the Web community gains rich hypermedia functionality that extends the current navigational hypermedia, and the OHS community opens its tools and platforms to the many developer groups of the Web community.
Keywords: babylon system, hypermedia services, open hypermedia systems, web services
Xspect: bridging open hypermedia and XLink BIBAKFull-Text 490-499
  Bent G. Christensen; Frank Allan Hansen; Niels Olof Bouvin
This paper evaluates the XLink format in comparison with other linking formats. The comparison is based on Xspect, an implementation of XLink. Xspect handles transformation between an open hypermedia format (OHIF) and XLink, and the paper discusses this isomorphic transformation and generalises it to include another open hypermedia format, FOHM. The Xspect system, based on XSLT and JavaScript, provides users with an interface to browse and merge linkbases. Xspect supports navigational hypermedia in the form of links inserted on the fly into Web pages, as well as guided tours presented as SVG. Xspect has two implementations: one server-side and one running on the client. Both implementations provide the user with an interface for the creation of annotations. The main result of the paper is a critique of XLink. XLink is shown to be a format well suited for navigational hypermedia, but lacking in more advanced constructs. More problematic are the issues regarding large-scale use, such as evaluating the validity and credibility of linkbases, and ensuring general support for a format as flexible as XLink.
Keywords: FOHM, OHIF, SVG, XLink, XPointer, Xspect, annotations, open hypermedia
The XML web: a first study BIBAKFull-Text 500-510
  Laurent Mignet; Denilson Barbosa; Pierangelo Veltri
Although originally designed for large-scale electronic publishing, XML plays an increasingly important role in the exchange of data on the Web. In fact, it is expected that XML will become the lingua franca of the Web, eventually replacing HTML. Not surprisingly, there has been a great deal of interest in XML both in industry and in academia. Nevertheless, to date no comprehensive study of the XML Web (i.e., the subset of the Web made of XML documents only) or of its contents has been made. This paper is the first attempt at describing the XML Web and the documents contained in it. Our results are drawn from a sample of a repository of the publicly available XML documents on the Web, consisting of about 200,000 documents. Our results show that, despite its short history, XML already permeates the Web, both in terms of generic domains and geographically. Also, our results about the contents of the XML Web provide valuable input for the design of algorithms, tools and systems that use XML in one form or another.
Keywords: XML documents, XML web, statistical analysis, structural properties

Data mining

A matrix density based algorithm to hierarchically co-cluster documents and words BIBAFull-Text 511-518
  Bhushan Mandhani; Sachindra Joshi; Krishna Kummamuru
This paper proposes an algorithm to hierarchically cluster documents. Each cluster is actually a cluster of documents and an associated cluster of words, thus a document-word co-cluster. Note that the vector model for documents creates the document-word matrix, of which every co-cluster is a submatrix. One would intuitively expect a submatrix made up of high values to be a good document cluster, with the corresponding word cluster containing its most distinctive features. Our algorithm looks to exploit this. We define matrix density, and the algorithm uses matrix density considerations throughout its working.
   The algorithm is a partitional-agglomerative algorithm. The partitioning step involves the identification of dense submatrices so that the respective row sets partition the row set of the complete matrix. The hierarchical agglomerative step involves merging the most "similar" submatrices until we are down to the required number of clusters (if we want a flat clustering) or until we have just the single complete matrix left (if we are interested in a hierarchical arrangement of documents). It also generates apt labels for each cluster or hierarchy node. The similarity measure between clusters that we use here for the merging cleverly uses the fact that the clusters here are co-clusters, and is a key point of difference from existing agglomerative algorithms. We will refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning. We have also evaluated some hierarchies generated by the algorithm.
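The density idea can be illustrated with a toy document-word matrix (hypothetical numbers; this is not the RPSA algorithm itself, only the quantity it is built around):
    import numpy as np

    def density(m, rows, cols):
        """Mean entry of the submatrix induced by a document set and a word set."""
        return m[np.ix_(sorted(rows), sorted(cols))].mean()

    # Toy document-word matrix; rows are documents, columns are words.
    M = np.array([[3, 2, 0, 0],
                  [2, 4, 1, 0],
                  [0, 0, 5, 3],
                  [0, 1, 4, 2]], dtype=float)

    c1 = ({0, 1}, {0, 1})          # co-cluster: documents 0-1 with words 0-1
    c2 = ({2, 3}, {2, 3})          # co-cluster: documents 2-3 with words 2-3
    print(density(M, *c1))         # 2.75  -- a dense, plausible co-cluster
    print(density(M, *c2))         # 3.5
    merged = (c1[0] | c2[0], c1[1] | c2[1])
    print(density(M, *merged))     # 1.6875 -- merging dilutes density, so keep them apart
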
Mining the peanut gallery: opinion extraction and semantic classification of product reviews BIBAKFull-Text 519-528
  Kushal Dave; Steve Lawrence; David M. Pennock
The web contains a wealth of product reviews, but sifting through them is a daunting task. Ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). We begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. Our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. The best methods work as well as or better than traditional machine learning. When operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. But in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful.
Keywords: document classification, opinion mining
Mining newsgroups using networks arising from social behavior BIBAKFull-Text 529-535
  Rakesh Agrawal; Sridhar Rajagopalan; Ramakrishnan Srikant; Yirong Xu
Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy information than text. We investigate the feasibility of applying link-based methods in new application domains. The specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. This social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. An interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text.
Keywords: data mining, link analysis, newsgroup, social network, text mining, web mining
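A simplified sketch of the underlying idea (spectral bisection of a toy "responded-to" graph; the paper's constrained graph-partitioning formulation is more involved):
    import numpy as np

    # Toy "responded-to" graph: an edge means one author replied to (likely disagreed with) another.
    authors = ["a", "b", "c", "d", "e", "f"]
    edges = [("a", "d"), ("a", "e"), ("b", "d"), ("b", "f"), ("c", "e"), ("c", "f")]

    idx = {name: i for i, name in enumerate(authors)}
    A = np.zeros((len(authors), len(authors)))
    for u, v in edges:
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

    # For a (near-)bipartite disagreement graph, the eigenvector belonging to the most
    # negative adjacency eigenvalue takes opposite signs on the two camps.
    vals, vecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    v = vecs[:, 0]
    camp1 = [a for a in authors if v[idx[a]] >= 0]
    camp2 = [a for a in authors if v[idx[a]] < 0]
    print(camp1, camp2)                # splits {'a','b','c'} from {'d','e','f'} (sides may swap)
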

Scaling up the semantic web

Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks BIBAKFull-Text 536-543
  Wolfgang Nejdl; Martin Wolpers; Wolf Siberski; Christoph Schmitz; Mario Schlosser; Ingo Brunkhorst; Alexander Löser
RDF-based P2P networks have a number of advantages compared with simpler P2P networks such as Napster, Gnutella or with approaches based on distributed indices such as CAN and CHORD. RDF-based P2P networks allow complex and extendable descriptions of resources instead of fixed and limited ones, and they provide complex query facilities against these metadata instead of simple keyword-based searches.
   In previous papers, we have described the Edutella infrastructure and different kinds of Edutella peers implementing such an RDF-based P2P network. In this paper we will discuss these RDF-based P2P networks as a specific example of a new type of P2P networks, schema-based P2P networks, and describe the use of super-peer based topologies for these networks. Super-peer based networks can provide better scalability than broadcast based networks, and do provide perfect support for inhomogeneous schema-based networks, which support different metadata schemas and ontologies (crucial for the Semantic Web). Furthermore, as we will show in this paper, they are able to support sophisticated routing and clustering strategies based on the metadata schemas, attributes and ontologies used. Especially helpful in this context is the RDF functionality to uniquely identify schemas, attributes and ontologies. The resulting routing indices can be built using dynamic frequency counting algorithms and support local mediation and transformation rules, and we will sketch some first ideas for implementing these advanced functionalities as well.
Keywords: distributed RDF repositories, peer-to-peer, schema-based routing, semantic web
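The routing-index idea can be sketched as a hypothetical simplification: a super-peer records, by frequency counting, which schema/attribute URIs each peer advertises, and forwards a query only to peers that cover all of the query's URIs:
    from collections import defaultdict

    class RoutingIndex:
        """Hypothetical super-peer routing index keyed by schema/attribute URIs."""

        def __init__(self):
            self.freq = defaultdict(lambda: defaultdict(int))   # uri -> peer -> count

        def register(self, peer, uris):
            """Called when a peer advertises metadata (or answers queries) using these URIs."""
            for uri in uris:
                self.freq[uri][peer] += 1

        def route(self, query_uris):
            """Return the peers whose advertised schema elements cover the whole query."""
            candidates = None
            for uri in query_uris:
                peers = set(self.freq.get(uri, {}))
                candidates = peers if candidates is None else candidates & peers
            return candidates or set()

    index = RoutingIndex()
    index.register("peer1", ["http://purl.org/dc/elements/1.1/title",
                             "http://purl.org/dc/elements/1.1/creator"])
    index.register("peer2", ["http://purl.org/dc/elements/1.1/title"])
    print(index.route(["http://purl.org/dc/elements/1.1/title",
                       "http://purl.org/dc/elements/1.1/creator"]))   # {'peer1'}
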
On labeling schemes for the semantic web BIBAFull-Text 544-555
  Vassilis Christophides; Dimitris Plexousakis; Michel Scholl; Sotirios Tourtounis
This paper focuses on the optimization of the navigation through voluminous subsumption hierarchies of topics employed by Portal Catalogs like Netscape Open Directory (ODP). We advocate for the use of labeling schemes for modeling these hierarchies in order to efficiently answer queries such as subsumption check, descendants, ancestors or nearest common ancestor, which usually require costly transitive closure computations. We first give a qualitative comparison of three main families of schemes, namely bit vector, prefix and interval based schemes. We then show that two labeling schemes are good candidates for an efficient implementation of label querying using standard relational DBMS, namely, the Dewey Prefix scheme [6] and an Interval scheme by Agrawal, Borgida and Jagadish [1]. We compare their storage and query evaluation performance for the 16 ODP hierarchies using the PostgreSQL engine.
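The interval-labeling idea can be sketched in a few lines (a generic pre/post-order scheme over a toy hierarchy, not the exact encodings of [1] or [6]): each topic gets an interval from a depth-first traversal, and subsumption reduces to interval containment, with no transitive-closure computation:
    def label(tree, root):
        """Assign (start, end) intervals by DFS; a descendant's interval nests inside its ancestor's."""
        intervals = {}
        counter = 0
        def dfs(node):
            nonlocal counter
            start = counter
            counter += 1
            for child in tree.get(node, ()):
                dfs(child)
            intervals[node] = (start, counter)
            counter += 1
        dfs(root)
        return intervals

    def subsumes(intervals, anc, desc):
        """Ancestor test by interval containment."""
        a, d = intervals[anc], intervals[desc]
        return a[0] <= d[0] and d[1] <= a[1]

    # Toy topic hierarchy in the spirit of an ODP fragment.
    topics = {"Top": ["Arts", "Science"], "Arts": ["Music"], "Science": ["Biology", "Physics"]}
    iv = label(topics, "Top")
    print(subsumes(iv, "Science", "Physics"))   # True
    print(subsumes(iv, "Arts", "Physics"))      # False
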
Piazza: data management infrastructure for semantic web applications BIBAKFull-Text 556-567
  Alon Y. Halevy; Zachary G. Ives; Peter Mork; Igor Tatarinov
The Semantic Web envisions a World Wide Web in which data is described with rich semantics and applications can pose complex queries. To this point, researchers have defined new languages for specifying meanings for concepts and developed techniques for reasoning about them, using RDF as the data model. To flourish, the Semantic Web needs to be able to accommodate the huge amounts of existing data and the applications operating on them. To achieve this, we are faced with two problems. First, most of the world's data is available not in RDF but in XML; XML and the applications consuming it rely not only on the domain structure of the data, but also on its document structure. Hence, to provide interoperability between such sources, we must map between both their domain structures and their document structures. Second, data management practitioners often prefer to exchange data through local point-to-point data translations, rather than mapping to common mediated schemas or ontologies.
   This paper describes the Piazza system, which addresses these challenges. Piazza offers a language for mediating between data sources on the Semantic Web, which maps both the domain structure and the document structure. Piazza also enables interoperation of XML data with RDF data that is accompanied by rich OWL ontologies. Mappings in Piazza are provided at a local scale between small sets of nodes, and our query answering algorithm is able to chain sets of mappings together to obtain relevant data from across the Piazza network. We also describe an implemented scenario in Piazza and the lessons we learned from it.
Keywords: XML, peer data management systems, semantic web

Dynamic services and analysis

On the bursty evolution of blogspace BIBAFull-Text 568-576
  Ravi Kumar; Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins
We propose two new tools to address the evolution of hyperlinked corpora. First, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. Second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. We develop these tools in the context of Blogspace, the space of weblogs (or blogs). Our study involves approximately 750K links among 25K blogs. We create a time graph on these blogs by an automatic analysis of their internal time stamps. We then study the evolution of connected component structure and microscopic community structure in this time graph. We show that Blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. This expansion shows no sign of abating, although measures of connectedness must plateau within two years. By randomizing link destinations in Blogspace, but retaining sources and timestamps, we introduce a concept of randomized Blogspace. Herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. Having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. We extend recent work of Kleinberg [11] to discover dense periods of "bursty" intra-community link creation.
Make it fresh, make it quick: searching a network of personal webservers BIBAKFull-Text 577-586
  Mayank Bawa; Roberto J. Bayardo, Jr.; Sridhar Rajagopalan; Eugene J. Shekita
Personal webservers have proven to be a popular means of sharing files and peer collaboration. Unfortunately, the transient availability and rapidly evolving content on such hosts render centralized, crawl-based search indices stale and incomplete. To address this problem, we propose YouSearch, a distributed search application for personal webservers operating within a shared context (e.g., a corporate intranet). With YouSearch, search results are always fast, fresh and complete -- properties we show arise from an architecture that exploits both the extensive distributed resources available at the peer webservers in addition to a centralized repository of summarized network state. YouSearch extends the concept of a shared context within web communities by enabling peers to aggregate into groups and users to search over specific groups. In this paper, we describe the challenges, design, implementation and experiences with a successful intranet deployment of YouSearch.
Keywords: P2P, decentralized systems, information communities, intranet search, peer-to-peer networks, web search
Engineering and hosting adaptive freshness-sensitive web applications on data centers BIBAKFull-Text 587-598
  Wen-Syan Li; Oliver Po; Wang-Pin Hsiung; K. Selçuk Candan; Divyakant Agrawal
Wide-area database replication technologies and the availability of content delivery networks allow Web applications to be hosted and served from powerful data centers. This form of application support requires a complete Web application suite to be distributed along with the database replicas. A major advantage of this approach is that dynamic content is served from locations closer to users, leading to reduced network latency and fast response times. However, this is achieved at the expense of overheads due to (a) invalidation of cached dynamic content in the edge caches and (b) synchronization of database replicas in the data center. These have adverse effects on the freshness of delivered content. In this paper, we propose freshness-driven adaptive dynamic content caching, which monitors the system status and adjusts caching policies to provide content freshness guarantees. The proposed technique has been extensively evaluated to validate its effectiveness. The experimental results show that the freshness-driven adaptive dynamic content caching technique consistently provides good content freshness. Furthermore, even a Web site that already enables dynamic content caching can further benefit from our solution, which improves content freshness by up to 7 times, especially under heavy user request traffic and long network latency conditions. Our approach also provides better scalability and reduces response times by up to 70% in the experiments.
Keywords: database-driven web applications, dynamic content, freshness, response time, network latency, web acceleration

CDNs and caching

Evaluating a new approach to strong web cache consistency with snapshots of collected content BIBAKFull-Text 599-608
  Mikhail Mikhailov; Craig E. Wills
The problem of Web cache consistency continues to be an important one. Current Web caches use heuristic-based policies for determining the freshness of cached objects, often forcing content providers to unnecessarily mark their content as uncacheable simply to retain control over it. Server-driven invalidation has been proposed as a mechanism for providing strong cache consistency for Web objects, but it requires servers to maintain per-client state even for infrequently changing objects. We propose an alternative approach to strong cache consistency, called MONARCH, which does not require servers to maintain per-client state. In this work we focus on a new approach for evaluation of MONARCH in comparison with current practice and other cache consistency policies. This approach uses snapshots of content collected from real Web sites as input to a simulator. Results of the evaluation show MONARCH generates little more request traffic than an optimal cache coherency policy.
Keywords: cache consistency, change characteristics, collected content, object composition, object relationships, server invalidation, web caching
Scalable techniques for memory-efficient CDN simulations BIBAKFull-Text 609-618
  Purushottam Kulkarni; Prashant Shenoy; Weibo Gong
Since CDN simulations are known to be highly memory-intensive, in this paper, we argue the need for reducing the memory requirements of such simulations. We propose a novel memory-efficient data structure that stores cache state for a small subset of popular objects accurately and uses approximations for storing the state for the remaining objects. Since popular objects receive a large fraction of the requests while less frequently accessed objects consume much of the memory space, this approach yields large memory savings and reduces errors. We use bloom filters to store approximate state and show that careful choice of parameters can substantially reduce the probability of errors due to approximations. We implement our techniques into a user library for constructing proxy caches in CDN simulators. Our experimental results show up to an order of magnitude reduction in memory requirements of CDN simulations, while incurring a 5-10% error.
Keywords: approximate data structures, content distribution networks, simulation, web proxy cache
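A minimal sketch of the hybrid structure described above (hypothetical parameters; the real simulator tracks per-proxy cache state): exact membership for a known set of popular objects, a Bloom filter for everything else:
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 16, hashes=4):
            self.size, self.hashes, self.bits = size_bits, hashes, bytearray(size_bits // 8)
        def _positions(self, key):
            for i in range(self.hashes):
                h = hashlib.sha1(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:4], "big") % self.size
        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)
        def __contains__(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    class CacheState:
        """Exact state for popular objects, approximate (Bloom filter) state for the rest."""
        def __init__(self, popular_urls):
            self.popular = set(popular_urls)     # assumed known, e.g. from request counts
            self.exact = set()                   # exact membership for popular objects
            self.approx = BloomFilter()          # approximate membership for the long tail
        def insert(self, url):
            (self.exact.add if url in self.popular else self.approx.add)(url)
        def contains(self, url):
            return url in self.exact if url in self.popular else url in self.approx

    state = CacheState(popular_urls={"/index.html"})
    state.insert("/index.html"); state.insert("/obscure/page42.html")
    print(state.contains("/index.html"), state.contains("/obscure/page42.html"))   # True True
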
Value-based web caching BIBAKFull-Text 619-628
  Sean C. Rhea; Kevin Liang; Eric Brewer
Despite traditional web caching techniques, redundant data is often transferred over HTTP links. These redundant transfers result from both resource modification and aliasing. Resource modification causes the data represented by a single URI to change; often, in transferring the new data, some old data is retransmitted. Aliasing, in contrast, occurs when the same data is named by multiple URIs, often in the context of dynamic or advertising content. Traditional web caching techniques index data by its name and thus often fail to recognize and take advantage of aliasing.
Keywords: HTTP, WWW, aliasing, caching, duplicate suppression, dynamic content, hypertext transfer protocol, privacy, proxy, redundant transfers, resource modification, scalability, world wide web
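A toy sketch of the value-based idea (hypothetical block size and protocol; in the actual scheme sender and receiver keep per-client digest tables in sync): cached blocks are indexed by a digest of their content, so bytes already seen under any URI can be replaced by short references on the wire:
    import hashlib

    BLOCK = 2048   # hypothetical block size

    class ValueCache:
        """Index transferred data by content digest so aliased or repeated bytes are not resent."""

        def __init__(self):
            self.by_digest = {}                  # digest -> block of bytes

        def encode(self, body: bytes):
            """Sender side: replace already-known blocks by digest references.
            (The receiver keeps a mirror table and reassembles the body from it.)"""
            out = []
            for i in range(0, len(body), BLOCK):
                block = body[i:i + BLOCK]
                d = hashlib.sha1(block).hexdigest()
                out.append(("ref", d) if d in self.by_digest else ("data", d, block))
                self.by_digest.setdefault(d, block)
            return out

    cache = ValueCache()
    body = b"A" * 3000
    first = cache.encode(body)    # first transfer: blocks go out as data
    second = cache.encode(body)   # same bytes under another URI: only digest references go out
    print(first[0][0], second[0][0])   # data ref
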

Protocols

An XPath-based preference language for P3P BIBAKFull-Text 629-639
  Rakesh Agrawal; Jerry Kiernan; Ramakrishnan Srikant; Yirong Xu
The Platform for Privacy Preferences (P3P) is the most significant effort currently underway to enable web users to gain control over their private information. The designers of P3P simultaneously designed a preference language called APPEL to allow users to express their privacy preferences, thus enabling automatic matching of privacy preferences against P3P policies. Unfortunately, subtle interactions between P3P and APPEL result in serious problems when using APPEL: users can only directly specify what is unacceptable in a policy, not what is acceptable; simple preferences are hard to express; and writing APPEL preferences is error-prone. We show that these problems follow from a fundamental design choice made by APPEL, and cannot be solved without completely redesigning the language. Therefore we explore alternatives to APPEL that can overcome these problems. In particular, we show that XPath serves quite nicely as a preference language and solves all the above problems. We identify the minimal subset of XPath that is needed, thus allowing matching programs to potentially use a smaller memory footprint. We also give an APPEL-to-XPath translator, which shows that XPath is as expressive as APPEL.
Keywords: APPEL, P3P, XPath, XPref, hippocratic databases, preference, privacy-aware data management
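The following toy example (simplified P3P markup with no namespaces, and only the limited XPath subset that Python's xml.etree.ElementTree evaluates) illustrates the basic idea of stating a preference as an XPath pattern and matching it against a policy; it is not the paper's XPref language:
    import xml.etree.ElementTree as ET

    # Toy P3P policy fragment (simplified, no namespaces).
    policy_xml = """
    <POLICY>
      <STATEMENT>
        <PURPOSE><current/><telemarketing/></PURPOSE>
        <DATA-GROUP><DATA ref="#user.home-info.telecom.telephone"/></DATA-GROUP>
      </STATEMENT>
    </POLICY>
    """

    # Preference written as an XPath pattern:
    # "reject any policy with a statement whose purposes include telemarketing".
    unacceptable = ".//STATEMENT/PURPOSE/telemarketing"

    policy = ET.fromstring(policy_xml)
    if policy.findall(unacceptable):
        print("policy rejected: telemarketing purpose present")
    else:
        print("policy acceptable")
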
The Eigentrust algorithm for reputation management in P2P networks BIBAKFull-Text 640-651
  Sepandar D. Kamvar; Mario T. Schlosser; Hector Garcia-Molina
Peer-to-peer file-sharing networks are currently receiving much attention as a means of sharing and distributing information. However, as recent experience shows, the anonymous, open nature of these networks offers an almost ideal environment for the spread of self-replicating inauthentic files.
   We describe an algorithm to decrease the number of downloads of inauthentic files in a peer-to-peer file-sharing network that assigns each peer a unique global trust value, based on the peer's history of uploads. We present a distributed and secure method to compute global trust values, based on Power iteration. By having peers use these global trust values to choose the peers from whom they download, the network effectively identifies malicious peers and isolates them from the network.
   In simulations, this reputation system, called EigenTrust, has been shown to significantly decrease the number of inauthentic files on the network, even under a variety of conditions where malicious peers cooperate in an attempt to deliberately subvert the system.
Keywords: distributed eigenvector computation, peer-to-peer, reputation
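The core computation can be sketched as a centralized power iteration over normalized local trust values (hypothetical numbers; the paper's contribution includes the distributed, secure version of this computation):
    import numpy as np

    # Local trust values s_ij: net satisfactory downloads peer i had from peer j (hypothetical).
    S = np.array([[0, 4, 1],
                  [5, 0, 2],
                  [3, 0, 0]], dtype=float)

    # Normalize each row so c_ij = max(s_ij, 0) / sum_k max(s_ik, 0).
    C = np.maximum(S, 0)
    C = C / C.sum(axis=1, keepdims=True)

    P = np.array([1.0, 0.0, 0.0])        # distribution over pre-trusted peers
    a = 0.15                             # weight given to pre-trusted peers

    t = np.full(3, 1 / 3)                # start from the uniform distribution
    for _ in range(50):
        t = (1 - a) * C.T @ t + a * P    # basic EigenTrust-style iteration
    print(np.round(t, 3))                # converged global trust values, summing to 1
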
Similarity measure and instance selection for collaborative filtering BIBAKFull-Text 652-658
  Chun Zeng; Chun-Xiao Xing; Li-Zhu Zhou
Collaborative filtering has been very successful in both research and applications such as information filtering and E-commerce. The k-Nearest Neighbor (KNN) method is a popular way to realize it. Its key technique is to find the k nearest neighbors of a given user in order to predict his interests. However, this method suffers from two fundamental problems: sparsity and scalability. In this paper, we present our solutions to these two problems. We adopt two techniques: a matrix conversion method for similarity measurement and an instance selection method. We then present an improved collaborative filtering algorithm based on these two methods. In contrast with existing collaborative filtering algorithms, our method shows satisfactory accuracy and performance.
Keywords: collaborative filtering, instance selection, similarity measure
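For context, a bare-bones sketch of the KNN prediction step that the paper improves on (toy ratings and plain cosine similarity; the paper's matrix conversion and instance selection are not shown):
    import numpy as np

    # Toy user-item ratings matrix (0 = unrated).
    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)

    def cosine(u, v):
        mask = (u > 0) & (v > 0)                      # co-rated items only
        if not mask.any():
            return 0.0
        return float(u[mask] @ v[mask] / (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

    def predict(R, user, item, k=2):
        sims = [(cosine(R[user], R[v]), v) for v in range(len(R)) if v != user and R[v, item] > 0]
        top = sorted(sims, reverse=True)[:k]          # k nearest neighbors who rated the item
        if not top or sum(s for s, _ in top) == 0:
            return 0.0
        return sum(s * R[v, item] for s, v in top) / sum(s for s, _ in top)

    print(round(predict(R, user=1, item=1, k=2), 2))  # predicted rating for user 1 on item 1
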

Web crawling and measurement

Monitoring the dynamic web to respond to continuous queries BIBAKFull-Text 659-668
  Sandeep Pandey; Krithi Ramamritham; Soumen Chakrabarti
Continuous queries are queries for which responses given to users must be continuously updated as the sources of interest get updated. Such queries occur, for instance, during on-line decision making, e.g., traffic flow control, weather monitoring, etc. The problem of keeping the responses current reduces to the problem of deciding how often to visit a source to determine if and how it has been modified, in order to update earlier responses accordingly. On the surface, this seems similar to the crawling problem, since crawlers attempt to keep indexes up-to-date as pages change and users pose search queries. We show that this is not the case, both due to the inherent differences between the nature of the two problems as well as the performance metric. We propose, develop and evaluate a novel multi-phase Continuous Adaptive Monitoring (CAM) solution to the problem of maintaining the currency of query results. The important phases are: the tracking phase, in which changes to an initially identified set of relevant pages are tracked; from the observed change characteristics of these pages, a probabilistic model of their change behavior is formulated and weights are assigned to pages to denote their importance for the current queries. In the next phase, the resource allocation phase, the resources needed to continuously monitor these pages for changes are allocated based on these statistics. Given these resource allocations, the scheduling phase produces an optimal achievable schedule for the monitoring tasks. An experimental evaluation of our approach compared to prior approaches for crawling dynamic web pages shows the effectiveness of CAM for monitoring dynamic changes. For example, by monitoring just 5% of the page changes, CAM is able to return 90% of the changed information to the users. The experiments also produce some interesting observations pertaining to the differences between the two problems of crawling -- to build an index -- and of change tracking -- to respond to continuous queries.
Keywords: allocation policies, continuous queries
A large-scale study of the evolution of web pages BIBAKFull-Text 669-678
  Dennis Fetterly; Mark Manasse; Marc Najork; Janet Wiener
How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them.
   One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily.
   This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages.
   After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.
   This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
Keywords: degree of change, rate of change, web characterization, web evolution, web pages
Efficient URL caching for world wide web crawling BIBAKFull-Text 679-689
  Andrei Z. Broder; Marc Najork; Janet L. Wiener
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
   A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests.
   Our main conclusion is that caching is very effective -- in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
Keywords: URL caching, caching, crawling, distributed crawlers, web crawlers, web graph models
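One of the practical policies compared, LRU, amounts to a small sketch (hypothetical capacity; in a real crawler a miss still has to be checked against the full disk-resident table):
    from collections import OrderedDict

    class SeenURLCache:
        """Bounded LRU cache over (hashes of) already-seen URLs."""
        def __init__(self, capacity=50_000):
            self.capacity, self.cache = capacity, OrderedDict()
        def seen_before(self, url):
            key = hash(url)                       # cache fingerprints rather than full URLs
            if key in self.cache:
                self.cache.move_to_end(key)       # refresh recency on a hit
                return True
            self.cache[key] = None                # record the miss
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict the least recently used entry
            return False                          # caller must still consult the full table

    cache = SeenURLCache(capacity=2)
    for url in ["a", "b", "a", "c", "b"]:
        print(url, cache.seen_before(url))        # a False, b False, a True, c False, b False
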

Using the semantic web

ρ-Queries: enabling querying for semantic associations on the semantic web BIBAKFull-Text 690-699
  Kemafor Anyanwu; Amit Sheth
This paper presents the notion of Semantic Associations as complex relationships between resource entities. These relationships capture both the connectivity of entities and the similarity of entities based on a specific notion of similarity called ρ-isomorphism. It formalizes these notions for the RDF data model by introducing the notion of a Property Sequence as a type. In the context of a graph model such as that for RDF, Semantic Associations amount to certain graph signatures. Specifically, they refer to sequences (i.e. directed paths), here called Property Sequences, between entities, networks of Property Sequences (i.e. undirected paths), or subgraphs of ρ-isomorphic Property Sequences.
   The ability to query about the existence of such relationships is fundamental to tasks in analytical domains such as national security and business intelligence, where tasks often focus on finding complex yet meaningful and obscured relationships between entities. However, support for such queries is lacking in contemporary query systems, including those for RDF.
   This paper discusses how querying for Semantic Associations might be enabled on the Semantic Web through the use of an operator ρ. It also discusses two approaches for processing ρ-queries on available persistent RDF stores and memory-resident RDF data graphs, thereby building on current RDF query languages.
Keywords: RDF, complex data relationships, graph traversals, semantic associations, semantic web querying
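The simplest flavor of Semantic Association, a directed Property Sequence between two entities, can be illustrated with a breadth-first search over a toy RDF graph (hypothetical triples; the ρ operator and ρ-isomorphism checks in the paper go well beyond this):
    from collections import deque

    # Toy RDF graph: (subject, property, object) triples.
    triples = [
        ("ex:Alice", "ex:worksFor", "ex:AcmeCorp"),
        ("ex:AcmeCorp", "ex:locatedIn", "ex:Budapest"),
        ("ex:Bob", "ex:attended", "ex:WWW2003"),
        ("ex:WWW2003", "ex:heldIn", "ex:Budapest"),
    ]

    def rho_path(src, dst):
        """Return one property sequence (directed path) connecting src to dst, if any."""
        adj = {}
        for s, p, o in triples:
            adj.setdefault(s, []).append((p, o))
        queue, visited = deque([(src, [])]), {src}
        while queue:
            node, path = queue.popleft()
            if node == dst:
                return path
            for prop, nxt in adj.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, path + [prop]))
        return None

    print(rho_path("ex:Alice", "ex:Budapest"))   # ['ex:worksFor', 'ex:locatedIn']
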
Semantic search BIBAKFull-Text 700-709
  R. Guha; Rob McCool; Eric Miller
Activities such as Web Services and the Semantic Web are working to create a web of distributed machine understandable data. In this paper we present an application called 'Semantic Search' which is built on these supporting technologies and is designed to improve traditional web searching. We provide an overview of TAP, the application framework upon which the Semantic Search is built. We describe two implemented Semantic Search systems which, based on the denotation of the search query, augment traditional search results with relevant data aggregated from distributed sources. We also discuss some general issues related to searching and the Semantic Web and outline how an understanding of the semantics of the search terms can be used to provide better results.
Keywords: search, semantic web
Agent-based semantic web services BIBAKFull-Text 710-717
  Nicholas Gibbins; Stephen Harris; Nigel Shadbolt
The Web Services world consists of loosely-coupled distributed systems which adapt to ad-hoc changes by the use of service descriptions that enable opportunistic service discovery. At present, these service descriptions are semantically impoverished, being concerned with describing the functional signature of the services rather than characterising their meaning. In the Semantic Web community, the DAML Services effort attempts to rectify this by providing a more expressive way of describing Web services using ontologies. However, this approach does not separate the domain-neutral communicative intent of a message (considered in terms of speech acts) from its domain-specific content, unlike similar developments from the multi-agent systems community.
   In this paper, we describe our experiences of designing and building an ontologically motivated Web Services system for situational awareness and information triage in a simulated humanitarian aid scenario. In particular, we discuss the merits of using techniques from the multi-agent systems community for separating the intentional force of messages from their content, and the implementation of these techniques within the DAML Services model.
Keywords: DAML-S, agent communication languages, ontologies, semantic web, web services

Improving the browsing experience

A framework for coordinated multi-modal browsing with multiple clients BIBAKFull-Text 718-726
  Alistair Coles; Eric Deliot; Tom Melamed; Kevin Lansard
As users acquire or gain access to an increasingly diverse range of web access clients, web applications are adapting their user interfaces to support multiple modalities on multiple client types. User experiences can be enhanced by clients with differing capabilities combining to provide a distributed user interface to applications. Indeed, users will be frustrated if their interaction with applications is limited to one client at a time.
   This paper discusses the requirements for coordinating web interaction across an aggregation of clients. We present a framework for multi-device browsing that provides both coordinated navigation between web resources and coordinated interaction between variants, or representations, of those resources once instantiated in the clients. The framework protects the application from some of the complexities of client aggregation.
   We show how a small number of enhancements to the XForms and XML Events vocabularies can facilitate coordination between clients and provide an appropriate level of control to applications. We also describe a novel proxy which consolidates HTTP requests from aggregations of clients and reduces the burden that multi-client browsing places on the application.
Keywords: multi-modal browsing, web proxy
A comparative web browser (CWB) for browsing and comparing web pages BIBAKFull-Text 727-735
  Akiyo Nadamoto; Katsumi Tanaka
In this paper, we propose a new type of Web browser, called the Comparative Web Browser (CWB), which concurrently presents multiple Web pages in a way that enables the content of the Web pages to be automatically synchronized. The ability to view multiple Web pages at one time is useful when we wish to make a comparison on the Web, such as when we compare similar products or news articles from different newspapers. The CWB is characterized by (1) automatic content-based retrieval of passages from another Web page based on a passage of the Web page the user is reading, and (2) automatic transformation of a user's behavior (scrolling, clicking, or moving backward or forward) on a Web page into a series of behaviors on the other Web pages. The CWB tries to concurrently present "similar" passages from different Web pages, and for this purpose our CWB automatically navigates Web pages that contain passages similar to those of the initial Web page. Furthermore, we propose an enhancement to the CWB, which enables it to use linkage information to find related documents based on link structure.
Keywords: comparison, content synchronization, passage retrieval, web browser
Comparing link marker visualization techniques: changes in reading behavior BIBAKFull-Text 736-745
  Hartmut Obendorf; Harald Weinreich
Links are one of the most important means for navigation in the World Wide Web. However, the visualization of and interaction with Web links have scarcely been explored, although links have severe implications for the appearance and usability of Web pages and the World Wide Web as such.
   This paper presents two studies giving first insights into the effects of link visualization techniques on reading habits and performance. The first user study compares different highlighting techniques for link markers and evaluates their effect on reading performance and user acceptance. The second study examines links-on-demand, links that appear only when a dedicated key is pressed, and discusses their possible effects on reading and browsing habits.
   The findings of the conducted studies imply that the standard appearance of link markers has seriously underestimated effects on the usability of Web pages. They can significantly reduce the readability of the text, and alternatives should be carefully considered for the design of future Web browsers.
Keywords: hypermedia, input interaction technologies, link marking techniques, links-on-demand, usability survey