
Proceedings of the 2004 International Conference on the World Wide Web

Fullname: Proceedings of the 13th International Conference on World Wide Web
Editors: Stuart Feldman; Mike Uretsky; Marc Najork; Craig Wills
Location: New York
Dates: 2004-May-17 to 2004-May-20
Volume: 1
Publisher: ACM
Standard No: ISBN: 1-58113-844-X; ACM DL: Table of Contents; hcibib: WWW04-1
Papers: 74
Pages: 738
Links: Conference Home Page
  1. WWW 2004-05-17 Volume 1
    1. Search engineering 1
    2. Security and privacy
    3. Usability and accessibility
    4. Information extraction
    5. Mobility
    6. XML
    7. Learning classifiers
    8. Web site engineering
    9. Semantic interfaces and OWL tools
    10. Server performance and scalability
    11. Link analysis
    12. Optimizing encoding
    13. Semantic web applications
    14. Reputation networks
    15. Versioning and fragmentation
    16. Semantic annotation and integration
    17. Mining new media
    18. Workload analysis
    19. Semantic web services
    20. Search engineering 2
    21. Infrastructure for implementation
    22. Distributed semantic query
    23. Query result processing
    24. Web site analysis and customization
    25. Semantic web foundations

WWW 2004-05-17 Volume 1

Search engineering 1

What's new on the web?: the evolution of the web from a search engine perspective BIBAKFull-Text 1-12
  Alexandros Ntoulas; Junghoo Cho; Christopher Olston
We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose, we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.
   Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time, we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
Keywords: change prediction, degree of change, link structure evolution, rate of change, search engines, web characterization, web evolution, web pages
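The degree-of-change measure named in the abstract, TF.IDF-weighted cosine distance between successive snapshots of a page, can be illustrated with a minimal sketch (not the authors' code; the tokenization and the source of the IDF table are assumptions):

```python
import math
from collections import Counter

def tfidf_cosine_distance(tokens_a, tokens_b, idf):
    """Cosine distance between two page snapshots under TF.IDF weighting.

    tokens_a, tokens_b: token lists from the two snapshots.
    idf: term -> inverse document frequency, estimated from some
         background corpus (an assumption in this sketch).
    """
    def weights(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    wa, wb = weights(tokens_a), weights(tokens_b)
    dot = sum(wa[t] * wb.get(t, 0.0) for t in wa)
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty snapshot as maximally changed
    return 1.0 - dot / (norm_a * norm_b)

# Two snapshots that differ in one word have a small distance.
old = "acme widget price list spring catalog".split()
new = "acme widget price list summer catalog".split()
print(tfidf_cosine_distance(old, new, {w: 1.0 for w in set(old + new)}))
```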
Understanding user goals in web search BIBAKFull-Text 13-19
  Daniel E. Rose; Danny Levinson
Previous work on understanding user web search behavior has focused on how people search and what they are searching for, but not why they are searching. In this paper, we describe a framework for understanding the underlying goals of user searches, and our experience in using the framework to manually classify queries from a web search engine. Our analysis suggests that so-called "navigational" searches are less prevalent than generally believed, while a previously unexplored "resource-seeking" goal may account for a large fraction of web searches. We also illustrate how this knowledge of user search goals might be used to improve future web search engines.
Keywords: information retrieval, query classification, user behavior, user goals, web search
Impact of search engines on page popularity BIBAKFull-Text 20-29
  Junghoo Cho; Sourashis Roy
Recent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines have on the popularity evolution of Web pages. For example, given that search engines return currently "popular" pages at the top of search results, are we somehow penalizing newly created pages that are not very well known yet? Are popular pages getting even more popular and new pages completely ignored? We first show that this unfortunate trend indeed exists on the Web through an experimental study based on real Web data. We then analytically estimate how much longer it takes for a new page to attract a large number of Web users when search engines return only popular pages at the top of search results. Our result shows that search engines can have an immensely worrisome impact on the discovery of new Web pages.
Keywords: change in PageRank, PageRank, random surfer model, search engine's impact, web evolution
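The popularity model referred to in the keywords is PageRank under the random-surfer model. A generic power-iteration sketch (illustrative only; the damping factor and iteration count are assumptions, not values from the paper):

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Random-surfer PageRank by power iteration.

    out_links: dict mapping each page to the list of pages it links to.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:  # dangling page: redistribute its rank uniformly
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```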

Security and privacy

Anti-aliasing on the web BIBAKFull-Text 30-39
  Jasmine Novak; Prabhakar Raghavan; Andrew Tomkins
It is increasingly common for users to interact with the web using a number of different aliases. This trend is a double-edged sword. On one hand, it is a fundamental building block in approaches to online privacy. On the other hand, there are economic and social consequences to allowing each user an arbitrary number of free aliases. Thus, there is great interest in understanding the fundamental issues in obscuring the identities behind aliases.
   However, most work in the area has focused on linking aliases through analysis of lower-level properties of interactions such as network routes. We show that aliases that actively post text on the web can be linked together through analysis of that text. We study a large number of users posting on bulletin boards, and develop algorithms to anti-alias those users: we can with a high degree of success identify when two aliases belong to the same individual.
   Our results show that such techniques are surprisingly effective, leading us to conclude that guaranteeing privacy among aliases that post actively requires mechanisms that do not yet exist.
Keywords: alias detection, aliases, bulletin boards, personas, privacy, pseudonyms
Securing web application code by static analysis and runtime protection BIBAKFull-Text 40-52
  Yao-Wen Huang; Fang Yu; Christian Hang; Chung-Hung Tsai; Der-Tsai Lee; Sy-Yen Kuo
Security remains a major roadblock to universal acceptance of the Web for many kinds of transactions, especially since the recent sharp increase in remotely exploitable vulnerabilities has been attributed to Web application bugs. Many verification tools are discovering previously unknown vulnerabilities in legacy C programs, raising hopes that the same success can be achieved with Web applications. In this paper, we describe a sound and holistic approach to ensuring Web application security. Viewing Web application vulnerabilities as a secure information flow problem, we created a lattice-based static analysis algorithm derived from type systems and typestate, and addressed its soundness. During the analysis, sections of code considered vulnerable are instrumented with runtime guards, thus securing Web applications in the absence of user intervention. With sufficient annotations, runtime overhead can be reduced to zero.
   We also created a tool named WebSSARI (Web application Security by Static Analysis and Runtime Inspection) to test our algorithm, and used it to verify 230 open-source Web application projects on SourceForge.net, which were selected to represent projects of different maturity, popularity, and scale. Of these, 69 contained vulnerabilities. After notifying the developers, 38 acknowledged our findings and stated their plans to provide patches. Our statistics also show that static analysis reduced potential runtime overhead by 98.4%.
Keywords: information flow, noninterference, program security, security vulnerabilities, type systems, verification, web application security
Trust-serv: model-driven lifecycle management of trust negotiation policies for web services BIBAKFull-Text 53-62
  Halvard Skogsrud; Boualem Benatallah; Fabio Casati
A scalable approach to trust negotiation is required in Web service environments that have large and dynamic requester populations. We introduce Trust-Serv, a model-driven trust negotiation framework for Web services. The framework employs a model for trust negotiation that is based on state machines, extended with security abstractions. Our policy model supports lifecycle management, an important trait in the dynamic environments that characterize Web services. In particular, we provide a set of change operations to modify policies, and migration strategies that permit ongoing negotiations to be migrated to new policies without being disrupted. Experimental results show the performance benefit of these strategies. The proposed approach has been implemented as a container-centric mechanism that is transparent to the Web services and to the developers of Web services, simplifying Web service development and management as well as enabling scalable deployments.
Keywords: conceptual modeling, lifecycle management, trust negotiation, web services

Usability and accessibility

SmartBack: supporting users in back navigation BIBAKFull-Text 63-71
  Natasa Milic-Frayling; Rachel Jones; Kerry Rodden; Gavin Smyth; Alan Blackwell; Ralph Sommerer
This paper presents the design and user evaluation of SmartBack, a feature that complements the standard Back button by enabling users to jump directly to key pages in their navigation session, making common navigation activities more efficient. Defining key pages was informed by the findings of a user study that involved detailed monitoring of Web usage and analysis of Web browsing in terms of navigation trails. The pages accessible through SmartBack are determined automatically based on the structure of the user's navigation trails or page association with specific user activities, such as search or browsing bookmarked sites. We discuss implementation decisions and present results of a usability study in which we deployed the SmartBack prototype and monitored usage for a month in both corporate and home settings. The results show that the feature brings qualitative improvement to the browsing experience of individuals who use it.
Keywords: back navigation, browsing, navigation, revisitation, usability study, web trails, web usage
Web accessibility: a broader view BIBAKFull-Text 72-79
  John T. Richards; Vicki L. Hanson
Web accessibility is an important goal. However, most approaches to its attainment are based on unrealistic economic models in which Web content developers are required to spend too much for which they receive too little. We believe this situation is due, in part, to the overly narrow definitions given both to those who stand to benefit from enhanced access to the Web and what is meant by this enhanced access. In this paper, we take a broader view, discussing a complementary approach that costs developers less and provides greater advantages to a larger community of users. While we have quite specific aims in our technical work, we hope it can also serve as an example of how the technical conversation regarding Web accessibility can move beyond the narrow confines of limited adaptations for small populations.
Keywords: standards, user interface, web accessibility
Hearsay: enabling audio browsing on hypertext content BIBAKFull-Text 80-89
  I. V. Ramakrishnan; Amanda Stent; Guizhen Yang
In this paper we present HearSay, a system for browsing hypertext Web documents via audio. The HearSay system is based on our novel approach to automatically creating audio browsable content from hypertext Web documents. It combines two key technologies: (1) automatic partitioning of Web documents through tightly coupled structural and semantic analysis, which transforms raw HTML documents into semantic structures so as to facilitate audio browsing; and (2) VoiceXML, an already standardized technology which we adopt to represent voice dialogs automatically created from the XML output of partitioning. This paper describes the software components of HearSay and presents an initial system evaluation.
Keywords: HTML, VoiceXML, World Wide Web, audio browser, semantic analysis, structural analysis, user interface

Information extraction

Unsupervised learning of soft patterns for generating definitions from online news BIBAKFull-Text 90-99
  Hang Cui; Min-Yen Kan; Tat-Seng Chua
Breaking news often contains timely definitions and descriptions of current terms, organizations and personalities. We utilize such web sources to construct definitions for such terms. Previous work has identified definitions using hand-crafted rules or supervised learning that constructs rigid, hard text patterns. In contrast, we demonstrate a new approach that uses flexible, soft matching patterns to characterize definition sentences. Our soft patterns are able to effectively accommodate the diversity of definition sentence structure exhibited in news. We use pseudo-relevance feedback to automatically label sentences for use in soft pattern generation. The application of our unsupervised method significantly improves baseline systems on both the standardized TREC corpus as well as crawled online news articles by 27% and 30%, respectively, in terms of F measure. When applied to a state-of-the-art definition generation system recently fielded in the TREC 2003 definitional question answering task, it improves the performance by 14%.
Keywords: definition generation, definitional question answering, pseudo-relevance feedback, soft patterns, unsupervised learning
Web-scale information extraction in knowitall: (preliminary results) BIBAKFull-Text 100-110
  Oren Etzioni; Michael Cafarella; Doug Downey; Stanley Kok; Ana-Maria Popescu; Tal Shaked; Stephen Soderland; Daniel S. Weld; Alexander Yates
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner.
   The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.
Keywords: information extraction, mutual information, pmi, search
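The "pmi" keyword refers to pointwise mutual information estimated from search-engine hit counts, which KnowItAll uses when assigning a probability to each extracted fact. A toy sketch of the basic hit-count ratio (the hit_count callable is a stand-in assumption; a real system would query a search-engine API):

```python
def pmi_score(hit_count, instance, discriminator):
    """PMI-style association between a candidate fact and a discriminator phrase.

    hit_count: callable returning the number of search results for a query
               (a placeholder here; in practice a search-engine API).
    instance: e.g. "Paris"; discriminator: e.g. '"cities such as"'.
    """
    together = hit_count(f'{discriminator} {instance}')
    alone = hit_count(instance)
    return together / alone if alone else 0.0

# Toy usage with a fake hit-count table standing in for a search engine.
fake_hits = {'"cities such as" Paris': 1200, "Paris": 90000}
print(pmi_score(lambda q: fake_hits.get(q, 0), "Paris", '"cities such as"'))
```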
Is question answering an acquired skill? BIBAKFull-Text 111-120
  Ganesh Ramakrishnan; Soumen Chakrabarti; Deepa Paranjpe; Pushpak Bhattacharya
We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. We built our system in only a few person-months using off-the-shelf components: a part-of-speech tagger, a shallow parser, a lexical network, and a few well-known supervised learning algorithms. In contrast, many of the top TREC QA systems are large group efforts, using customized ontologies, question classifiers, and highly tuned ranking functions. Our ease of deployment arises from using generic, trainable algorithms that exploit simple feature extractors on QA pairs. With TREC QA data, our system achieves mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and generalizes from one corpus to another. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises selectors: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the answer type, and are expected to be replaced in the answer passage by tokens which specialize or instantiate the desired answer type. Selectors are like constants in where-clauses in relational queries, and answer types are like column names. We present new algorithms for locating selectors and answer type clues and using them in scoring passages with respect to a question.
Keywords: machine learning, question answering

Mobility

Session level techniques for improving web browsing performance on wireless links BIBAKFull-Text 121-130
  Pablo Rodriguez; Sarit Mukherjee; Sampath Rangarajan
Recent observations through experiments that we have performed in current third generation wireless networks have revealed that the achieved throughput over wireless links varies widely depending on the application. In particular, the throughputs achieved by file transfer (FTP) and web browsing (HTTP) applications are quite different. The throughput achieved over an HTTP session is much lower than that achieved over an FTP session. The reason for the lower HTTP throughput is that the HTTP protocol is affected by the large Round-Trip Time (RTT) across wireless links. HTTP transfers require multiple TCP connections and DNS lookups before an HTTP page can be displayed. Each TCP connection requires several RTTs to fully open the TCP send window, and each DNS lookup requires several RTTs to resolve the domain-name-to-IP mapping. These TCP/DNS RTTs significantly degrade the performance of HTTP over wireless links. To overcome these problems, we have developed session level optimization techniques to enhance HTTP download mechanisms. These techniques (a) minimize the number of DNS lookups over the wireless link and (b) minimize the number of TCP connections opened by the browser. These optimizations bridge the mismatch caused by wireless links between application-level protocols (such as HTTP) and transport-level protocols (such as TCP). Our solutions do not require any client-side software and can be deployed transparently on a service provider network to provide a 30-50% decrease in end-to-end user-perceived latency and a 50-100% increase in data throughput across wireless links for HTTP sessions.
Keywords: optimizations, web, wireless
Flexible on-device service object replication with replets BIBAKFull-Text 131-142
  Dong Zhou; Nayeem Islam; Ali Ismael
An increasingly large number of Web applications employ service objects such as Servlets to generate dynamic and personalized content. Existing caching infrastructures are not well suited for caching such content in mobile environments because of disconnection and weak connection. One possible approach to this problem is to replicate Web-related application logic to client devices. The challenges to this approach are to deal with client devices that exhibit huge divergence in resource availabilities, to support applications that have different data sharing and coherency requirements, and to accommodate the same application under different deployment environments.
   The Replet system targets these challenges. It uses client, server and application capability and preference information (CPI) to direct the replication of service objects to client devices: from the selection of a device for replication and populating the device with client-specific data, to choosing an appropriate replica to serve a given request and maintaining the desired state consistency among replicas. The Replet system exploits on-device replication to enable client-, server- and application-specific cost metrics for replica invocation and synchronization. We have implemented a prototype in the context of Servlet-based Web applications. Our experiment and simulation results demonstrate the viability and significant benefits of CPI-driven on-device service object replication.
Keywords: capability, preference, reconfiguration, replication, service, synchronization
Improving web browsing performance on wireless pdas using thin-client computing BIBAKFull-Text 143-154
  Albert M. Lai; Jason Nieh; Bhagyashree Bohra; Vijayarka Nandikonda; Abhishek P. Surana; Suchita Varshneya
Web applications are becoming increasingly popular for mobile wireless PDAs. However, web browsing on these systems can be quite slow. An alternative approach is handheld thin-client computing, in which the web browser and associated application logic run on a server, which then sends simple screen updates to the PDA for display. To assess the viability of this thin-client approach, we compare the web browsing performance of thin clients against fat clients that run the web browser locally on a PDA. Our results show that thin clients can provide better web browsing performance compared to fat clients, both in terms of speed and ability to correctly display web content. Surprisingly, thin clients are faster even when having to send more data over the network. We characterize and analyze different design choices in various thin-client systems and explain why these approaches can yield superior web browsing performance on mobile wireless PDAs.
Keywords: thin-client computing, web performance, wireless and mobility

XML

XVM: a bridge between XML data and its behavior BIBAKFull-Text 155-163
  Quanzhong Li; Michelle Y. Kim; Edward So; Steve Wood
XML has become one of the core technologies for contemporary business applications, especially web-based applications. To facilitate processing of diverse XML data, we propose an extensible, integrated XML processing architecture, the XML Virtual Machine (XVM), which connects XML data with their behaviors. At the same time, the XVM is also a framework for developing and deploying XML-based applications. Using component-based techniques, the XVM supports arbitrary granularity and provides a high degree of modularity and reusability. XVM components are dynamically loaded and composed during XML data processing. Using the XVM, both client-side and server-side XML applications can be developed and deployed in an integrated way. We also present an XML application container built on top of the XVM along with several sample applications to demonstrate the applicability of the XVM framework.
Keywords: XML, XML applications, XML processing, XVM, components, web applications
SchemaPath, a minimal extension to XML schema for conditional constraints BIBAKFull-Text 164-174
  Claudio Sacerdoti Coen; Paolo Marinelli; Fabio Vitali
In the past few years, a number of constraint languages for XML documents have been proposed. They are cumulatively called schema languages or validation languages and they comprise, among others, DTD, XML Schema, RELAX NG, Schematron, DSD, xlinkit. One major point of discrimination among schema languages is the support of co-constraints, or co-occurrence constraints, e.g., requiring that attribute A is present if and only if attribute B is (or is not) present in the same element. Although there is no way in XML Schema to express these requirements, they are in fact frequently used in many XML document types, usually only expressed in plain human-readable text, and validated by means of special code modules by the relevant applications. In this paper we propose SchemaPath, a light extension of XML Schema to handle conditional constraints on XML documents. Two new constructs have been added to XML Schema: conditions -- based on XPath patterns -- on type assignments for elements and attributes; and a new simple type, xsd:error, for the direct expression of negative constraints (e.g., it is prohibited for attribute A to be present if attribute B is also present). A proof-of-concept implementation is provided. A Web interface is publicly accessible for experiments and assessments of the real expressiveness of the proposed extension.
Keywords: co-constraints, schema languages, SchemaPath, XML
Composite events for XML BIBAKFull-Text 175-183
  Martin Bernauer; Gerti Kappel; Gerhard Kramler
Recently, active behavior has received attention in the XML field as a means to react automatically to events as they occur. Aside from proprietary approaches for enriching XML with active behavior, the W3C standardized the Document Object Model (DOM) Event Module for the detection of events in XML documents. When using any of these approaches, however, it is often impossible to decide which event to react upon because not a single event but a combination of multiple events, i.e., a composite event, determines a situation to react upon. The paper presents the first approach for detecting composite events in XML documents by addressing the peculiarities of XML events which are caused by their hierarchical order in addition to their temporal order. It also provides for the detection of satisfied multiplicity constraints defined by XML schemas. Thereby the approach enables applications operating on XML documents to react to composite events which have richer semantics.
Keywords: active behavior, composite event, event algebra, event-condition-action rule, XML

Learning classifiers

LiveClassifier: creating hierarchical text classifiers through web corpora BIBAKFull-Text 184-192
  Chien-Chung Huang; Shui-Lung Chuang; Lee-Feng Chien
Many Web information services utilize techniques of information extraction (IE) to collect important facts from the Web. To create more advanced services, one possible method is to discover thematic information from the collected facts through text classification. However, most conventional text classification techniques rely on manually labelled corpora and are thus ill-suited to cooperate with Web information services with open domains. In this work, we present a system named LiveClassifier that can automatically train classifiers through Web corpora based on user-defined topic hierarchies. Due to its flexibility and convenience, LiveClassifier can be easily adapted for various purposes. New Web information services can be created to fully exploit it; human users can use it to create classifiers for their personal applications. The effectiveness of classifiers created by LiveClassifier is well supported by empirical evidence.
Keywords: text classification, topic hierarchy, web mining
Using urls and table layout for web classification tasks BIBAKFull-Text 193-202
  L. K. Shih; D. R. Karger
We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements like word counts, they are tree-structured -- describing the position of the item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.
Keywords: classification, news recommendation, tree structures, web applications
Learning block importance models for web pages BIBAKFull-Text 203-211
  Ruihua Song; Haifeng Liu; Ji-Rong Wen; Wei-Ying Ma
Previous work shows that a web page can be partitioned into multiple segments or blocks, and often the importance of those blocks in a page is not equivalent. Also, it has been proven that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. However, no uniform approach and model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. We define the block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model to assign importance to different segments in the web page. In our experiments, the best model can achieve the performance with Micro-F1 79% and Micro-Accuracy 85.9%, which is quite close to a person's view.
Keywords: block importance model, classification, page segmentation, web mining
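The learning setup described in the abstract, a feature vector of spatial and content features per block fed to a standard learner, might look roughly like the following sketch (the block dictionary keys and the choice of an SVM are assumptions, not details from the paper):

```python
from sklearn.svm import SVC  # one plausible learner; an assumption, not the paper's choice

def block_features(block):
    """Spatial and content features for one segmented block.

    `block` is assumed to be a dict produced by a page-segmentation step,
    e.g. {"x": ..., "y": ..., "width": ..., "height": ...,
          "n_images": ..., "n_links": ...}.
    """
    return [block["x"], block["y"], block["width"], block["height"],
            block["n_images"], block["n_links"]]

def train_importance_model(blocks, importance_labels):
    """Fit a classifier mapping block feature vectors to importance levels."""
    X = [block_features(b) for b in blocks]
    return SVC().fit(X, importance_labels)
```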

Web site engineering

Staging transformations for multimodal web interaction management BIBAKFull-Text 212-223
  Michael Narayan; Christopher Williams; Saverio Perugini; Naren Ramakrishnan
Multimodal interfaces are becoming increasingly ubiquitous with the advent of mobile devices, accessibility considerations, and novel software technologies that combine diverse interaction media. In addition to improving access and delivery capabilities, such interfaces enable flexible and personalized dialogs with websites, much like a conversation between humans. In this paper, we present a software framework for multimodal web interaction management that supports mixed-initiative dialogs between users and websites. A mixed-initiative dialog is one where the user and the website take turns changing the flow of interaction. The framework supports the functional specification and realization of such dialogs using staging transformations -- a theory for representing and reasoning about dialogs based on partial input. It supports multiple interaction interfaces, and offers sessioning, caching, and co-ordination functions through the use of an interaction manager. Two case studies are presented to illustrate the promise of this approach.
Keywords: mixed-initiative interaction, out-of-turn interaction, partial evaluation, program transformations, web dialogs
Enforcing strict model-view separation in template engines BIBAKFull-Text 224-233
  Terence John Parr
The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. This situation is due mostly to a lack of formal definition of separation and fear that enforcing separation emasculates a template's power. I show that not only is strict separation a worthy design principle, but that we can enforce separation while providing a potent template engine. I demonstrate my StringTemplate engine, used to build jGuru.com and other commercial sites, at work solving some nontrivial generational tasks.
   My goal is to formalize the study of template engines, thus providing a common nomenclature, a means of classifying template generational power, and a way to leverage interesting results from formal language theory. I classify three types of restricted templates analogous to Chomsky's type 1..3 grammar classes and formally define separation, including the rules that embody separation.
   Because this paper provides a clear definition of model-view separation, template engine designers may no longer blindly claim enforcement of separation. Moreover, given theoretical arguments and empirical evidence, programmers no longer have an excuse to entangle model and view.
Keywords: model-view-controller, template engine, web application
A flexible framework for engineering "my" portals BIBAKFull-Text 234-243
  Fernando Bellas; Daniel Fernández; Abel Muiño
There exist many portal servers that support the construction of "My" portals, that is, portals that allow the user to have one or more personal pages composed of a number of personalizable services. The main drawback of current portal servers is their lack of generality and adaptability. This paper presents the design of MyPersonalizer, a J2EE-based framework for engineering "My" portals. The framework is structured according to the Model-View-Controller and Layers architectural patterns, providing generic, adaptable model and controller layers that implement the typical use cases of a "My" portal. MyPersonalizer allows for a good separation of roles in the development team: graphical designers (without programming skills) develop the portal view by writing JSP pages, while software engineers implement service plugins and specify the framework configuration.
Keywords: design patterns, j2ee, portal technology, web application frameworks and architectures, web engineering

Semantic interfaces and OWL tools

Semantic email BIBAKFull-Text 244-254
  Luke McDowell; Oren Etzioni; Alon Halevy; Henry Levy
This paper investigates how the vision of the Semantic Web can be carried over to the realm of email. We introduce a general notion of semantic email, in which an email message consists of an RDF query or update coupled with corresponding explanatory text. Semantic email opens the door to a wide range of automated, email-mediated applications with formally guaranteed properties. In particular, this paper introduces a broad class of semantic email processes. For example, consider the process of sending an email to a program committee asking who will attend the PC dinner, automatically collecting the responses, and tallying them up. We define both logical and decision-theoretic models where an email process is modeled as a set of updates to a data set on which we specify goals via certain constraints or utilities. We then describe a set of inference problems that arise while trying to satisfy these goals and analyze their computational tractability. In particular, we show that for the logical model it is possible to automatically infer, in polynomial time, which email responses are acceptable w.r.t. a set of constraints, and that for the decision-theoretic model it is possible to compute the optimal message-handling policy in polynomial time. Finally, we discuss our publicly available implementation of semantic email and outline research challenges in this realm.
Keywords: decision-theoretic, formal model, satisfiability, semantic web
How to make a semantic web browser BIBAKFull-Text 255-265
  D. A. Quan; R. Karger
Two important architectural choices underlie the success of the Web: numerous, independently operated servers speak a common protocol, and a single type of client, the Web browser, provides point-and-click access to the content and services on these decentralized servers. However, because HTML marries content and presentation into a single representation, end users are often stuck with inappropriate choices made by the Web site designer of how to work with and view the content. RDF metadata on the Semantic Web does not have this limitation: users can gain direct access to information and control over how it is presented. This principle forms the basis for our Semantic Web browser, an end-user application that automatically locates metadata and assembles point-and-click interfaces from a combination of relevant information, ontological specifications, and presentation knowledge, all described in RDF and retrieved dynamically from the Semantic Web. Because data and services are accessed directly through a standalone client and not through a central point of access (e.g., a portal), new content and services can be consumed as soon as they become available. In this way we take advantage of an important sociological force that encourages the production of new Semantic Web content while remaining faithful to the decentralized nature of the Web.
Keywords: bioinformatics, rdf, semantic web, user interface, web services
Parsing owl dl: trees or triples? BIBAKFull-Text 266-275
  Sean K. Bechhofer; Jeremy J. Carroll
The Web Ontology Language (OWL) defines three classes of documents: Lite, DL, and Full. All RDF/XML documents are OWL Full documents, some OWL Full documents are also OWL DL documents, and some OWL DL documents are also OWL Lite documents. This paper discusses parsing and species recognition -- that is, the process of determining whether a given document falls into the OWL Lite, DL or Full class. We describe two alternative approaches to this task, one based on abstract syntax trees, the other on RDF triples, and compare their key characteristics.
Keywords: owl, parsing, rdf, semantic web

Server performance and scalability

A method for transparent admission control and request scheduling in e-commerce web sites BIBAKFull-Text 276-286
  Sameh Elnikety; Erich Nahum; John Tracey; Willy Zwaenepoel
This paper presents a method for admission control and request scheduling for multiply-tiered e-commerce Web sites, achieving both stable behavior during overload and improved response times. Our method externally observes execution costs of requests online, distinguishing different request types, and performs overload protection and preferential scheduling using relatively simple measurements and a straightforward control mechanism. Unlike previous proposals, which require extensive changes to the server or operating system, our method requires no modifications to the host O.S., Web server, application server or database. Since our method is external, it can be implemented in a proxy. We present such an implementation, called Gatekeeper, and use it with standard software components on the Linux operating system. We evaluate the proxy using the industry standard TPC-W workload generator in a typical three-tiered e-commerce environment. We show consistent performance during overload and throughput increases of up to 10 percent. Response time improves by up to a factor of 14, with only a 15 percent penalty to large jobs.
Keywords: admission control, dynamic web content, load control, request scheduling, web servers
A smart hill-climbing algorithm for application server configuration BIBAKFull-Text 287-296
  Bowei Xi; Zhen Liu; Mukund Raghavachari; Cathy H. Xia; Li Zhang
The overwhelming success of the Web as a mechanism for facilitating information retrieval and for conducting business transactions has led to an increase in the deployment of complex enterprise applications. These applications typically run on Web Application Servers, which assume the burden of managing many tasks, such as concurrency, memory management, database access, etc., required by these applications. The performance of an Application Server depends heavily on appropriate configuration. Configuration is a difficult and error-prone task due to the large number of configuration parameters and complex interactions between them. We formulate the problem of finding an optimal configuration for a given application as a black-box optimization problem. We propose a smart hill-climbing algorithm using ideas of importance sampling and Latin Hypercube Sampling (LHS). The algorithm is efficient in both searching and random sampling. It consists of estimating a local function, and then, hill-climbing in the steepest descent direction. The algorithm also learns from past searches and restarts in a smart and selective fashion using the idea of importance sampling. We have carried out extensive experiments with an on-line brokerage application running in a WebSphere environment. Empirical results demonstrate that our algorithm is more efficient than and superior to traditional heuristic methods.
Keywords: automatic tuning, gradient method, importance sampling, simulated annealing, system configuration
Challenges and practices in deploying web acceleration solutions for distributed enterprise systems BIBAKFull-Text 297-308
  Wen-Syan Li; Wang-Pin Hsiung; Oliver Po; Koji Hino; Kasim Selcuk Candan; Divyakant Agrawal
For most Web-based applications, contents are created dynamically based on the current state of a business, such as product prices and inventory, stored in database systems. These applications demand personalized content and track user behavior while maintaining application integrity. Many such practices are not compatible with Web acceleration solutions. Consequently, although many web acceleration solutions have shown promising performance improvement and scalability, architecting and engineering distributed enterprise Web applications to utilize available content delivery networks remains a challenge. In this paper, we examine the challenge of accelerating J2EE-based enterprise web applications. We list obstacles and recommend some practices to transform typical database-driven J2EE applications into cache-friendly Web applications where Web acceleration solutions can be applied. Furthermore, such transformation should be done without modification to the underlying application business logic and without sacrificing functions that are essential to e-commerce. We take the J2EE reference software, the Java PetStore, as a case study. By using the proposed guideline, we are able to cache more than 90% of the content in the PetStore and scale up the Web site more than 20 times.
Keywords: application server, dynamic content, edge server, fragment, j2ee, reliability, scalability, web acceleration

Link analysis

Ranking the web frontier BIBAKFull-Text 309-318
  Nadav Eiron; Kevin S. McCurley; John A. Tomlin
The celebrated PageRank algorithm has proved to be a very effective paradigm for ranking results of web search algorithms. In this paper we refine this basic paradigm to take into account several evolving prominent features of the web, and propose several algorithmic innovations. First, we analyze features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another. We analyze the effect of these pages and find it to be significant. We suggest ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance. Finally we suggest new methods of ranking that are motivated by the hierarchical structure of the web, are more efficient than PageRank, and may be more resistant to direct manipulation.
Keywords: hypertext, PageRank, ranking
Link fusion: a unified link analysis framework for multi-type interrelated data objects BIBAKFull-Text 319-327
  Wensi Xi; Benyu Zhang; Zheng Chen; Yizhou Lu; Shuicheng Yan; Wei-Ying Ma; Edward Allan Fox
Web link analysis has proven to be a significant enhancement for quality based web search. Most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). Unfortunately, most link analysis research only considers one type of link. In this paper, we propose a unified link analysis framework, called "link fusion", which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. The PageRank and HITS algorithms are shown to be special cases of our unified link analysis framework. Experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the HITS and DirectHit algorithms by 24.6% and 38.2% respectively.
Keywords: data fusion, information retrieval, link analysis algorithms, link fusion
Sic transit gloria telae: towards an understanding of the web's decay BIBAKFull-Text 328-337
  Ziv Bar-Yossef; Andrei Z. Broder; Ravi Kumar; Andrew Tomkins
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
Keywords: 404 return code, dead links, link analysis, web decay, web information retrieval

Optimizing encoding

Using link analysis to improve layout on mobile devices BIBAKFull-Text 338-344
  Xinyi Yin; Wee Sun Lee
Delivering web pages to mobile phones or personal digital assistants has become possible with the latest wireless technology. However, mobile devices have very small screen sizes and memory capacities. Converting web pages for delivery to a mobile device is an exciting new problem. In this paper, we propose to use a ranking algorithm similar to Google's PageRank algorithm to rank the content objects within a web page. This allows the extraction of only important parts of web pages for delivery to mobile devices. Experiments show that the new method is effective. In experiments on pages from randomly selected websites, the system needed to extract and deliver only 39% of the objects in a web page in order to provide 85% of a viewer's desired viewing content. This provides significant savings in the wireless traffic and downloading time while providing a satisfactory reading experience on the mobile device.
Keywords: html, link analysis, pda (personal digital assistant), www (world wide web)
An evaluation of binary XML encoding optimizations for fast stream based XML processing BIBAKFull-Text 345-354
  R. J. Bayardo; D. Gruhl; V. Josifovski; J. Myllymaki
This paper provides an objective evaluation of the performance impacts of binary XML encodings, using a fast stream-based XQuery processor as our representative application. Instead of proposing one binary format and comparing it against standard XML parsers, we investigate the individual effects of several binary encoding techniques that are shared by many proposals. Our goal is to provide a deeper understanding of the performance impacts of binary XML encodings in order to clarify the ongoing and often contentious debate over their merits, particularly in the domain of high performance XML stream processing.
Keywords: XML binary formats, XPath processing
Optimization of html automatically generated by wysiwyg programs BIBAKFull-Text 355-364
  Jacqueline Spiesser; Les Kitchen
Automatically generated HTML, as produced by WYSIWYG programs, typically contains much repetitive and unnecessary markup. This paper identifies aspects of such HTML that may be altered while leaving a semantically equivalent document, and proposes techniques to achieve optimizing modifications. These techniques include attribute re-arrangement via dynamic programming, the use of style classes, and dead-code removal. These techniques produce documents as small as 33% of original size. The size decreases obtained are still significant when the techniques are used in combination with conventional text-based compression.
Keywords: dynamic programming, haskell, html optimization, wysiwyg

Semantic web applications

Building a companion website in the semantic web BIBAKFull-Text 365-373
  Timothy J. Miles-Board; Christopher P. Bailey; Wendy Hall; Leslie A. Carr
A problem facing many textbook authors (including one of the authors of this paper) is the inevitable delay between new advances in the subject area and their incorporation in a new (paper) edition of the textbook. This means that some textbooks are quickly considered out of date, particularly in active technological areas such as the Web, even though the ideas presented in the textbook are still valid and important to the community. This paper describes our approach to building a companion website for the textbook Hypermedia and the Web: An Engineering Approach. We use Bloom's taxonomy of educational objectives to critically evaluate a number of authoring and presentation techniques used in existing companion websites, and adapt these techniques to create our own companion website using Semantic Web technologies in order to overcome the identified weaknesses. Finally, we discuss a potential model of future companion websites, in the context of an e-publishing, e-commerce Semantic Web services scenario.
Keywords: bloom's taxonomy, companion website, electronic publishing, semantic web, textbook
A hybrid approach for searching in the semantic web BIBAKFull-Text 374-383
  Cristiano Rocha; Daniel Schwabe; Marcus Poggi Aragao
This paper presents a search architecture that combines classical search techniques with spread activation techniques applied to a semantic model of a given domain. Given an ontology, weights are assigned to links based on certain properties of the ontology, so that they measure the strength of the relation. Spread activation techniques are used to find related concepts in the ontology given an initial set of concepts and corresponding initial activation values. These initial values are obtained from the results of classical search applied to the data associated with the concepts in the ontology. Two test cases were implemented, with very positive results. It was also observed that the proposed hybrid spread activation, combining the symbolic and the sub-symbolic approaches, achieved better results when compared to each of the approaches alone.
Keywords: network analysis, ontologies, semantic associations, semantic search, semantic web, spread activation algorithms
CS AKTive space: representing computer science in the semantic web BIBAKFull-Text 384-392
  m. c. schraefel; Nigel R. Shadbolt; Nicholas Gibbins; Stephen Harris; Hugh Glaser
We present a Semantic Web application that we call CS AKTive Space. The application exploits a wide range of semantically heterogeneous and distributed content relating to Computer Science research in the UK. This content is gathered on a continuous basis using a variety of methods including harvesting and scraping as well as adopting a range of models for content acquisition. The content currently comprises around ten million RDF triples and we have developed storage, retrieval and maintenance methods to support its management. The content is mediated through an ontology constructed for the application domain and incorporates components from other published ontologies. CS AKTive Space supports the exploration of patterns and implications inherent in the content and exploits a variety of visualisations and multi-dimensional representations. Knowledge services supported in the application include investigating communities of practice: who is working, researching or publishing with whom. This work illustrates a number of substantial challenges for the Semantic Web. These include problems of referential integrity, tractable inference and interaction support. We review our approaches to these issues and discuss relevant related work.
Keywords: ontologies, semantic web, semantic web challenge groups

Reputation networks

Shilling recommender systems for fun and profit BIBAKFull-Text 393-402
  Shyong K. Lam; John Riedl
Recommender systems have emerged in the past several years as an effective way to help people cope with the problem of information overload. One application in which they have become particularly common is in e-commerce, where recommendation of items can often help a customer find what she is interested in and, therefore, can help drive sales. Unscrupulous producers in the never-ending quest for market penetration may find it profitable to shill recommender systems by lying to the systems in order to have their products recommended more often than those of their competitors. This paper explores four open questions that may affect the effectiveness of such shilling attacks: which recommender algorithm is being used, whether the application is producing recommendations or predictions, how detectable the attacks are by the operator of the system, and what the properties are of the items being attacked. The questions are explored experimentally on a large data set of movie ratings. Taken together, the results of the paper suggest that new ways must be used to evaluate and detect shilling attacks on recommender systems.
Keywords: collaborative filtering, recommender systems, shilling
Propagation of trust and distrust BIBAKFull-Text 403-412
  R. Guha; Ravi Kumar; Prabhakar Raghavan; Andrew Tomkins
A (directed) network of people connected by ratings or trust scores, and a model for propagating those trust scores, is a fundamental building block in many of today's most successful e-commerce and recommendation systems. We develop a framework of trust propagation schemes, each of which may be appropriate in certain circumstances, and evaluate the schemes on a large trust network consisting of 800K trust scores expressed among 130K people. We show that a small number of expressed trusts/distrust per individual allows us to predict trust between any two people in the system with high accuracy. Our work appears to be the first to incorporate distrust in a computational trust propagation setting.
Keywords: distrust, trust propagation, web of trust
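Trust-propagation schemes of this kind are usually expressed as operations on a trust matrix T, where T[i][j] is the trust user i expressed in user j, and propagation composes trusts along paths. A generic sketch of one such scheme, direct propagation with a per-hop decay (illustrative only; not the paper's specific combination of atomic propagations, and the decay and hop count are arbitrary):

```python
import numpy as np

def propagate_trust(T, steps=3, decay=0.5):
    """Accumulate trust along paths of up to `steps` hops.

    T: n x n matrix, T[i, j] = trust user i expressed in user j (0 if none).
    Each extra hop is discounted by `decay` (both values are arbitrary here).
    """
    T = np.asarray(T, dtype=float)
    total = np.zeros_like(T)
    hop = T.copy()
    for k in range(steps):
        total += (decay ** k) * hop
        hop = hop @ T  # trust reachable through one more intermediary
    return total

# Toy example: 0 trusts 1 and 1 trusts 2, so 0 acquires some trust in 2.
T = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(propagate_trust(T))
```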
A community-aware search engine BIBAKFull-Text 413-421
  Rodrigo B. Almeida; Virgilio A. F. Almeida
Current search technologies work in a "one size fits all" fashion. Therefore, the answer to a query is independent of the specific user's information need. In this paper we describe a novel ranking technique for personalized search services that combines content-based and community-based evidences. The community-based information is used in order to provide context for queries and is influenced by the current interaction of the user with the service. Our algorithm is evaluated using data derived from an actual service available on the Web: an online bookstore. We show that the quality of content-based ranking strategies can be improved by the use of community information as another evidential source of relevance. In our experiments the improvements reach up to 48% in terms of average precision.
Keywords: data mining, searching and ranking

Versioning and fragmentation

Managing versions of web documents in a transaction-time web server BIBAKFull-Text 422-432
  Curtis E. Dyreson; Hui-ling Lin; Yingxia Wang
This paper presents a transaction-time HTTP server, called TTApache, that supports document versioning. A document often consists of a main file formatted in HTML or XML and several included files such as images and stylesheets. A change to any of the files associated with a document creates a new version of that document. To construct a document version history, snapshots of the document's files are obtained over time. Transaction times are associated with each file version to record the version's lifetime. The transaction time is the system time of the edit that created the version. Accounting for transaction time is essential to supporting audit queries that delve into past document versions and differential queries that pinpoint differences between two versions. TTApache performs automatic versioning when a document is read, thereby removing the burden of versioning from document authors. Since some versions may be created but never read, TTApache distinguishes between known and assumed versions of a document. TTApache has a simple query language to retrieve desired versions. A browser can request a specific version, or the entire history of a document. Queries can also rewrite links and references to point to current or past versions. Over time, the version history of a document continually grows. To free space, some versions can be vacuumed. Vacuuming a version, however, changes the semantics of requests for that version. This paper presents several policies for vacuuming versions and strategies for accounting for vacuumed versions in queries.
Keywords: observant system, transaction time, versioning
Fine-grained, structured configuration management for web projects BIBAKFull-Text 433-442
  Tien Nhut Nguyen; Ethan Vincent Munson; Cheng Thao
Researchers in Web engineering have regularly noted that existing Web application development environments provide little support for managing the evolution of Web applications. Key limitations of Web development environments include line-oriented change models that inadequately represent Web document semantics and an inability to model changes to link structure or the set of objects making up the Web application. Developers may find it difficult to grasp how the overall structure of the Web application has changed over time and may respond by using ad hoc solutions that lead to problems of maintainability, quality and reliability. Web applications are software artifacts, and as such, can benefit from advanced version control and software configuration management (SCM) technologies from software engineering. We have modified an integrated development environment to manage the evolution and maintenance of Web applications. The resulting environment is distinguished by its fine-grained version control framework, fine-grained Web content change management, and product versioning configuration management, in which a Web project can be organized at the logical level and its structure and components are versioned in a fine-grained manner as well. This paper describes the motivation for this environment as well as its user interfaces, features, and implementation.
Keywords: software configuration management, version control, web engineering
Automatic detection of fragments in dynamically generated web pages BIBAKFull-Text 443-454
  Lakshmish Ramaswamy; Arun Iyengar; Ling Liu; Fred Douglis
Dividing web pages into fragments has been shown to provide significant benefits for both content generation and caching. In order for a web site to use fragment-based content generation, however, good methods are needed for dividing web pages into fragments. Manual fragmentation of web pages is expensive, error prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in web sites serving dynamic content. We consider the fragments to be interesting if they are shared among multiple documents or they have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a hierarchical and fragment-aware model of the dynamic web pages and a data structure that is compact and effective for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of adopting the fragments detected by our system on disk space utilization and network bandwidth consumption.
Keywords: L-P fragments, dynamic content caching, fragment detection, fragment-based caching, shared fragments
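One simple way to convey the notion of fragments shared across dynamically generated pages is to hash a canonical serialization of every subtree in each page and flag subtrees that occur in more than one page; the sketch below assumes well-formed XHTML input and invented toy pages, and does not reproduce the paper's detection algorithms, data structures, or lifetime/personalization analysis.

   import hashlib
   import xml.etree.ElementTree as ET
   from collections import defaultdict

   def subtree_hashes(root):
       """Yield a hash of the canonical serialization of every subtree."""
       for node in root.iter():
           canon = ET.tostring(node, encoding="unicode")
           yield hashlib.sha1(canon.encode("utf-8")).hexdigest()

   def shared_fragments(pages):
       """Return hashes of subtrees that occur in more than one page."""
       seen = defaultdict(set)
       for page_id, xhtml in pages.items():
           for h in subtree_hashes(ET.fromstring(xhtml)):
               seen[h].add(page_id)
       return {h for h, ids in seen.items() if len(ids) > 1}

   pages = {
       "a": "<html><body><div id='nav'><a href='/'>home</a></div><p>story A</p></body></html>",
       "b": "<html><body><div id='nav'><a href='/'>home</a></div><p>story B</p></body></html>",
   }
   print(len(shared_fragments(pages)))   # 2: the shared navigation <div> and the <a> inside it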

Semantic annotation and integration

Incremental formalization of document annotations through ontology-based paraphrasing BIBAKFull-Text 455-461
  Jim Blythe; Yolanda Gil
For the manual semantic markup of documents to become wide-spread, users must be able to express annotations that conform to ontologies (or schemas) that have shared meaning. However, a typical user is unlikely to be familiar with the details of the terms as defined by the ontology authors. In addition, the idea to be expressed may not fit perfectly within a pre-defined ontology. The ideal tool should help users find a partial formalization that closely follows the ontology where possible but deviates from the formal representation where needed. We describe an implemented approach to help users create semi-structured semantic annotations for a document according to an extensible OWL ontology. In our approach, users enter a short sentence in free text to describe all or part of a document, and the system presents a set of potential paraphrases of the sentence that are generated from valid expressions in the ontology, from which the user chooses the closest match. We use a combination of off-the-shelf parsing tools and breadth-first search of expressions in the ontology to help users create valid annotations starting from free text. The user can also define new terms to augment the ontology, so the potential matches can improve over time.
Keywords: document annotation, knowledge acquisition, semantic markup
Towards the self-annotating web BIBAKFull-Text 462-471
  Philipp Cimiano; Siegfried Handschuh; Steffen Staab
The success of the Semantic Web depends on the availability of ontologies as well as on the proliferation of web pages annotated with metadata conforming to these ontologies. Thus, a crucial question is where to acquire these metadata from. In this paper we propose PANKOW (Pattern-based Annotation through Knowledge on the Web), a method which employs an unsupervised, pattern-based approach to categorize instances with regard to an ontology. The approach is evaluated against the manual annotations of two human subjects. The approach is implemented in OntoMat, an annotation tool for the Semantic Web, and shows very promising results.
Keywords: information extraction, metadata, semantic annotation, semantic web
Web taxonomy integration using support vector machines BIBAKFull-Text 472-481
  Dell Zhang; Wee Sun Lee
We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. In this paper we attempt to use a powerful classification method, Support Vector Machine (SVM), to attack this problem. Our key insight is that the availability of the source taxonomy data could be helpful for building better classifiers in this scenario; therefore, it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose a method, Cluster Shrinkage (CS), to further enhance the classification by exploiting such implicit knowledge. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.
Keywords: classification, ontology mapping, semantic web, support vector machines, taxonomy integration, transductive learning
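For orientation, a sketch of the straightforward inductive baseline the abstract starts from, an SVM trained on the master taxonomy and applied to source-taxonomy objects, using scikit-learn and invented toy data; the transductive learning and Cluster Shrinkage enhancements proposed in the paper are not shown.

   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.svm import LinearSVC

   # Toy documents already filed under the master taxonomy.
   master_docs = ["canon digital camera review", "nikon lens and camera body",
                  "laptop battery life benchmark", "notebook cpu and ram specs"]
   master_cats = ["cameras", "cameras", "laptops", "laptops"]

   # Toy documents coming from a source taxonomy that must be integrated.
   source_docs = ["compact camera with zoom lens", "ultralight notebook with long battery"]

   vec = TfidfVectorizer()
   clf = LinearSVC().fit(vec.fit_transform(master_docs), master_cats)
   print(clf.predict(vec.transform(source_docs)))   # expected: ['cameras' 'laptops']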

Mining new media

Newsjunkie: providing personalized newsfeeds via analysis of information novelty BIBAKFull-Text 482-490
  Evgeniy Gabrilovich; Susan Dumais; Eric Horvitz
We present a principled methodology for filtering news stories by formal measures of information novelty, and show how the techniques can be used to custom-tailor news feeds based on information that a user has already reviewed. We review methods for analyzing novelty and then describe Newsjunkie, a system that personalizes news for users by identifying the novelty of stories in the context of stories they have already reviewed. Newsjunkie employs novelty-analysis algorithms that represent articles as words and named entities. The algorithms analyze inter- and intra-document dynamics by considering how information evolves over time from article to article, as well as within individual articles. We review the results of a user study undertaken to gauge the value of the approach over legacy time-based review of newsfeeds, and also to compare the performance of alternate distance metrics that are used to estimate the dissimilarity between candidate new articles and sets of previously reviewed articles.
Keywords: news, novelty detection, personalization
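A toy example of one distance-based novelty measure of the kind such a study might compare: bag-of-words cosine distance from the centroid of previously reviewed stories. This is only illustrative; Newsjunkie's actual representations also use named entities and inter- and intra-document dynamics.

   import math
   from collections import Counter

   def bow(text):
       return Counter(text.lower().split())

   def cosine(a, b):
       dot = sum(a[t] * b.get(t, 0) for t in a)
       na = math.sqrt(sum(v * v for v in a.values()))
       nb = math.sqrt(sum(v * v for v in b.values()))
       return dot / (na * nb) if na and nb else 0.0

   def novelty(candidate, reviewed):
       """1 minus cosine similarity to the centroid of previously reviewed stories."""
       centroid = Counter()
       for doc in reviewed:
           centroid.update(bow(doc))
       return 1.0 - cosine(bow(candidate), centroid)

   seen = ["earthquake strikes coastal city", "rescue teams reach earthquake zone"]
   print(novelty("aftershocks hit the coastal city", seen))    # lower: mostly familiar content
   print(novelty("central bank raises interest rates", seen))  # 1.0: entirely novel story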
Information diffusion through blogspace BIBAKFull-Text 491-501
  Daniel Gruhl; R. Guha; David Liben-Nowell; Andrew Tomkins
We study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. We characterize and model this collection at two levels. First, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-running "chatter" topics consisting recursively of "spike" topics generated by outside world events, or more rarely, by resonances within the community. Second, we present a microscopic characterization of propagation from individual to individual, drawing on the theory of infectious diseases to model the flow. We propose, validate, and employ an algorithm to induce the underlying propagation network from a sequence of posts, and report on the results.
Keywords: blogs, information propagation, memes, topic characterization, topic structure, viral propagation, viruses
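A toy, cascade-style simulation in the spirit of the epidemic view described above, with made-up per-link copy probabilities; the paper's contribution is inducing such a propagation network from sequences of posts, which this sketch does not attempt.

   import random

   def simulate_spread(graph, prob, seeds, rng=random.Random(42)):
       """Each newly 'infected' blog gets one chance to pass the topic along
       each outgoing link, with a per-link copy probability."""
       infected, frontier = set(seeds), list(seeds)
       while frontier:
           nxt = []
           for u in frontier:
               for v in graph.get(u, []):
                   if v not in infected and rng.random() < prob.get((u, v), 0.0):
                       infected.add(v)
                       nxt.append(v)
           frontier = nxt
       return infected

   graph = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": ["dave"], "dave": []}
   prob = {("alice", "bob"): 0.9, ("alice", "carol"): 0.3,
           ("bob", "dave"): 0.8, ("carol", "dave"): 0.8}
   print(simulate_spread(graph, prob, seeds=["alice"]))   # blogs the topic reaches in this run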
Automatic web news extraction using tree edit distance BIBAKFull-Text 502-511
  D. C. Reis; P. B. Golgher; A. S. Silva; A. F. Laender
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made in order to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed to address the problem of Web data extraction, their use is still not widespread, mostly because of the need for extensive human intervention and the low quality of the extraction results.
   In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
Keywords: data extraction, edit distance, schema inference, web

Workload analysis

Accurate, scalable in-network identification of p2p traffic using application signatures BIBAKFull-Text 512-521
  Subhabrata Sen; Oliver Spatscheck; Dongmei Wang
The ability to accurately identify the network traffic associated with different P2P applications is important to a broad range of network operations including application-specific traffic engineering, capacity planning, provisioning, service differentiation, etc. However, traditional techniques for mapping traffic to higher-level applications, such as disambiguation based on default server TCP or UDP network ports, are highly inaccurate for some P2P applications.
   In this paper, we provide an efficient approach for identifying P2P application traffic through application-level signatures. We first identify the application-level signatures by examining available documentation and packet-level traces. We then utilize the identified signatures to develop online filters that can efficiently and accurately track the P2P traffic even on high-speed network links.
   We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than 5% false positive and false negative ratios in most cases. We also show that our approach only requires the examination of the very first few packets (fewer than 10 packets) to identify a P2P connection, which makes our approach highly scalable. Our technique can significantly improve the P2P traffic volume estimates over what purely port-based approaches provide. For instance, we were able to identify 3 times as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach.
Keywords: application-level signatures, online application classification, p2p, traffic analysis
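The idea of application-level signature matching can be illustrated as follows, using hypothetical, simplified byte patterns rather than the carefully derived per-protocol signatures the paper evaluates.

   # Hypothetical, simplified payload signatures; real deployments need the
   # carefully derived per-protocol patterns discussed in the paper.
   SIGNATURES = {
       "gnutella": [b"GNUTELLA CONNECT", b"GNUTELLA/"],
       "http-like": [b"GET ", b"HTTP/1."],
   }

   def classify_payload(payload: bytes, max_bytes: int = 64) -> str:
       """Label a flow by scanning only the first few payload bytes,
       mirroring the observation that a handful of packets suffices."""
       head = payload[:max_bytes]
       for app, patterns in SIGNATURES.items():
           if any(p in head for p in patterns):
               return app
       return "unknown"

   print(classify_payload(b"GNUTELLA CONNECT/0.6\r\n"))   # gnutella
   print(classify_payload(b"\x13BitTorrent protocol"))    # unknown (no signature configured)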
Characterization of a large web site population with implications for content delivery BIBAKFull-Text 522-533
  L. Bent; M. Rabinovich; G. M. Voelker; Z. Xiao
This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate potential performance benefits of content delivery networks (CDNs) for these Web sites. We make several interesting observations about the current usage of Web technologies and Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher degree of uncacheable responses and responses that require mandatory cache validations. A significant reason for this is that cookie use is prevalent among our population, especially among more popular sites. However, we found an indication of widespread indiscriminate usage of cookies, which unnecessarily impedes the use of many content delivery optimizations. We also found that most Web sites do not utilize the cache-control features of the HTTP 1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is constrained by the maximum values allowed in the Squid proxy. Finally, our simulation results indicate that most Web sites benefit from the use of a CDN. The amount of the benefit depends on site popularity, and, somewhat surprisingly, a CDN may increase the peak-to-average request ratio at the origin server because the CDN can decrease the average request rate more than the peak request rate.
Keywords: content distribution, cookie, http, measurement, performance, web caching, workload characterization
Analyzing client interactivity in streaming media BIBAKFull-Text 534-543
  Cristiano P. Costa; Italo S. Cunha; Alex Borges; Claudiney V. Ramos; Marcus M. Rocha; Jussara M. Almeida; Berthier Ribeiro-Neto
This paper provides an extensive analysis of pre-stored streaming media workloads, focusing on the client interactive behavior. We analyze four workloads that fall into three different domains, namely, education, entertainment video and entertainment audio. Our main goals are: (a) to identify qualitative similarities and differences in the typical client behavior for the three workload classes and (b) to provide data for generating realistic synthetic workloads.
Keywords: streaming media, workload characterization

Semantic web services

Augmenting semantic web service descriptions with compositional specification BIBAKFull-Text 544-552
  Monika Solanki; Antonio Cau; Hussein Zedan
Current ontological specifications for semantically describing properties of Web services are limited to their static interface description. Normally, for proving properties of service compositions, mapping input/output parameters and specifying pre/post conditions is found to be sufficient. However, these properties are assertions only on the initial and final states of the service, respectively. They do not help in specifying or verifying the ongoing behaviour of an individual service or of a composed system. We propose a framework for enriching semantic service descriptions with two compositional assertions, assumption and commitment, that facilitate reasoning about service composition and verification of their integration. The technique is based on Interval Temporal Logic (ITL): a sound formalism for specifying and proving temporal properties of systems. Our approach utilizes the recently proposed Semantic Web Rule Language.
Keywords: assumption, commitment, interval temporal logics, owl, owl-s, semantic web services, swrl, web services
Meteor-s web service annotation framework BIBAKFull-Text 553-562
  Abhijit A. Patil; Swapna A. Oundhakar; Amit P. Sheth; Kunal Verma
The World Wide Web is emerging not only as an infrastructure for data, but also for a broader variety of resources that are increasingly being made available as Web services. Relevant current standards like UDDI, WSDL, and SOAP are in their fledgling years and form the basis of making Web services a workable and broadly adopted technology. However, realizing the fuller scope of the promise of Web services and the associated service-oriented architecture will require further technological advances in the areas of service interoperation, service discovery, service composition, and process orchestration. Semantics, especially as supported by the use of ontologies, and related Semantic Web technologies, are likely to provide better qualitative and scalable solutions to these requirements. Just as semantic annotation of data in the Semantic Web is the first critical step to better search, integration and analytics over heterogeneous data, semantic annotation of Web services is an equally critical first step to achieving the above promise. Our approach is to work with existing Web services technologies and combine them with ideas from the Semantic Web to create a better framework for Web service discovery and composition. In this paper we present MWSAF (METEOR-S Web Service Annotation Framework), a framework for semi-automatically marking up Web service descriptions with ontologies. We have developed algorithms to match and annotate WSDL files with relevant ontologies. We use domain ontologies to categorize Web services into domains. An empirical study of our approach is presented to help evaluate its performance.
Keywords: ontology, semantic annotation of web services, semantic web services, web services discovery, wsdl
Foundations for service ontologies: aligning OWL-S to dolce BIBAKFull-Text 563-572
  Peter Mika; Daniel Oberle; Aldo Gangemi; Marta Sabou
Clarity in semantics and a rich formalization of this semantics are important requirements for ontologies designed to be deployed in large-scale, open, distributed systems such as the envisioned Semantic Web. This is especially important for the description of Web Services, which should enable complex tasks involving multiple agents. As one of the first initiatives of the Semantic Web community for describing Web Services, OWL-S attracts a lot of interest even though it is still under development. We identify problematic aspects of OWL-S and suggest enhancements through alignment to a foundational ontology. Another contribution of our work is the Core Ontology of Services that tries to fill the epistemological gap between the foundational ontology and OWL-S. It can be reused to align other Web Service description languages as well. Finally, we demonstrate the applicability of our work by aligning OWL-S' standard example called CongoBuy.
Keywords: core ontology of services, daml-s, descriptions and situations, dolce, owl-s, semantic web, web services

Search engineering 2

Mining models of human activities from the web BIBAKFull-Text 573-582
  Mike Perkowitz; Matthai Philipose; Kenneth Fishkin; Donald J. Patterson
The ability to determine what day-to-day activity (such as cooking pasta, taking a pill, or watching a video) a person is performing is of interest in many application domains. A system that can do this requires models of the activities of interest, but model construction does not scale well: humans must specify low-level details, such as segmentation and feature selection of sensor data, and high-level structure, such as spatio-temporal relations between states of the model, for each and every activity. As a result, previous practical activity recognition systems have been content to model a tiny fraction of the thousands of human activities that are potentially useful to detect. In this paper, we present an approach to sensing and modeling activities that scales to a much larger class of activities than before. We show how a new class of sensors, based on Radio Frequency Identification (RFID) tags, can directly yield semantic terms that describe the state of the physical world. These sensors allow us to formulate activity models by translating labeled activities, such as 'cooking pasta', into probabilistic collections of object terms, such as 'pot'. Given this view of activity models as text translations, we show how to mine definitions of activities in an unsupervised manner from the web. We have used our technique to mine definitions for over 20,000 activities. We experimentally validate our approach using data gathered from actual human activity as well as simulated data.
Keywords: activity inference, activity models, rfid, web mining
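A highly simplified rendering of activity models as probabilistic collections of object terms, with invented probabilities standing in for web-mined ones: given objects observed via RFID, each activity is scored by how well it explains the observation. This is only a sketch of the modeling idea, not the paper's mining or inference machinery.

   import math

   # Toy activity models: P(object | activity), standing in for web-mined ones.
   MODELS = {
       "cooking pasta": {"pot": 0.4, "stove": 0.3, "spoon": 0.2, "colander": 0.1},
       "making tea":    {"kettle": 0.5, "cup": 0.3, "spoon": 0.2},
   }

   def score(observed_objects, model, smoothing=1e-3):
       """Log-likelihood of the observed object sequence under one activity model."""
       return sum(math.log(model.get(obj, smoothing)) for obj in observed_objects)

   def recognize(observed_objects):
       return max(MODELS, key=lambda a: score(observed_objects, MODELS[a]))

   print(recognize(["pot", "spoon", "stove"]))   # cooking pasta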
TeXQuery: a full-text search extension to XQuery BIBAKFull-Text 583-594
  S. Amer-Yahia; C. Botev; J. Shanmugasundaram
One of the key benefits of XML is its ability to represent a mix of structured and unstructured (text) data. Although current XML query languages such as XPath and XQuery can express rich queries over structured data, they can only express very rudimentary queries over text data. We thus propose TeXQuery, which is a powerful full-text search extension to XQuery. TeXQuery provides a rich set of fully composable full-text search primitives, such as Boolean connectives, phrase matching, proximity distance, stemming and thesauri. TeXQuery also enables users to seamlessly query over both structured and text data by embedding TeXQuery primitives in XQuery, and vice versa. Finally, TeXQuery supports a flexible scoring construct that can be used to score query results based on full-text predicates. TeXQuery is the precursor of the full-text language extensions to XPath 2.0 and XQuery 1.0 currently being developed by the W3C.
Keywords: full-text search, XQuery
The WebGraph framework I: compression techniques BIBAKFull-Text 595-602
  P. Boldi; S. Vigna
Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow a web graph to be stored in memory in limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 M nodes, 1 G links) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
Keywords: compression, web graph
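The two ideas named above can be conveyed with a stripped-down sketch: gap encoding of a sorted successor list, and reference-based encoding of a list as a copy mask over a similar node's list. The real framework additionally uses bit-level instantaneous codes and interval representations, which are omitted here.

   def gap_encode(successors):
       """Encode a sorted successor list as its first element plus gaps."""
       succ = sorted(successors)
       if not succ:
           return []
       return [succ[0]] + [b - a for a, b in zip(succ, succ[1:])]

   def reference_encode(successors, reference):
       """Encode a list as a copy mask over a similar node's list plus leftover extras."""
       succ = set(successors)
       mask = [1 if r in succ else 0 for r in reference]
       extras = sorted(succ - set(reference))
       return mask, gap_encode(extras)

   node_a = [15, 16, 17, 22, 23, 24, 315]
   node_b = [15, 16, 17, 23, 24, 316]
   print(gap_encode(node_a))                 # [15, 1, 1, 5, 1, 1, 291]
   print(reference_encode(node_b, node_a))   # ([1, 1, 1, 0, 1, 1, 0], [316])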

Infrastructure for implementation

XQuery at your web service BIBAKFull-Text 603-611
  Nicola Onose; Jerome Simeon
XML messaging is at the heart of Web services, providing the flexibility required for their deployment, composition, and maintenance. Yet, current approaches to Web services development hide the messaging layer behind Java or C# APIs, preventing the application from getting direct access to the underlying XML information. To address this problem, we advocate the use of a native XML language, namely XQuery, as an integral part of the Web services development infrastructure. The main contribution of the paper is a binding between WSDL, the Web Services Description Language, and XQuery. The approach enables the use of XQuery for both Web services deployment and composition. We present a simple command-line tool that can be used to automatically deploy a Web service from a given XQuery module, and extend the XQuery language itself with a statement for accessing one or more Web services. The binding provides tight coupling between WSDL and XQuery, yielding additional benefits, notably: the ability to use WSDL as an interface language for XQuery, and the ability to perform static typing on XQuery programs that include Web service calls. Last but not least, the proposal requires only minimal changes to the existing infrastructure. We report on our experience implementing this approach in the Galax XQuery processor.
Keywords: XML, XQuery, interface, modules, web services, wsdl
Adapting databases and WebDAV protocol BIBAFull-Text 612-620
  Bita Shadgar; Ian Holyer
The ability of the Web to share data regardless of geographical location raises a new issue called remote authoring. With the Internet and Web browsers being independent of hardware, it becomes possible to build Web-enabled database applications. Many approaches have been proposed to integrate databases into the Web environment, which use the Web's protocol, i.e. HTTP, to transfer data between clients and servers. However, those methods are affected by HTTP's shortcomings with regard to remote authoring. This paper introduces and discusses a new methodology for remote authoring of databases, which is based on the WebDAV protocol. It is a seamless and effective methodology for accessing and authoring databases, particularly in that it naturally benefits from the WebDAV advantages such as metadata and access control. These features establish a standard way of accessing database metadata, and increase the database security, while speeding up the database connection.
Analysis of interacting BPEL web services BIBAKFull-Text 621-630
  Xiang Fu; Tevfik Bultan; Jianwen Su
This paper presents a set of tools and techniques for analyzing interactions of composite web services which are specified in BPEL and communicate through asynchronous XML messages. We model the interactions of composite web services as conversations, the global sequence of messages exchanged by the web services. As opposed to earlier work, our tool-set handles rich data manipulation via XPath expressions. This allows us to verify designs at a more detailed level and check properties about message content. We present a framework where BPEL specifications of web services are translated to an intermediate representation, followed by the translation of the intermediate representation to a verification language. As an intermediate representation we use guarded automata augmented with unbounded queues for incoming messages, where the guards are expressed as XPath expressions. As the target verification language we use Promela, the input language of the model checker SPIN. Since the SPIN model checker is a finite-state verification tool, we can only achieve partial verification by fixing the sizes of the input queues in the translation. We propose the concept of synchronizability to address this problem. We show that if a composite web service is synchronizable, then its conversation set remains the same when asynchronous communication is replaced with synchronous communication. We give a set of sufficient conditions that guarantee synchronizability and that can be checked statically. Based on our synchronizability results, we show that a large class of composite web services with unbounded input queues can be completely verified using a finite state model checker such as SPIN.
Keywords: BPEL, asynchronous communication, conversation, model checking, spin, synchronizability, web service, XPath

Distributed semantic query

Index structures and algorithms for querying distributed RDF repositories BIBAKFull-Text 631-639
  Heiner Stuckenschmidt; Richard Vdovjak; Geert-Jan Houben; Jeen Broekstra
A technical infrastructure for storing, querying and managing RDF data is a key element in the current semantic web development. Systems like Jena, Sesame or the ICS-FORTH RDF Suite are widely used for building semantic web applications. Currently, none of these systems supports the integrated querying of distributed RDF repositories. We consider this a major shortcoming since the semantic web is distributed by nature. In this paper we present an architecture for querying distributed RDF repositories by extending the existing Sesame system. We discuss the implications of our architecture and propose an index structure as well as algorithms for query processing and optimization in such a distributed context.
Keywords: RDF querying, index structures, optimization
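One ingredient of such an architecture can be illustrated with a toy source index that maps predicates to the repositories able to answer them, so that a mediator forwards each query part only to relevant sources. The class, repository names, and data below are hypothetical, and the paper's index structures and optimizations are considerably richer.

   from collections import defaultdict

   class SourceIndex:
       """Map RDF predicates to the repositories known to contain them."""
       def __init__(self):
           self.by_predicate = defaultdict(set)

       def register(self, repository, predicates):
           for p in predicates:
               self.by_predicate[p].add(repository)

       def sources_for(self, query_predicates):
           """Repositories that must be contacted for each predicate in a query."""
           return {p: self.by_predicate.get(p, set()) for p in query_predicates}

   idx = SourceIndex()
   idx.register("sesame-a", ["foaf:name", "foaf:knows"])
   idx.register("sesame-b", ["dc:title", "foaf:name"])
   print(idx.sources_for(["foaf:name", "dc:title"]))
   # {'foaf:name': {'sesame-a', 'sesame-b'}, 'dc:title': {'sesame-b'}}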
REMINDIN': semantic query routing in peer-to-peer networks based on social metaphors BIBAKFull-Text 640-649
  Christoph Tempich; Steffen Staab; Adrian Wranik
In peer-to-peer networks, finding the appropriate answer for an information request, such as the answer to a query for RDF(S) data, depends on selecting the right peer in the network. We here investigate how social metaphors can be exploited effectively and efficiently to solve this task. To this end, we define a method for query routing, REMINDIN', that lets peers (i) observe which queries are successfully answered by other peers, (ii) memorize these observations, and (iii) subsequently use this information in order to select peers to forward requests to.
   REMINDIN' has been implemented for the SWAP peer-to-peer platform as well as for a simulation environment. We have used the simulation environment in order to investigate how successful variations of REMINDIN' are and how they compare to baseline strategies in terms of number of messages forwarded in the network and statements appropriately retrieved.
Keywords: ontologies, peer selection, peer-to-peer, query routing
RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network BIBAKFull-Text 650-657
  Min Cai; Martin Frank
Centralized Resource Description Framework (RDF) repositories have limitations both in their failure tolerance and in their scalability. Existing Peer-to-Peer (P2P) RDF repositories either cannot guarantee to find query results, even if these results exist in the network, or require up-front definition of RDF schemas and designation of super peers. We present a scalable distributed RDF repository (RDFPeers) that stores each triple at three places in a multi-attribute addressable network by applying globally known hash functions to its subject, predicate, and object. Thus all nodes know which node is responsible for storing the triple values they are looking for, and both exact-match and range queries can be efficiently routed to those nodes. RDFPeers has no single point of failure and no elevated peers, and does not require the prior definition of RDF schemas. Queries are guaranteed to find matched triples in the network if the triples exist. In RDFPeers both the number of neighbors per node and the number of routing hops for inserting RDF triples and for resolving most queries are logarithmic to the number of nodes in the network. We further performed experiments that show that the triple-storing load in RDFPeers differs by less than an order of magnitude between the most and the least loaded nodes for real-world RDF data.
Keywords: distributed RDF repositories, peer-to-peer, semantic web
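The core placement rule can be sketched with consistent hashing onto a toy ring of node identifiers; the actual system uses a multi-attribute addressable network and also supports range queries, which this sketch omits.

   import hashlib

   NODE_IDS = sorted(int(hashlib.sha1(f"node-{i}".encode()).hexdigest(), 16) % 2**16
                     for i in range(8))

   def ring_position(value: str) -> int:
       return int(hashlib.sha1(value.encode()).hexdigest(), 16) % 2**16

   def responsible_node(value: str) -> int:
       """First node clockwise from the value's position on the identifier ring."""
       pos = ring_position(value)
       return next((n for n in NODE_IDS if n >= pos), NODE_IDS[0])

   def store_triple(subject, predicate, obj):
       """Store the triple three times, once per hashed component, so lookups
       by subject, predicate or object all know which node to ask."""
       return {part: responsible_node(part) for part in (subject, predicate, obj)}

   print(store_triple("http://example.org/alice", "foaf:knows", "http://example.org/bob"))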

Query result processing

A hierarchical monothetic document clustering algorithm for summarization and browsing search results BIBAKFull-Text 658-665
  Krishna Kummamuru; Rohit Lotlikar; Shourya Roy; Karan Singal; Raghu Krishnapuram
Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm to build a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes the coverage while maintaining distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, the ultimate test being user judgment. We use several objective measures such as coverage and reach time for an empirical comparison of the proposed algorithm with two other monothetic clustering algorithms to demonstrate its superiority. Even though our algorithm is slightly more computationally intensive than one of the algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
Keywords: automatic taxonomy generation, clustering, data mining, search, summarization
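A bare-bones greedy step in the monothetic spirit described above: each topic is defined by a single term, chosen to cover as many not-yet-covered results as possible. DisCover's actual criterion additionally balances coverage against distinctiveness and builds a full hierarchy, which this toy sketch does not.

   def pick_topics(results, k=2):
       """Greedily pick k single-term topics, each maximizing coverage of
       the results not yet covered by previously chosen topics."""
       terms = [set(r.lower().split()) for r in results]
       uncovered, topics = set(range(len(results))), []
       for _ in range(k):
           vocab = {t for i in uncovered for t in terms[i]}
           if not vocab:
               break
           best = max(vocab, key=lambda t: sum(1 for i in uncovered if t in terms[i]))
           topics.append(best)
           uncovered -= {i for i in uncovered if best in terms[i]}
       return topics

   results = ["used car dealer prices", "new car review",
              "wild cat habitat", "big cat facts", "cat food review"]
   print(pick_topics(results))   # ['cat', 'car']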
Mining anchor text for query refinement BIBAKFull-Text 666-674
  Reiner Kraft; Jason Zien
When searching large hypertext document collections, it is often possible that there are too many results available for ambiguous queries. Query refinement is an interactive process of query modification that can be used to narrow down the scope of search results. We propose a new method for automatically generating refinements or related terms to queries by mining anchor text for a large hypertext document collection. We show that the usage of anchor text as a basis for query refinement produces high quality refinement suggestions that are significantly better in terms of perceived usefulness compared to refinements that are derived using the document content. Furthermore, our study suggests that anchor text refinements can also be used to augment traditional query refinement algorithms based on query logs, since they typically differ in coverage and produce different refinements. Our results are based on experiments on an anchor text collection of a large corporate intranet.
Keywords: anchor text, query refinement, rank, web search
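A toy rendering of the basic idea, suggesting refinements from anchor-text phrases that contain the original query; the data below is invented, and the paper's method ranks and filters candidates far more carefully.

   from collections import Counter

   def refinements(query, anchor_texts, top_n=3):
       """Suggest longer anchor-text phrases that contain the original query."""
       q = query.lower()
       candidates = Counter(a.lower().strip() for a in anchor_texts
                            if q in a.lower() and a.lower().strip() != q)
       return [phrase for phrase, _ in candidates.most_common(top_n)]

   anchors = ["java tutorial", "java tutorial for beginners", "java tutorial",
              "advanced java tutorial", "python tutorial"]
   print(refinements("java tutorial", anchors))
   # ['java tutorial for beginners', 'advanced java tutorial']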
Adaptive web search based on user profile constructed without any effort from users BIBAKFull-Text 675-684
  Kazunari Sugiyama; Kenji Hatano; Masatoshi Yoshikawa
Web search engines help users find useful information on the World Wide Web (WWW). However, when the same query is submitted by different users, typical search engines return the same result regardless of who submitted the query. Generally, each user has different information needs for his/her query. Therefore, the search result should be adapted to users with different information needs. In this paper, we first propose several approaches to adapting search results according to each user's need for relevant information without any user effort, and then verify the effectiveness of our proposed approaches. Experimental results show that search systems that adapt to each user's preferences can be achieved by constructing user profiles based on modified collaborative filtering with detailed analysis of the user's browsing history in one day.
Keywords: WWW, information retrieval, user modeling
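A minimal sketch of re-ranking results against a profile built from browsing history, using plain term frequencies; the paper constructs profiles with modified collaborative filtering, which this does not attempt, and the history and result strings are invented.

   from collections import Counter

   def profile_from_history(pages):
       """A user profile as aggregate term frequencies over browsed pages."""
       profile = Counter()
       for page in pages:
           profile.update(page.lower().split())
       return profile

   def personalize(results, profile):
       """Re-rank results by the total profile weight of the terms they contain."""
       def score(r):
           return sum(profile.get(t, 0) for t in set(r.lower().split()))
       return sorted(results, key=score, reverse=True)

   history = ["python pandas dataframe tutorial", "numpy array broadcasting"]
   results = ["snake handling safety tips", "python dataframe groupby examples"]
   print(personalize(results, profile_from_history(history)))   # dataframe page ranks first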

Web site analysis and customization

Practical semantic analysis of web sites and documents BIBAKFull-Text 685-693
  Thierry Despeyroux
As Web sites are now ordinary products, it is necessary to make the notion of the quality of a Web site explicit. The quality of a site may be linked to its ease of access and also to other criteria, such as the fact that the site is up to date and coherent. This last quality is difficult to ensure because sites may be updated very frequently, may have many authors, and may be partially generated, and in this context proof-reading is very difficult. The same piece of information may occur in several places, in data as well as in metadata, leading to the need for consistency checking. In this paper we draw a parallel between programs and Web sites. We present some examples of semantic constraints that one would like to specify (constraints between the meaning of categories and sub-categories in a thematic directory, consistency between the organization chart and the rest of the site in an academic site). We briefly present Natural Semantics, a way to specify the semantics of programming languages that inspires our work. Natural Semantics itself derives from both operational semantics and logic programming, and its implementation uses Prolog. We then propose a specification language for semantic constraints in Web sites that, in conjunction with the well-known "make" program, makes it possible to generate site verification tools by compiling the specification into Prolog code. We apply our method to a large XML document, the scientific part of our institute's activity report, tracking errors and inconsistencies and also constructing indicators that can be used by the management of the institute.
Keywords: XML, consistency, content management, formal semantics, information system, knowledge management, logic programming, quality, web engineering, web site evolution, web sites
Web customization using behavior-based remote executing agents BIBAKFull-Text 694-703
  Eugene Hung; Joseph Pasquale
ReAgents are remotely executing agents that customize Web browsing for non-standard clients. A reAgent is essentially a "one-shot" mobile agent that acts as an extension of a client, dynamically launched by the client to run on its behalf at a remote, more advantageous location. ReAgents simplify the use of mobile agent technology by transparently handling data migration and run-time network communications, and by providing a general interface for programmers to more easily implement their application-specific customizing logic. This is made possible by the identification of useful remote behaviors, i.e. common patterns of actions that exploit the ability to process and communicate remotely. Examples of such behaviors are transformers, monitors, cachers, and collators. In this paper we identify a set of useful reAgent behaviors for interacting with Web services via a standard browser, describe how to program and use reAgents, and show that the overhead of using reAgents is low and outweighed by their benefits.
Keywords: dynamic deployment, remote agents, web customization

Semantic web foundations

A possible simplification of the semantic web architecture BIBAKFull-Text 704-713
  Bernardo Cuenca Grau
In the semantic Web architecture, Web ontology languages are built on top of RDF(S). However, serious difficulties have arisen when trying to layer expressive ontology languages, like OWL, on top of RDF-Schema. Although these problems can be avoided, OWL (and the whole semantic Web architecture) becomes much more complex than it should be. In this paper, a possible simplification of the semantic Web architecture is suggested, which has several important advantages with respect to the layering currently accepted by the W3C Ontology Working Group.
Keywords: description logics, ontology web language (OWL), resource description framework (RDF), resource description framework schema (RDF-schema), semantic web
A combined approach to checking web ontologies BIBAKFull-Text 714-722
  J. S. Dong; C. H. Lee; H. B. Lee; Y. F. Li; H. Wang
The understanding of Semantic Web documents is built upon ontologies that define concepts and relationships of data. Hence, the correctness of ontologies is vital. Ontology reasoners such as RACER and FaCT have been developed to reason about ontologies with a high degree of automation. However, complex ontology-related properties may not be expressible within the current web ontology languages; consequently, they may not be checkable by RACER and FaCT. We propose to use software engineering techniques and tools, i.e., Z/EVES and the Alloy Analyzer, to complement the ontology tools for checking Semantic Web documents.
   In this approach, Z/EVES is first applied to remove trivial syntax and type errors of the ontologies. Next, RACER is used to identify any ontological inconsistencies, whose origins can be traced by Alloy Analyzer. Finally Z/EVES is used again to express complex ontology-related properties and reveal errors beyond the modeling capabilities of the current web ontology languages. We have successfully applied this approach to checking a set of military plan ontologies.
Keywords: alloy, daml+oil, ontologies, racer, semantic web, z
A proposal for an owl rules language BIBAKFull-Text 723-731
  Ian Horrocks; Peter F. Patel-Schneider
Although the OWL Web Ontology Language adds considerable expressive power to the Semantic Web, it does have expressive limitations, particularly with respect to what can be said about properties. We present ORL (OWL Rules Language), a Horn clause rules extension to OWL that overcomes many of these limitations. ORL extends OWL in a syntactically and semantically coherent manner: the basic syntax for ORL rules is an extension of the abstract syntax for OWL DL and OWL Lite; ORL rules are given formal meaning via an extension of the OWL DL model-theoretic semantics; ORL rules are given an XML syntax based on the OWL XML presentation syntax; and a mapping from ORL rules to RDF graphs is given based on the OWL RDF/XML exchange syntax. We discuss the expressive power of ORL, showing that the ontology consistency problem is undecidable, provide several examples of ORL usage, and discuss how reasoning support for ORL might be provided.
Keywords: model-theoretic semantics, representation, semantic web