
Proceedings of the 2005 International Conference on the World Wide Web

Fullname: Proceedings of the 14th International Conference on World Wide Web
Editors: Allan Ellis; Tatsuya Hagino
Location: Chiba, Japan
Dates: 2005-May-10 to 2005-May-14
Volume: 1
Publisher: ACM
Standard No: ISBN 1-59593-046-9; ACM DL: Table of Contents; hcibib: WWW05-1
Papers: 82
Pages: 769
Links: Conference Home Page
  1. WWW 2005-05-10 Volume 1
    1. Keynote
    2. Usage analysis
    3. Wide-area architecture and protocols
    4. Data extraction
    5. Keynote
    6. Semantic querying
    7. Web services
    8. Web application design
    9. Semantic Web
    10. Applications
    11. Indexing and querying
    12. Keynote
    13. XML query and programming languages
    14. Web-based educational applications
    15. Text analysis and extraction
    16. Keynote
    17. Web engineering with semantic annotation
    18. User-focused search and crawling
    19. Trustworthy Web sites
    20. Semantic search
    21. Security through the eyes of users
    22. Measurements and analysis
    23. Keynote
    24. Service selection and metadata
    25. Link-based ranking
    26. Improving the browsing experience
    27. Semantic Web foundations
    28. Link-based similarity
    29. XML parsing and stylesheets
    30. Schemas and semantics
    31. Architecture and implementation of Web sites

WWW 2005-05-10 Volume 1

Keynote

WWW at 15 years: looking forward BIBAFull-Text 1
  Tim Berners-Lee
The key property of the WWW is its universality: One must be able to access it whatever the hardware device, software platform, and network one is using, and despite the disabilities one might have, and whether one is in a "developed" or "developing" country; it must support information of any language, culture, quality, medium, and field without discrimination so that a hypertext link can go anywhere; it must support information intended for people, and that intended for machine processing. The Web architecture incorporated various choices which support these axes of universality.
   Currently the architecture and the principles are being exploited in the recent Mobile Web initiative in W3C to promote content which can be accessed optimally from conventional computers and mobile devices. New exciting areas arise every few months as possible Semantic Web flagship applications. As new areas burst forth, the fundamental principles remain important and are extended and adjusted. At the same time, the principles of openness and consensus among international stakeholders which the WWW consortium employs for new technology are adjusted, but ever-important.

Usage analysis

Semantic similarity between search engine queries using temporal correlation BIBAKFull-Text 2-11
  Steve Chien; Nicole Immorlica
We investigate the idea of finding semantically related search engine queries based on their temporal correlation; in other words, we infer that two queries are related if their popularities behave similarly over time. To this end, we first define a new measure of the temporal correlation of two queries based on the correlation coefficient of their frequency functions. We then conduct extensive experiments using our measure on two massive query streams from the MSN search engine, revealing that this technique can discover a wide range of semantically similar queries. Finally, we develop a method of efficiently finding the highest correlated queries for a given input query using far less space and time than the naive approach, making real-time implementation possible.
Keywords: query stream analysis, search engines, semantic similarity among queries
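The measure described in the abstract is the correlation coefficient of two queries' frequency functions over time. As a toy illustration only (the paper's normalization and its space-efficient top-k search are not shown; the counts below are hypothetical):

```python
from math import sqrt

def pearson(x, y):
    """Correlation coefficient of two equal-length query-frequency series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical daily query counts over the same time window.
freq = {
    "flowers":        [120, 130, 800, 140, 125, 118, 122],
    "valentines day": [300, 340, 2500, 310, 290, 280, 295],
    "tax forms":      [90, 95, 92, 96, 91, 94, 89],
}

query = "flowers"
ranked = sorted(
    ((other, pearson(freq[query], f)) for other, f in freq.items() if other != query),
    key=lambda t: -t[1],
)
print(ranked)  # queries whose popularity curves move together rank highest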
Duplicate detection in click streams BIBAKFull-Text 12-21
  Ahmed Metwally; Divyakant Agrawal; Amr El Abbadi
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in the contexts of both sliding and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
Keywords: advertising networks, approximate queries, data streams, duplicate detection
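The abstract's core mechanism is a Bloom filter over previously seen click identifiers. A minimal sketch of that idea, purely illustrative (the paper's sliding- and landmark-window variants, sizing analysis, and hash choices are not reproduced here):

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives, never false negatives."""
    def __init__(self, m_bits=1 << 20, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def flag_duplicate_clicks(click_stream):
    """Yield click ids that are (probably) duplicates; useful e.g. for fraud screening."""
    seen = BloomFilter()
    for click_id in click_stream:
        if click_id in seen:
            yield click_id
        else:
            seen.add(click_id)
```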
Improving recommendation lists through topic diversification BIBAKFull-Text 22-32
  Cai-Nicolas Ziegler; Sean M. McNee; Joseph A. Konstan; Georg Lausen
In this work we present topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests. Although it is detrimental to average accuracy, we show that our method improves user satisfaction with recommendation lists, in particular for lists generated using the common item-based collaborative filtering algorithm.
   Our work builds upon prior research on recommender systems, looking at properties of recommendation lists as entities in their own right rather than specifically focusing on the accuracy of individual recommendations. We introduce the intra-list similarity metric to assess the topical diversity of recommendation lists and the topic diversification approach for decreasing the intra-list similarity. We evaluate our method using book recommendation data, including offline analysis on 361,349 ratings and an online study involving more than 2,100 subjects.
Keywords: accuracy, collaborative filtering, diversification, metrics, recommender systems
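The abstract describes an intra-list similarity metric and a diversification step that lowers it. A rough sketch of both, assuming an item-to-item similarity function `sim` is available; the greedy re-ranking below is an illustrative MMR-style stand-in, not the paper's exact taxonomy-based procedure or its merging factor:

```python
from itertools import combinations

def intra_list_similarity(items, sim):
    """Sum of pairwise similarities over the list (higher = less topical diversity)."""
    return sum(sim(a, b) for a, b in combinations(items, 2))

def diversify(candidates, sim, k, weight=0.5):
    """Greedy re-ranking: trade accuracy rank against similarity to already picked items.
    `candidates` is assumed to be sorted by predicted accuracy; `weight` is illustrative."""
    selected = [candidates[0]]            # keep the top-accuracy item first
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        def score(item):
            rank_score = -remaining.index(item)               # earlier = better accuracy
            redundancy = sum(sim(item, s) for s in selected)  # similarity to chosen items
            return weight * rank_score - (1 - weight) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```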

Wide-area architecture and protocols

GlobeDB: autonomic data replication for web applications BIBAKFull-Text 33-42
  Swaminathan Sivasubramanian; Gustavo Alonso; Guillaume Pierre; Maarten van Steen
We present GlobeDB, a system for hosting Web applications that performs autonomic replication of application data. GlobeDB offers data-intensive Web applications the benefits of low access latencies and reduced update traffic. The major distinction in our system compared to existing edge computing infrastructures is that the process of distribution and replication of application data is handled by the system automatically with very little manual administration. We show that significant performance gains can be obtained this way. Performance evaluations with the TPC-W benchmark over an emulated wide-area network show that GlobeDB reduces latencies by a factor of 4 compared to non-replicated systems and reduces update traffic by a factor of 6 compared to fully replicated systems.
Keywords: autonomic replication, data replication, edge services, performance
Hierarchical substring caching for efficient content distribution to low-bandwidth clients BIBAKFull-Text 43-53
  Utku Irmak; Torsten Suel
While overall bandwidth in the internet has grown rapidly over the last few years, and an increasing number of clients enjoy broadband connectivity, many others still access the internet over much slower dialup or wireless links. To address this issue, a number of techniques for optimized delivery of web and multimedia content over slow links have been proposed, including protocol optimizations, caching, compression, and multimedia transcoding, and several large ISPs have recently begun to widely promote dialup acceleration services based on such techniques. A recent paper by Rhea, Liang, and Brewer proposed an elegant technique called value-based caching that caches substrings of files, rather than entire files, and thus avoids repeated transmission of substrings common to several pages or page versions.
   We propose and study a hierarchical substring caching technique that provides significant savings over this basic approach. We describe several additional techniques for minimizing overheads and perform an evaluation on a large set of real web access traces that we collected. In the second part of our work, we compare our approach to a widely studied alternative approach based on delta compression, and show how to integrate the two for best overall performance. The studied techniques are typically employed in a client-proxy environment, with each proxy serving a large number of clients, and an important aspect is how to conserve resources on the proxy while exploiting the significant memory and CPU power available on current clients.
Keywords: HTTP, WWW, compression, web caching, web proxies
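The baseline the authors extend is value-based caching: the proxy sends fingerprints of content blocks the client already holds instead of the blocks themselves. A simplified fixed-size-block sketch of that baseline, under assumed block size and hashing choices (the paper's hierarchical substring scheme and overhead optimizations are not shown):

```python
import hashlib

BLOCK = 2048  # bytes; an assumption -- real systems choose boundaries more cleverly

def encode(payload: bytes, client_cache: set):
    """Proxy side: replace blocks the client is known to hold with short digests."""
    out = []
    for i in range(0, len(payload), BLOCK):
        block = payload[i:i + BLOCK]
        digest = hashlib.sha1(block).digest()
        if digest in client_cache:
            out.append(("ref", digest))      # not retransmitted
        else:
            out.append(("data", block))
            client_cache.add(digest)         # proxy tracks what the client now has
    return out

def decode(message, block_store: dict):
    """Client side: reassemble the page from literal blocks and cached blocks."""
    parts = []
    for kind, value in message:
        if kind == "data":
            block_store[hashlib.sha1(value).digest()] = value
            parts.append(value)
        else:
            parts.append(block_store[value])
    return b"".join(parts)

cache, store = set(), {}
page1 = encode(b"A" * 5000, cache)
assert decode(page1, store) == b"A" * 5000
page2 = encode(b"A" * 5000 + b"B" * 100, cache)   # repeated blocks travel as references
assert decode(page2, store) == b"A" * 5000 + b"B" * 100
```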
Executing incoherency bounded continuous queries at web data aggregators BIBAKFull-Text 54-65
  Rajeev Gupta; Ashish Puri; Krithi Ramamritham
Continuous queries are used to monitor changes to time varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some function over distributed data items, for example, to determine when and whether (a) the traffic entering a highway from multiple feed roads will result in congestion in a thoroughfare or (b) the value of a stock portfolio exceeds a threshold. Using the standard Web infrastructure for these applications will increase the reach of the underlying information. But, since these queries involve data from multiple sources, with sources supporting standard HTTP (pull-based) interfaces, special query processing techniques are needed. Also, these applications often have the flexibility to tolerate some incoherency, i.e., some differences between the results reported to the user and that produced from the virtual database made up of the distributed data sources.
   In this paper, we develop and evaluate client-pull-based techniques for refreshing data so that the results of the queries over distributed data can be correctly reported, conforming to the limited incoherency acceptable to the users.
   We model as well as estimate the dynamics of the data items using a probabilistic approach based on Markov Chains. Depending on the dynamics of data we adapt the data refresh times to deliver query results with the desired coherency. The commonality of data needs of multiple queries is exploited to further reduce refresh overheads. Effectiveness of our approach is demonstrated using live sources of dynamic data: the number of refreshes it requires is (a) an order of magnitude less than what we would need if every potential update is pulled from the sources, and (b) comparable to the number of messages needed by an ideal algorithm, one that knows how to optimally refresh the data from distributed data sources. Our evaluations also bring out a very practical and attractive tradeoff property of pull based approaches, e.g., a small increase in tolerable incoherency leads to a large decrease in message overheads.
Keywords: Markov model, coherency, continuous queries, fidelity, online decision making
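The pull-based refresh pattern adapts each item's time-to-refresh (TTR) to how fast the source changes, so the user's incoherency bound is respected with few messages. The paper models data dynamics with Markov chains; the sketch below substitutes a much cruder halve/back-off adaptation purely to illustrate the pattern, and `fetch`, the bound, and all constants are assumptions:

```python
import time

def poll_source(fetch, bound, ttr=2.0, ttr_min=0.5, ttr_max=60.0):
    """Keep a cached numeric item within `bound` of the source, adapting the poll interval."""
    cached = fetch()
    while True:
        time.sleep(ttr)
        fresh = fetch()
        if abs(fresh - cached) > bound:
            ttr = max(ttr_min, ttr / 2)     # changing fast: poll more often
        else:
            ttr = min(ttr_max, ttr + 0.5)   # stable: back off and save messages
        cached = fresh
        yield cached, ttr
```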

Data extraction

Fully automatic wrapper generation for search engines BIBAKFull-Text 66-75
  Hongkun Zhao; Weiyi Meng; Zonghuan Wu; Vijay Raghavan; Clement Yu
When a query is submitted to a search engine, the search engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to and/or snippet of a retrieved Web page. In addition, such a result page often also contains information irrelevant to the query, such as information related to the hosting site of the search engine and advertisements. In this paper, we present a technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is very important for many applications that need to interact with search engines such as automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features on the result page as displayed on a browser and the HTML tag structures of the HTML source file of the result page. Experimental results indicate that this technique can achieve very high extraction accuracy.
Keywords: information extraction, search engine, wrapper generation
Web data extraction based on partial tree alignment BIBAKFull-Text 76-85
  Yanhong Zhai; Bing Liu
This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.
Keywords: data extraction, data record extraction, wrapper
Thresher: automating the unwrapping of semantic content from the World Wide Web BIBAKFull-Text 86-95
  Andrew Hogue; David Karger
We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify examples of semantic content by highlighting them in a web browser and describing their meaning. We then use the tree edit distance between the DOM subtrees of these examples to create a general pattern, or wrapper, for the content, and allow the user to bind RDF classes and predicates to the nodes of these wrappers. By overlaying matches to these patterns on standard documents inside the Haystack semantic web browser, we enable a rich semantic interaction with existing web pages, "unwrapping" semantic data buried in the pages' HTML. By allowing end-users to create, modify, and utilize their own patterns, we hope to speed adoption and use of the Semantic Web and its applications.
Keywords: RDF, haystack, semantic Web, tree edit distance, wrapper induction

Keynote

The case for technology for developing regions BIBAFull-Text 96
  Eric A. Brewer
Moore's Law and the wave of technologies it enabled have led to tremendous improvements in productivity and the quality of life in the industrialized world. Yet, technology has had almost no effect on the four billion people that make less than US$2000 per year. In this talk I argue that the decreasing costs of computing and wireless networking make this the right time to spread the benefits of technology, and that the biggest missing piece is a lack of focus on the problems that matter, including health, education, and government. After covering some example applications that have shown very high impact, I take an early look at the research agenda for developing regions. Finally, I examine some of the pragmatic issues required to make progress on these very challenging problems. My goal is to convince high-tech researchers that technology for developing regions is an important and viable research topic.

Semantic querying

Ranking a stream of news BIBAKFull-Text 97-106
  Gianna M. Del Corso; Antonio Gullí; Francesco Romani
According to a recent survey by Nielsen NetRatings, searching news articles is one of the most important online activities. Indeed, Google, Yahoo, MSN and many others have proposed commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has focused on ranking a stream of news articles and a set of news sources. In this paper, we introduce this problem by proposing a ranking framework which models: (1) the process of generation of a stream of news articles, (2) the clustering of news articles by topics, and (3) the evolution of news stories over time. The proposed ranking algorithm ranks news information, finding the most authoritative news sources and identifying the most interesting events in the different categories to which news articles belong. All these ranking measures take time into account and can be obtained without a predefined sliding window of observation over the stream. The complexity of our algorithm is linear in the number of pieces of news still under consideration at the time of a new posting. This allows a continuous on-line process of ranking. Our ranking framework is validated on a collection of more than 300,000 pieces of news, produced in two months by more than 2,000 news sources belonging to 13 different categories (World, U.S., Europe, Sports, Business, etc.). This collection is extracted from the index of comeToMyHead, an academic news search engine available online.
Keywords: information extraction, news engines, news ranking
Algorithmic detection of semantic similarity BIBAKFull-Text 107-116
  Ana G. Maguitman; Filippo Menczer; Heather Roinestad; Alessandro Vespignani
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata -- namely topical directories -- to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
Keywords: Web mining, Web search, content and link similarity, ranking evaluation, semantic similarity
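For background, the classical taxonomy-only measure that this line of work generalizes is Lin's information-theoretic similarity, sim(t1, t2) = 2 log p(lca) / (log p(t1) + log p(t2)), where probabilities come from how many objects are classified under each topic. A small sketch of that tree-based baseline over a toy topic tree (the paper's extension to the non-hierarchical structure of an ontology graph such as ODP is not reproduced here):

```python
import math

# Toy topic tree: child -> parent, plus cumulative page counts per topic (assumed data).
parent = {"science": None, "biology": "science", "physics": "science",
          "genetics": "biology", "optics": "physics"}
pages  = {"science": 1000, "biology": 400, "physics": 350,
          "genetics": 120, "optics": 90}
total = pages["science"]

def ancestors(t):
    out = []
    while t is not None:
        out.append(t)
        t = parent[t]
    return out

def p(topic):
    return pages[topic] / total          # probability mass of the topic

def lin_similarity(t1, t2):
    common = [a for a in ancestors(t1) if a in set(ancestors(t2))]
    lca = common[0]                      # most specific shared ancestor
    if p(t1) == 1.0 and p(t2) == 1.0:
        return 1.0
    return 2 * math.log(p(lca)) / (math.log(p(t1)) + math.log(p(t2)))

print(lin_similarity("genetics", "optics"))   # low: the topics only share "science"
print(lin_similarity("genetics", "biology"))  # higher: "biology" is the LCA
```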
SemRank: ranking complex relationship search results on the semantic web BIBAKFull-Text 117-127
  Kemafor Anyanwu; Angela Maduko; Amit Sheth
While the idea that querying mechanisms for complex relationships (otherwise known as Semantic Associations) should be integral to Semantic Web search technologies has recently gained some ground, the issue of how search results will be ranked remains largely unaddressed. Since it is expected that the number of relationships between entities in a knowledge base will be much larger than the number of entities themselves, the likelihood that Semantic Association searches would result in an overwhelming number of results for users is increased, therefore elevating the need for appropriate ranking schemes. Furthermore, it is unlikely that ranking schemes for ranking entities (documents, resources, etc.) may be applied to complex structures such as Semantic Associations.
   In this paper, we present an approach that ranks results based on how predictable a result might be for users. It is based on a relevance model SemRank, which is a rich blend of semantic and information-theoretic techniques with heuristics that supports the novel idea of modulative searches, where users may vary their search modes to effect changes in the ordering of results depending on their need. We also present the infrastructure used in the SSARK system to support the computation of SemRank values for resulting Semantic Associations and their ordering.
Keywords: SemRank, discovery query, path expression tree, ranking complex relationships, semantic Web, semantic associations search, semantic match, semantic ranking, semantic relationship search, semantic similarity, semantic summary

Web services

A service creation environment based on end to end composition of Web services BIBAKFull-Text 128-137
  Vikas Agarwal; Koustuv Dasgupta; Neeran Karnik; Arun Kumar; Ashish Kundu; Sumit Mittal; Biplav Srivastava
The demand for quickly delivering new applications is increasingly becoming a business imperative today. Application development is often done in an ad hoc manner, without standard frameworks or libraries, thus resulting in poor reuse of software assets. Web services have received much interest in industry due to their potential in facilitating seamless business-to-business or enterprise application integration. A web services composition tool can help automate the process, from creating business process functionality, to developing executable workflows, to deploying them on an execution environment. However, we find that the main approaches taken thus far to standardize and compose web services are piecemeal and insufficient. The business world has adopted a (distributed) programming approach in which web service instances are described using WSDL, composed into flows with a language like BPEL and invoked with the SOAP protocol. Academia has propounded the AI approach of formally representing web service capabilities in ontologies, and reasoning about their composition using goal-oriented inferencing techniques from planning. We present the first integrated work in composing web services end to end from specification to deployment by synergistically combining the strengths of the above approaches. We describe a prototype service creation environment along with a use-case scenario, and demonstrate how it can significantly speed up the time-to-market for new services.
Keywords: Web services composition, planning, semantic Web
Ensuring required failure atomicity of composite Web services BIBAKFull-Text 138-147
  Sami Bhiri; Olivier Perrin; Claude Godart
The recent evolution of the Internet, driven by the Web services technology, is extending the role of the Web from a support of information interaction to a middleware for B2B interactions.
   Indeed, the Web services technology allows enterprises to outsource parts of their business processes using Web services. And it also provides the opportunity to dynamically offer new value-added services through the composition of pre-existing Web services.
   In spite of the growing interest in Web services, current technologies lack efficient transactional support for composite Web services (CSs).
   In this paper, we propose a transactional approach to ensure the failure atomicity of a CS as required by its partners. We use the Accepted Termination States (ATS) property as a means of expressing the required failure atomicity.
   Partners specify their CS, mainly its control flow, and the required ATS. Then, we use a set of transactional rules to assist designers in composing a CS that is valid with regard to the specified ATS.
Keywords: failure atomicity, reliable Web services compositions, transactional models
Web service interfaces BIBAKFull-Text 148-159
  Dirk Beyer; Arindam Chakrabarti; Thomas A. Henzinger
We present a language for specifying web service interfaces. A web service interface puts three kinds of constraints on the users of the service. First, the interface specifies the methods that can be called by a client, together with types of input and output parameters; these are called signature constraints. Second, the interface may specify propositional constraints on method calls and output values that may occur in a web service conversation; these are called consistency constraints. Third, the interface may specify temporal constraints on the ordering of method calls; these are called protocol constraints. The interfaces can be used to check, first, if two or more web services are compatible, and second, if a web service A can be safely substituted for a web service B. The algorithm for compatibility checking verifies that two or more interfaces fulfill each others' constraints. The algorithm for substitutivity checking verifies that service A demands fewer and fulfills more constraints than service B.
Keywords: Web service compatibility, Web service interfaces, Web service substitutivity, Web services, formal specification, formal verification

Web application design

Building adaptable and reusable XML applications with model transformations BIBAKFull-Text 160-169
  Ivan Kurtev; Klaas van den Berg
We present an approach in which the semantics of an XML language is defined by means of a transformation from an XML document model (an XML schema) to an application specific model. The application specific model implements the intended behavior of documents written in the language. A transformation is specified in a model transformation language used in the Model Driven Architecture (MDA) approach for software development. Our approach provides a better separation of three concerns found in XML applications: syntax, syntax processing logic and intended meaning of the syntax. It frees the developer of low-level syntactical details and improves the adaptability and reusability of XML applications. Declarative transformation rules and the explicit application model provide a finer control over the application parts affected by adaptations. Transformation rules and the application model for an XML language may be composed with the corresponding rules and application models defined for other XML languages. In that way we achieve reuse and composition of XML applications.
Keywords: MDA, XML, XML processing, model transformations, transformation language
Exception handling in workflow-driven Web applications BIBAKFull-Text 170-179
  Marco Brambilla; Stefano Ceri; Sara Comai; Christina Tziviskou
As the Web becomes a platform for implementing B2B applications, the need arises for Web conceptual models that describe Web-oriented workflow applications implementing business processes. In this context, new problems of process correctness arise, due to the loose control Web applications have over the behavior of their Web clients. Indeed, incoherent user behavior can lead to inconsistent processes.
   This paper presents a high-level approach to the management of exceptions that occur during the execution of processes on the Web. We present a classification of exceptions that can be raised inside workflow-driven Web applications, and recovery policies to retrieve coherent status and data after an exception. We devise these concepts at a high level and then exploit them using a Web modeling language (WebML) that in turn provides development facilities like automatic code generation, validation of hypertext models, and so on. An industrial implementation experience is also briefly presented.
Keywords: Web applications, exceptions, failure, navigation behavior, workflow
AwareDAV: a generic WebDAV notification framework and implementation BIBAKFull-Text 180-189
  Henning Qin Jehøj; Niels Olof Bouvin; Kaj Grønbæk
WebDAV needs awareness support in order to be a full-fledged collaboration system. This paper introduces AwareDAV, a new WebDAV extension framework enabling shared awareness through event notification. By extending the WebDAV protocol with seven new request methods and an extensible XML-based event subscription scheme, AwareDAV supports fine-grained event subscriptions over a range of transport mechanisms and enables a wide range of collaboration scenarios. This paper describes the design of AwareDAV, its API, experiences with its initial implementation, as well as a comparison with Microsoft Exchange and WebDAV-notify.
Keywords: AwareDAV, CSCW, WebDAV, event notification

Semantic Web

Learning domain ontologies for Web service descriptions: an experiment in bioinformatics BIBAKFull-Text 190-198
  Marta Sabou; Chris Wroe; Carole Goble; Gilad Mishne
The reasoning tasks that can be performed with semantic web service descriptions depend on the quality of the domain ontologies used to create these descriptions. However, building such domain ontologies is a time consuming and difficult task.
   We describe an automatic extraction method that learns domain ontologies for web service descriptions from textual documentations attached to web services. We conducted our experiments in the field of bioinformatics by learning an ontology from the documentation of the web services used in myGrid, a project that supports biology experiments on the Grid. Based on the evaluation of the extracted ontology in the context of the project, we conclude that the proposed extraction method is a helpful tool to support the process of building domain ontologies for web service descriptions.
Keywords: OWL-S, Web services, bioinformatics, domain ontology, ontology evaluation, ontology learning, semantic Web
Making RDF presentable: integrated global and local semantic Web browsing BIBAKFull-Text 199-206
  Lloyd Rutledge; Jacco van Ossenbruggen; Lynda Hardman
This paper discusses generating document structure from annotated media repositories in a domain-independent manner. This approaches the vision of a universal RDF browser. We start by applying the search-and-browse paradigm established for the WWW to RDF presentation. Furthermore, this paper adds to this paradigm the clustering-based derivation of document structure from search returns, providing simple but domain-independent hypermedia generation from RDF stores. While such generated presentations hardly meet the standards of those written by humans, they provide quick access to media repositories when the required document has not yet been written. The resulting system allows a user to specify a topic for which it generates a hypermedia document providing guided navigation through virtually any RDF repository. The impact for content providers is that as soon as one adds new media items and their annotations to a repository, they become immediately available for automatic integration into subsequently requested presentations.
Keywords: RDF, browsing, clustering, hypermedia generation, media archives, search, semantic Web

Applications

Shared lexicon for distributed annotations on the Web BIBAKFull-Text 207-214
  Paolo Avesani; Marco Cova
The interoperability among distributed and autonomous systems is the ultimate challenge facing the semantic web. Heterogeneity of data representation is the main source of problems. This paper proposes an innovative solution that combines lexical approaches and language games. The benefits for distributed annotation systems on the web are twofold: firstly, it will reduce the complexity of the semantic problem by moving the focus from the full-featured ontology level to the simpler lexicon level; secondly, it will avoid the drawback of a centralized third party mediator that may become a single point of failure.
   The main contributions of this work are concerned with:
  • providing a proof of concept that language games can be an effective solution
       to creating and managing a distributed process of agreement on a shared
       lexicon,
  • describing a fully distributed service oriented architecture for language
       games,
  • providing empirical evidence on a real world case study in the domain of ski
       mountaineering.
Keywords: distributed annotations, emergent semantics, interoperability, language games
Using XForms to simplify Web programming BIBAKFull-Text 215-224
  Richard Cardone; Danny Soroker; Alpana Tiwari
    The difficulty of developing and deploying commercial web applications increases as the number of technologies they use increases and as the interactions between these technologies become more complex. This paper describes a way to avoid this increasing complexity by re-examining the basic requirements of web applications. Our approach is to first separate client concerns from server concerns, and then to reduce the interaction between client and server to its most elemental: parameter passing. We define a simplified programming model for form-based web applications and we use XForms and a subset of J2EE as enabling technologies. We describe our implementation of an MVC-based application builder for this model, which automatically generates the code needed to marshal input and output data between clients and servers. This marshalling uses type checking and other forms of validation on both clients and servers. We also show how our programming model and application builder support the customization of web applications for different execution targets, including, for example, different client devices.
    Keywords: J2EE, MVC, Web application, XForms, XMLBeans, eclipse, visual builder
    Web-assisted annotation, semantic indexing and search of television and radio news BIBAKFull-Text 225-234
      Mike Dowman; Valentin Tablan; Hamish Cunningham; Borislav Popov
    We describe the Rich News system, which can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web news sites gives the opposite: conceptual accuracy but no temporal data. Our approach combines the two for temporally accurate conceptual semantic annotation of broadcast news. First, low-quality transcripts of the broadcasts are produced using speech recognition, and these are then automatically divided into sections corresponding to individual news stories. A key-phrase extraction component finds key phrases for each story and uses these to search for web pages reporting the same event. The text and meta-data of the web pages are then used to create index documents for the stories in the original broadcasts, which are semantically annotated using the KIM knowledge management platform. A web interface then allows conceptual search and browsing of news stories, and playing of the parts of the media files corresponding to each news story. The use of material from the World Wide Web allows much higher quality textual descriptions and semantic annotations to be produced than would have been possible using the ASR transcript directly. The semantic annotations can form a part of the Semantic Web, and an evaluation shows that the system operates with high precision, and with a moderate level of recall.
    Keywords: Web search, automatic speech recognition, key-phrase extraction, media archiving, multimedia, natural language processing, semantic Web, semantic annotation, topical segmentation

    Indexing and querying

    Improving Web search efficiency via a locality based static pruning method BIBAKFull-Text 235-244
      Edleno S. de Moura; Célia F. dos Santos; Daniel R. Fernandes; Altigran S. Silva; Pavel Calado; Mario A. Nascimento
    The unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted for search engines that addresses the latter issue without disregarding the former. To this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the indices' size with no compression at all, the compression rate is higher than 88%, i.e., less than one eighth of the original size. More importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. The new method yields significant improvements when compared against the best known static pruning method for search engine indices. In addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine.
    Keywords: indexing, information retrieval, pruning, search engines, web search
    Sampling search-engine results BIBAKFull-Text 245-256
      Aris Anagnostopoulos; Andrei Z. Broder; David Carmel
    We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
  • Determining the set of categories in a given taxonomy spanned by the search
       results;
  • Finding the range of metadata values associated to the result set in order to
       enable "multi-faceted search;"
  • Estimating the size of the result set;
  • Data mining associations to the query terms.
       We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.)
       Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods.
       Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
    Keywords: WAND, sampling, search engines, weighted AND
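The primitive described in the abstract is a next(p) method that walks a posting list while keeping each posting only with probability p, with Boolean operators built on top of it. A simplified document-at-a-time sketch of the primitive and a conjunction over two sampled lists; note that this naive AND keeps each matching document with probability p·p, whereas the paper derives proper next(p) constructions for AND/OR/WAND and uses geometric skips rather than per-posting coin flips:

```python
import random

class SampledPostings:
    """Wraps a sorted posting list; next(target) returns the first *sampled*
    docid >= target, or None when the list is exhausted."""
    def __init__(self, docids, p, seed=0):
        self.docids, self.p = docids, p
        self.rng = random.Random(seed)
        self.i = 0

    def next(self, target):
        while self.i < len(self.docids):
            d = self.docids[self.i]
            self.i += 1
            if d >= target and self.rng.random() < self.p:
                return d
        return None

def sampled_and(list_a, list_b):
    """Document-at-a-time zig-zag AND over two sampled posting lists."""
    hits, target = [], 0
    while True:
        a = list_a.next(target)
        if a is None:
            return hits
        b = list_b.next(a)
        if b is None:
            return hits
        if a == b:
            hits.append(a)
            target = a + 1
        else:
            target = b   # advance to the larger candidate and retry
```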
Three-level caching for efficient query processing in large Web search engines BIBAKFull-Text 257-266
  Xiaohui Long; Torsten Suel
    Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level.
       We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
    Keywords: Web search, caching, inverted index
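The intermediate caching level stores intersections (or projections) of inverted lists for frequently co-occurring term pairs, so a repeated pair does not require re-intersecting the full per-term lists. A minimal sketch of that idea with a plain LRU policy; the paper studies far more careful weighted admission and eviction policies, and `fetch_postings` is an assumed accessor:

```python
from collections import OrderedDict

class PairIntersectionCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.cache = OrderedDict()        # frozenset({t1, t2}) -> sorted docid list

    def intersect(self, t1, t2, fetch_postings):
        key = frozenset((t1, t2))
        if key in self.cache:
            self.cache.move_to_end(key)   # cache hit: no list intersection needed
            return self.cache[key]
        a, b = fetch_postings(t1), fetch_postings(t2)
        result = sorted(set(a) & set(b))
        self.cache[key] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used pair
        return result

# Usage: a multi-term query consults this cache for each frequent term pair
# before touching the (much larger) per-term inverted lists.
```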

    Keynote

    Innovation for a human-centered network: NTT's R&D activities for achieving the NTT group's medium-term management strategy BIBAFull-Text 267
      Yuji Inoue
    This talk presents NTT's approach for realizing a Human-Centered Network. Last November, we announced the NTT Group's Medium-Term Management Strategy, which consists of three management objectives: (1) building the ubiquitous broadband market and helping achieve the e-Japan Strategy and the u-Japan Initiative; (2) building a safe, secure, and convenient communications network environment and broadband access infrastructure, while achieving a seamless migration from the legacy telephone network to the next generation network; and (3) striving to increase corporate value and achieve sustainable growth. Since the management strategy takes account of Japan's future social issues such as declining birthrate and aging population, the need to reduce the environmental load, etc, we believe that the R&D activities directed towards accomplishing these objectives consequently lead to the realization of a Human-Centered Network.

    XML query and programming languages

    Sub-document queries over XML with XSQirrel BIBAKFull-Text 268-277
      Arnaud Sahuguet; Bogdan Alexe
    This paper describes XSQirrel, a new XML query language that transforms a document into a sub-document, i.e. a tree where the root-to-leaf paths are a subset of the root-to-leaf paths from the original document.
       We show that this type of query is extremely useful for various applications (e.g. web services) and that currently existing query languages are poorly equipped to express, reason about, and evaluate such queries. In particular, we emphasize the need to be able to compose such queries. We present the XSQirrel language with its syntax, semantics and two language-specific operators, union and composition.
       For the evaluation of the language, we leverage well established query technologies by translating XSQirrel expressions into XPath programs, XQuery queries or XSLT stylesheets.
       We provide some experimental results that compare our various evaluation strategies. We also show the runtime benefits of query composition over sequential evaluation.
    Keywords: XML, XSQirrel, sub-document
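An XSQirrel query maps a document to a sub-document whose root-to-leaf paths are a subset of the original's. The sketch below illustrates that semantics directly by pruning an ElementTree to a set of simple tag paths; the actual XSQirrel syntax, its union and composition operators, and the XPath/XQuery/XSLT translations are not reproduced, and the example document is invented:

```python
import xml.etree.ElementTree as ET

def prune(elem, paths, prefix=()):
    """Return a copy of `elem` keeping only elements whose tag path is a prefix
    of one of the requested root-to-leaf paths."""
    here = prefix + (elem.tag,)
    if not any(p[:len(here)] == here for p in paths):
        return None
    copy = ET.Element(elem.tag, elem.attrib)
    copy.text = elem.text
    for child in elem:
        kept = prune(child, paths, here)
        if kept is not None:
            copy.append(kept)
    return copy

doc = ET.fromstring(
    "<order><customer><name>Ann</name><card>1234</card></customer>"
    "<items><item>book</item></items></order>")
# Keep only the customer's name and the items; the card number is dropped.
sub = prune(doc, {("order", "customer", "name"), ("order", "items", "item")})
print(ET.tostring(sub).decode())
```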
    XJ: facilitating XML processing in Java BIBAKFull-Text 278-287
      Matthew Harren; Mukund Raghavachari; Oded Shmueli; Michael G. Burke; Rajesh Bordawekar; Igor Pechtchanski; Vivek Sarkar
    The increased importance of XML as a data representation format has led to several proposals for facilitating the development of applications that operate on XML data. These proposals range from runtime API-based interfaces to XML-based programming languages. The subject of this paper is XJ, a research language that proposes novel mechanisms for the integration of XML as a first-class construct into Java. The design goals of XJ distinguish it from past work on integrating XML support into programming languages -- specifically, the XJ design adheres to the XML Schema and XPath standards. Moreover, it supports in-place updates of XML data thereby keeping with the imperative nature of Java. We have built a prototype compiler for XJ, and our preliminary experiments demonstrate that the performance of XJ programs can approach that of traditional low-level API-based interfaces, while providing a higher level of abstraction.
    Keywords: Java, XML, language design
    XQuery containment in presence of variable binding dependencies BIBAKFull-Text 288-297
      Li Chen; Elke A. Rundensteiner
    Semantic caching is an important technology for improving the response time of future user queries specified over remote servers. This paper deals with the fundamental query containment problem in an XQuery-based semantic caching system. To the best of our knowledge, the impact on query containment of the subtle differences in XQuery semantics caused by different ways of specifying variables has not yet been studied. We introduce the concept of variable binding dependencies for representing the hierarchical element dependencies preserved by an XQuery. We analyze the problem of XQuery containment in the presence of such dependencies. We propose a containment mapping technique for nested XQuery in the presence of variable binding dependencies. The implication of the nested block structure on XQuery containment is also considered. We mention the performance gains achieved by a semantic caching system we built based on the proposed technique.
    Keywords: XQuery containment, variable binding dependency

    Web-based educational applications

    eBag: a ubiquitous Web infrastructure for nomadic learning BIBAKFull-Text 298-306
      Christina Brodersen; Bent G. Christensen; Kaj Grønbæk; Christian Dindler; Balasuthas Sundararajah
    This paper describes the eBag infrastructure, a generic infrastructure inspired by work with school children who could benefit from an electronic schoolbag for collaborative handling of their digital material. The eBag infrastructure utilizes the context-aware HyCon framework and collaborative web services based on WebDAV. A ubiquitous login and logout mechanism has been built based on Bluetooth sensor networks. The eBag infrastructure has been tried out in field tests with school kids. In this paper we discuss experiences and design issues for ubiquitous Web integration in interactive school environments with multiple interactive whiteboards and workstations. This includes proposals for specialized and adaptive XLink structures for organizing school materials as well as issues in login/logout based on proximity to different display surfaces.
    Keywords: HyCon, WebDAV, XLink, adaptive, context-aware, eBag
    Online curriculum on the semantic Web: the CSD-UoC portal for peer-to-peer e-learning BIBAKFull-Text 307-314
      Dimitris Kotzinos; Sofia Pediaditaki; Apostolos Apostolidis; Nikolaos Athanasis; Vassilis Christophides
    Online Curriculum Portals aim to support networks of instructors and learners by providing a space of convergence for enhancing peer-to-peer learning interactions among individuals of an educational institution. To this end, effective, open and scalable e-learning systems are required to acquire, store, and share knowledge in the form of learning objects (LOs). In this paper, we are interested in exploiting the semantic relationships that characterize these LOs (e.g., prerequisite, part-of or see-also) in order to capture and access individual and group knowledge in conjunction with the learning processes supported by educational institutions. To achieve this functionality, Semantic Web languages (e.g., RDF/S) and declarative query languages (e.g., RQL) are employed to represent LOs and their relationships (e.g., LOM), as well as to support navigation of the conceptual e-learning Portal space. In this way, different LOs could be presented to the same learners, according to the traversed schema navigation paths (i.e., learning paths). Using the Apache Jetspeed framework we are able to generate and assemble at run-time portlets (i.e., pluggable web components) for visualizing personalized views as dynamic web pages. Last but not least, both learners and instructors can employ the same Portal GUI for updating semantically described LOs and thus support an open-ended continuum of learning. To the best of our knowledge, the work presented in this paper is the first Online Curriculum Portal platform supporting the aforementioned functionality.
    Keywords: IEEE-LOM, e-learning portals, JetSpeed portlets, semantic Web
    The classroom sentinel: supporting data-driven decision-making in the classroom BIBAKFull-Text 315-321
      Mark K. Singley; Richard B. Lam
    Whereas schools typically record mounds of data regarding student performance, attendance, and other behaviors over the course of a school year, rarely is that data consulted and used to inform day-to-day instructional practice in the classroom. As teachers come under increasing pressure to ensure success for all of their students, we are attempting to provide tools to help teachers make sense of what is happening in their classrooms and take appropriate proactive and/or remedial action. One such tool is a Web service we've dubbed the Classroom Sentinel. The Classroom Sentinel mines electronic gradebook and other student information system data sources to detect critical teaching and learning patterns and bring those patterns to the attention of the teacher in the form of timely alerts. In this paper, we introduce the notion of classroom patterns, present some examples, and describe a framework for alert generation and delivery.
    Keywords: alert generation, classroom pattern detection, data integration, data-driven decision making, teacher cognition

    Text analysis and extraction

    Topic segmentation of message hierarchies for indexing and navigation support BIBAKFull-Text 322-331
      Jong Wook Kim; K. Selçuk Candan; Mehmet E. Dönderler
    Message hierarchies in web discussion boards grow with new postings. Threads of messages evolve as new postings focus within or diverge from the original themes of the threads. Thus, just by investigating the subject headings or contents of earlier postings in a message thread, one may not be able to guess the contents of the later postings. The resulting navigation problem is further compounded for blind users who need the help of a screen reader program that can provide only a linear representation of the content. We see that, in order to overcome the navigation obstacle for blind as well as sighted users, it is essential to develop techniques that help identify how the content of a discussion board grows through generalizations and specializations of topics. This knowledge can be used in segmenting the content in coherent units and guiding the users through segments relevant to their navigational goals. Our experimental results showed that the segmentation algorithm described in this paper provides up to 80-85% success rate in labeling messages. The algorithm is being deployed in a software system to reduce the navigational load of blind students in accessing web-based electronic course materials; however, we note that the techniques are equally applicable for developing web indexing and summarization tools for users with sight.
    Keywords: assistive technology for blind users, discussion boards, navigational aid, segmentation
    Gimme' the context: context-driven automatic semantic annotation with C-PANKOW BIBAKFull-Text 332-341
      Philipp Cimiano; Günter Ladwig; Steffen Staab
    Without the proliferation of formal semantic annotations, the Semantic Web is certainly doomed to failure. In earlier work we presented a new paradigm to avoid this: the 'Self Annotating Web', in which globally available knowledge is used to annotate resources such as web pages. In particular, we presented a concrete method instantiating this paradigm, called PANKOW (Pattern-based ANnotation through Knowledge On the Web). In PANKOW, a named entity to be annotated is put into several linguistic patterns that convey competing semantic meanings. The patterns that are matched most often on the Web indicate the meaning of the named entity -- leading to automatic or semi-automatic annotation.
       In this paper we present C-PANKOW (Context-driven PANKOW), which alleviates several shortcomings of PANKOW. First, by downloading abstracts and processing them off-line, we avoid the generation of a large number of linguistic patterns and a correspondingly large number of Google queries.
       Second, by linguistically analyzing and normalizing the downloaded abstracts, we increase the coverage of our pattern matching mechanism and overcome several limitations of the earlier pattern generation process. Third, we use the annotation context in order to distinguish the significance of a pattern match for the given annotation task. Our experiments show that C-PANKOW inherits all the advantages of PANKOW (no training required etc.), but in addition it is far more efficient and effective.
    Keywords: annotation, information extraction, metadata, semantic Web
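PANKOW-style annotation counts how often an entity appears in instance-indicating lexical patterns together with each candidate concept and picks the concept with the most evidence; C-PANKOW applies this counting to downloaded abstracts rather than issuing one web query per pattern. A toy sketch of the counting step only, with an illustrative pattern set and without the linguistic normalization and context weighting described in the paper:

```python
import re
from collections import Counter

PATTERNS = [
    "{concept}s such as {entity}",
    "{entity} is a {concept}",
    "{entity} and other {concept}s",
]

def score_concepts(entity, candidate_concepts, texts):
    """Count pattern matches of (entity, concept) pairs in the given texts."""
    scores = Counter()
    corpus = " ".join(t.lower() for t in texts)
    for concept in candidate_concepts:
        for pat in PATTERNS:
            phrase = pat.format(entity=entity.lower(), concept=concept.lower())
            scores[concept] += len(re.findall(re.escape(phrase), corpus))
    return scores.most_common()

texts = ["Hotels such as the Ritz are popular.",
         "The Ritz is a hotel in the city centre.",
         "Rivers such as the Nile flood regularly."]
print(score_concepts("the Ritz", ["hotel", "river"], texts))
# [('hotel', 2), ('river', 0)]
```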
    Opinion observer: analyzing and comparing opinions on the Web BIBAKFull-Text 342-351
      Bing Liu; Minqing Hu; Junsheng Cheng
    The Web has become an excellent source for gathering consumer opinions. There are now numerous Web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This paper focuses on online customer reviews of products. It makes two contributions. First, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. A prototype system called Opinion Observer is also implemented. The system is such that with a single glance at its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. This comparison is useful to both potential customers and product manufacturers. For a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. For a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. Second, a new technique based on language pattern mining is proposed to extract product features from Pros and Cons in a particular type of reviews. Such features form the basis for the above comparison. Experimental results show that the technique is highly effective and outperforms existing methods significantly.
    Keywords: information extraction, opinion analysis, sentiment analysis, visualization

    Keynote

    Towards usable Web privacy and security BIBAFull-Text 352
      Lorrie Faith Cranor
    Internet users now rely on a whole arsenal of tools to protect their security and privacy. Experts recommend that computer users install personal firewalls, anti-virus software, spyware blockers, spam filters, cookie managers, and a variety of other tools to keep themselves safe. Users are told to pick hard-to-guess passwords, use a different password at every Web site, and not to write any of their passwords down. They are told to read privacy policies before providing personal information to Web sites, look for lock icons before typing in a credit card number, refrain from opening email attachments from people they don't know, and even to think twice about opening email attachments from people they do know. With so many do's and don'ts, it is not surprising that much of this advice is ignored. In this talk I will highlight usability problems that make it difficult for people to protect their privacy and security on the Web, and I will discuss a number of approaches to addressing these problems.

    Web engineering with semantic annotation

    Accessibility: a Web engineering approach BIBAKFull-Text 353-362
      Peter Plessers; Sven Casteleyn; Yeliz Yesilada; Olga De Troyer; Robert Stevens; Simon Harper; Carole Goble
    Currently, the vast majority of web sites do not support accessibility for visually impaired users. Usually, these users have to rely on screen readers: applications that sequentially read the content of a web page in audio. Unfortunately, screen readers are not able to detect the meaning of the different page objects, and thus the implicit semantic knowledge conveyed in the presentation of the page is lost. One approach described in literature to tackle this problem, is the Dante approach, which allows semantic annotation of web pages to provide screen readers with extra (semantic) knowledge to better facilitate the audio presentation of a web page. Until now, such annotations were done manually, and failed for dynamic pages. In this paper, we combine the Dante approach with a web design method, WSDM, to fully automate the generation of the semantic annotation for visually impaired users. To do so, the semantic knowledge gathered during the design process is exploited, and the annotations are generated as a by-product of the design process, requiring no extra effort from the designer.
    Keywords: Web engineering, accessibility, semantic Web, visual impairment
    A multilingual usage consultation tool based on internet searching: more than a search engine, less than QA BIBAKFull-Text 363-371
      Kumiko Tanaka-Ishii; Hiroshi Nakagawa
    We present a usage consultation tool, based on Internet searching, for language learners. When a user enters a string of words for which he wants to find usages, the system sends this string as a query to a search engine and obtains search results about the string. The usages are extracted by performing statistical analysis on snippets and then fed back to the user.
       Unlike existing tools, this usage consultation tool is multi-lingual, so that usages can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed that usages can be obtained more effectively than by only using a search engine directly. We have also found that, as the amount of data downloaded from the search engine is increased, the prominent usages obtained do not depend on the particular search engine.
    Keywords: question answering, text mining, usage consultation
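    As a rough illustration of extracting usages from snippets, the sketch below simply counts the words that immediately follow the query string; the snippets are made up, and the paper's actual statistical analysis over search-engine results is considerably more involved.

      import re
      from collections import Counter

      def following_words(query, snippets, top=5):
          """Count the words that immediately follow `query` in search snippets."""
          pattern = re.compile(re.escape(query) + r"\s+(\w+)", re.IGNORECASE)
          counts = Counter()
          for snippet in snippets:
              counts.update(m.group(1).lower() for m in pattern.finditer(snippet))
          return counts.most_common(top)

      # Made-up snippets standing in for search-engine results.
      snippets = [
          "He was charged with fraud last year.",
          "The company was charged with negligence.",
          "Officials say both men were charged with fraud.",
      ]
      print(following_words("charged with", snippets))   # [('fraud', 2), ('negligence', 1)]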
    Improving portlet interoperability through deep annotation BIBAKFull-Text 372-381
      Oscar Díaz; Jon Iturrioz; Arantza Irastorza
    Portlets (i.e. multi-step, user-facing applications to be syndicated within a portal) are currently supported by most portal frameworks. However, there is not yet a definitive answer to portlet interoperation whereby data flows smoothly from one portlet to a neighbouring one. Both data-based and API-based approaches exhibit some drawbacks in either the limitation of the sharing scope or the standardization effort required. We argue that these limitations can be overcome by using deep annotation techniques. By providing additional markup about the background services, deep annotation strives to interact with these underlying services rather than with the HTML surface that conveys the markup. In this way, the portlet producer can extend a portlet markup, a fragment, with data about the processes whose rendering this fragment supports. Then, the portlet consumer (e.g. a portal) can use deep annotation to map an output process in fragment A to an input process in fragment B. This mapping results in fragment B having its input form (or other "input" widget) filled in. We consider deep annotation as particularly valid for portlet interoperation due to the controlled and cooperative environment that characterizes the portal setting.
    Keywords: data-flow, deep-annotation, event, portal ontology, portlet interoperability

    User-focused search and crawling

    CubeSVD: a novel approach to personalized Web search BIBAFull-Text 382-390
      Jian-Tao Sun; Hua-Jun Zeng; Huan Liu; Yuchang Lu; Zheng Chen
    As competition in the Web search market increases, there is a high demand for personalized Web search to conduct retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted every day, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By performing analysis on these data, we attempt to discover Web users' interests and the patterns by which users locate information. In this paper, a novel approach CubeSVD is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.
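    The core computation can be sketched as a truncated higher-order SVD of a small user x query x page tensor; the toy clickthrough triples and rank choices below are assumptions, and this is only an outline of the idea rather than the paper's full CubeSVD procedure.

      import numpy as np

      def unfold(T, mode):
          """Matricize a 3-way tensor along `mode` (0=user, 1=query, 2=page)."""
          return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

      def mode_multiply(T, M, mode):
          """Multiply tensor T by matrix M along `mode`."""
          moved = np.moveaxis(T, mode, 0)
          return np.moveaxis(np.tensordot(M, moved, axes=([1], [0])), 0, mode)

      def cube_svd(T, ranks):
          """Truncated HOSVD: keep the leading singular vectors of each unfolding."""
          factors = []
          for mode in range(3):
              U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
              factors.append(U[:, :ranks[mode]])
          core = T
          for mode in range(3):
              core = mode_multiply(core, factors[mode].T, mode)
          approx = core                      # reconstruct a smoothed tensor whose
          for mode in range(3):              # entries suggest unobserved associations
              approx = mode_multiply(approx, factors[mode], mode)
          return approx

      # Toy clickthrough counts: T[user, query, page] += 1 per observed click.
      T = np.zeros((3, 3, 3))
      for u, q, p in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 2, 2), (2, 2, 2)]:
          T[u, q, p] += 1.0
      print(np.round(cube_svd(T, ranks=(2, 2, 2)), 2))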
    Automatic identification of user goals in Web search BIBAKFull-Text 391-400
      Uichin Lee; Zhenyu Liu; Junghoo Cho
    There has been recent interest in studying the "goal" behind a user's Web query, so that this goal can be used to improve the quality of a search engine's results. Previous studies have mainly focused on using manual query-log investigation to identify Web query goals. In this paper we study whether and how we can automate this goal-identification process. We first present our results from a human subject study that strongly indicate the feasibility of automatic query-goal identification. We then propose two types of features for the goal-identification task: user-click behavior and anchor-link distribution. Our experimental evaluation shows that by combining these features we can correctly identify the goals for 90% of the queries studied.
    Keywords: Web search, query classification, user goals
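    One simple click-behavior feature can be sketched directly: the entropy of a query's click distribution, which tends to be low for navigational goals and high for informational ones. The click logs and the interpretive comments below are hypothetical, and the paper combines such features with anchor-link distributions.

      import math
      from collections import Counter

      def click_entropy(clicked_urls):
          """Entropy (in bits) of the click distribution observed for one query."""
          counts = Counter(clicked_urls)
          total = sum(counts.values())
          return -sum((c / total) * math.log2(c / total) for c in counts.values())

      # Hypothetical click logs for two queries.
      print(click_entropy(["microsoft.com"] * 95 + ["msn.com"] * 5))   # ~0.29: navigational
      print(click_entropy(["a.com", "b.org", "c.net", "d.edu"] * 25))  # 2.0: informational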
    User-centric Web crawling BIBAKFull-Text 401-411
      Sandeep Pandey; Christopher Olston
    Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web.
       In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
    Keywords: Web crawling, Web page refreshing, user-centric
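    A toy version of such usage-driven scheduling is sketched below: pages are ordered by how much their staleness is expected to hurt the queries that actually reach them. The scoring function, query frequencies, and relevance estimates are hypothetical stand-ins for the paper's user-centric metric.

      def refresh_priorities(pages, query_freq, cached_score, live_score):
          """Order pages for re-downloading by expected impact on search users.

          priority(p) = sum over queries q of freq(q) * |live score - cached score|;
          in practice the live scores would themselves have to be estimated.
          """
          priority = {
              p: sum(query_freq[q] * abs(live_score[(p, q)] - cached_score[(p, q)])
                     for q in query_freq)
              for p in pages
          }
          return sorted(pages, key=priority.get, reverse=True)

      # Hypothetical data: p1 is badly stale for a very frequent query.
      query_freq = {"news": 1000, "obscure": 2}
      cached = {("p1", "news"): 0.2, ("p1", "obscure"): 0.9,
                ("p2", "news"): 0.8, ("p2", "obscure"): 0.1}
      live = {("p1", "news"): 0.9, ("p1", "obscure"): 0.9,
              ("p2", "news"): 0.8, ("p2", "obscure"): 0.8}
      print(refresh_priorities(["p1", "p2"], query_freq, cached, live))   # ['p1', 'p2']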

    Trustworthy Web sites

    An abuse-free fair contract signing protocol based on the RSA signature BIBAKFull-Text 412-421
      Guilin Wang
    A fair contract signing protocol allows two potentially mistrusted parties to exchange their commitments (i.e., digital signatures) to an agreed contract over the Internet in a fair way, so that either each of them obtains the other's signature, or neither party does. Based on the RSA signature scheme, a new digital contract signing protocol is proposed in this paper. Like the existing RSA-based solutions for the same problem, our protocol is not only fair, but also optimistic, since the third trusted party is involved only in the situations where one party is cheating or the communication channel is interrupted. Furthermore, the proposed protocol satisfies a new property, i.e., it is abuse-free. That is, if the protocol is executed unsuccessfully, none of the two parties can show the validity of intermediate results to others. Technical details are provided to analyze the security and performance of the proposed protocol. In summary, we present the first abuse-free fair contract signing protocol based on the RSA signature, and show that it is both secure and efficient.
    Keywords: RSA, contract signing, cryptographic protocols, digital signatures, e-commerce, fair-exchange, security
    TrustGuard: countering vulnerabilities in reputation management for decentralized overlay networks BIBAFull-Text 422-431
      Mudhakar Srivatsa; Li Xiong; Ling Liu
    Reputation systems have been popular in estimating the trustworthiness and predicting the future behavior of nodes in a large-scale distributed system where nodes may transact with one another without prior knowledge or experience. One of the fundamental challenges in distributed reputation management is to understand vulnerabilities and develop mechanisms that can minimize the potential damages to a system by malicious nodes. In this paper, we identify three vulnerabilities that are detrimental to decentralized reputation management and propose TrustGuard -- a safeguard framework for providing a highly dependable and yet efficient reputation system. First, we provide a dependable trust model and a set of formal methods to handle strategic malicious nodes that continuously change their behavior to gain unfair advantages in the system. Second, a transaction based reputation system must cope with the vulnerability that malicious nodes may misuse the system by flooding feedbacks with fake transactions. Third, but not least, we identify the importance of filtering out dishonest feedbacks when computing reputation-based trust of a node, including the feedbacks filed by malicious nodes through collusion. Our experiments show that, compared with existing reputation systems, our framework is highly dependable and effective in countering malicious nodes regarding strategic oscillating behavior, flooding malevolent feedbacks with fake transactions, and dishonest feedbacks.
    Static approximation of dynamically generated Web pages BIBAKFull-Text 432-441
      Yasuhiko Minamide
    Server-side programming is one of the key technologies that support today's WWW environment. It makes it possible to generate Web pages dynamically according to a user's request and to customize pages for each user. However, the flexibility obtained by server-side programming makes it much harder to guarantee validity and security of dynamically generated pages.
       To check statically the properties of Web pages generated dynamically by a server-side program, we develop a static program analysis that approximates the string output of a program with a context-free grammar. The approximation obtained by the analyzer can be used to check various properties of a server-side program and the pages it generates.
       To demonstrate the effectiveness of the analysis, we have implemented a string analyzer for the server-side scripting language PHP. The analyzer is successfully applied to publicly available PHP programs to detect cross-site scripting vulnerabilities and to validate pages they generate dynamically.
    Keywords: HTML validation, context-free grammars, cross-site scripting, server-side scripting, static analysis

    Semantic search

    A search engine for natural language applications BIBAKFull-Text 442-452
      Michael J. Cafarella; Oren Etzioni
    Many modern natural language processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries -- they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries resulting in unnecessary search engine load and in slow applications with limited scalability.
       In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful <noun>" BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head(<NounPhrase>))", BE will return a list of proper nouns likely to be city names.
       BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
    Keywords: corpus, indexing, information extraction, language, query, search engine, variables
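    The flavor of the neighborhood index can be sketched in memory: every term occurrence stores its right-hand neighbor together with that neighbor's part-of-speech tag, so a query such as "powerful <noun>" becomes a single lookup plus a frequency count. The hand-tagged corpus is an assumption, and BE's real index is disk-resident and far more elaborate.

      from collections import defaultdict, Counter

      # Hand-tagged toy corpus: sentences as lists of (token, part_of_speech).
      corpus = [
          [("a", "DET"), ("powerful", "ADJ"), ("engine", "NOUN")],
          [("powerful", "ADJ"), ("explosives", "NOUN"), ("were", "VERB"), ("found", "VERB")],
          [("the", "DET"), ("powerful", "ADJ"), ("engine", "NOUN"), ("roared", "VERB")],
      ]

      def build_neighborhood_index(corpus):
          """Map each term to the tagged tokens that immediately follow it."""
          index = defaultdict(list)
          for sentence in corpus:
              for (token, _), nxt in zip(sentence, sentence[1:]):
                  index[token.lower()].append(nxt)
          return index

      def bindings(index, term, wanted_pos="NOUN"):
          """Answer a query like 'powerful <noun>' with one index lookup."""
          hits = [tok for tok, pos in index[term.lower()] if pos == wanted_pos]
          return Counter(hits).most_common()

      index = build_neighborhood_index(corpus)
      print(bindings(index, "powerful"))   # [('engine', 2), ('explosives', 1)]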
    An enhanced model for searching in semantic portals BIBAKFull-Text 453-462
      Lei Zhang; Yong Yu; Jian Zhou; ChenXi Lin; Yin Yang
    Semantic Portals are the next generation of web portals, powered by Semantic Web technologies for improved information sharing and exchange for a community of users. Current methods of searching in Semantic Portals are limited to keyword-based search using information retrieval (IR) techniques, ontology-based formal query and reasoning, or a simple combination of the two. In this paper, we propose an enhanced model that tightly integrates IR with formal query and reasoning to fully utilize both textual and semantic information for searching in Semantic Portals. The model extends the search capabilities of existing methods and can answer more complex search requests. The ideas in a fuzzy description logic (DL) IR model and a formal DL query method are employed and combined in our model. Based on the model, a semantic search service is implemented and evaluated. The evaluation shows very large improvements over existing methods.
    Keywords: fuzzy description logic, fuzzy reasoning, information retrieval, semantic portal, semantic search
    Disambiguating Web appearances of people in a social network BIBAKFull-Text 463-470
      Ron Bekkerman; Andrew McCallum
    Say you are looking for information about a particular person. A search engine returns many pages for that person's name, but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC) -- an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someone's email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure.
    Keywords: Web appearance, document clustering, information bottleneck, link structure, name disambiguation, social network

    Security through the eyes of users

    A convenient method for securely managing passwords BIBAKFull-Text 471-479
      J. Alex Halderman; Brent Waters; Edward W. Felten
    Computer users are asked to generate, keep secret, and recall an increasing number of passwords for uses including host accounts, email servers, e-commerce sites, and online financial services. Unfortunately, the password entropy that users can comfortably memorize seems insufficient to store unique, secure passwords for all these accounts, and it is likely to remain constant as the number of passwords (and the adversary's computational power) increases into the future. In this paper, we propose a technique that uses a strengthened cryptographic hash function to compute secure passwords for arbitrarily many accounts while requiring the user to memorize only a single short password. This mechanism functions entirely on the client; no server-side changes are needed. Unlike previous approaches, our design is both highly resistant to brute force attacks and nearly stateless, allowing users to retrieve their passwords from any location so long as they can execute our program and remember a short secret. This combination of security and convenience will, we believe, entice users to adopt our scheme. We discuss the construction of our algorithm in detail, compare its strengths and weaknesses to those of related approaches, and present Password Multiplier, an implementation in the form of an extension to the Mozilla Firefox web browser.
    Keywords: password security, website user authentication
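    The general construction can be sketched as deriving each site password from one memorized secret with a deliberately slow hash. PBKDF2 is used below as a convenient stand-in for the paper's strengthened-hash scheme; the iteration count, salt layout, and output encoding are illustrative assumptions, not the Password Multiplier design.

      import base64
      import hashlib

      def site_password(master_secret: str, site: str, username: str,
                        length: int = 12, iterations: int = 200_000) -> str:
          """Derive a site-specific password from a single memorized secret."""
          salt = f"{site}\x00{username}".encode()
          raw = hashlib.pbkdf2_hmac("sha256", master_secret.encode(), salt, iterations)
          # Base64 keeps the result typeable; truncate to the desired length.
          return base64.b64encode(raw).decode()[:length]

      print(site_password("correct horse battery staple", "example.com", "alice"))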
    Improving understanding of website privacy policies with fine-grained policy anchors BIBAKFull-Text 480-488
      Stephen E. Levy; Carl Gutwin
    Website privacy policies state the ways that a site will use personally identifiable information (PII) that is collected from fields and forms in web-based transactions. Since these policies can be complex, machine-readable versions have been developed that allow automatic comparison of a site's privacy policy with a user's privacy preferences. However, it is still difficult for users to determine the cause and origin of conformance conflicts, because current standards operate at the page level -- they can only say that there is a conflict on the page, not where the conflict occurs or what causes it. In this paper we describe fine-grained policy anchors, an extension to the way a website implements the Platform for Privacy Preferences (P3P), that solves this problem. Fine-grained policy anchors enable field-level comparisons of policy and preference, field-specific conformance displays, and faster access to additional conformance information. We built a prototype user agent based on these extensions and tested it with representative users. We found that fine-grained anchors do help users understand how privacy policy relates to their privacy preferences, and where and why conformance conflicts occur.
    Keywords: APPEL, P3P, conformance conflicts, e-commerce, privacy policies, privacy preferences, user agents
    Hardening Web browsers against man-in-the-middle and eavesdropping attacks BIBAKFull-Text 489-498
      Haidong Xia; José Carlos Brustoloni
    Existing Web browsers handle security errors in a manner that often confuses users. In particular, when a user visits a secure site whose certificate the browser cannot verify, the browser typically allows the user to view and install the certificate and connect to the site despite the verification failure. However, few users understand the risk of man-in-the-middle attacks and the principles behind certificate-based authentication. We propose context-sensitive certificate verification (CSCV), whereby the browser interrogates the user about the context in which a certificate verification error occurs. Considering the context, the browser then guides the user in handling and possibly overcoming the security error. We also propose specific password warnings (SPW) when users are about to send passwords in a form vulnerable to eavesdropping. We performed user studies to evaluate CSCV and SPW. Our results suggest that CSCV and SPW can greatly improve Web browsing security and are easy to use even without training. Moreover, CSCV had greater impact than did staged security training.
    Keywords: HTTPS, SSL, Web browser, certificate, eavesdropping attack, just-in-time instruction, man-in-the-middle attack, password, safe staging, well-in-advance instruction

    Measurements and analysis

    ATMEN: a triggered network measurement infrastructure BIBAKFull-Text 499-509
      Balachander Krishnamurthy; Harsha V. Madhyastha; Oliver Spatscheck
    Web performance measurements and availability tests have been carried out using a variety of infrastructures over the last several years. Disruptions in the Internet can lead to Web sites being unavailable or increase user-perceived latency. The unavailability could be due to DNS problems, failures in segments of the physical network cutting off thousands of users, or attacks. Prompt reactions to network-wide events can be facilitated by local or remote measurement and monitoring. Better yet, a distributed set of intercommunicating measurement and monitoring entities that react to events dynamically could go a long way toward handling disruptions.
       We have designed and built ATMEN, a triggered measurement infrastructure to communicate and coordinate across various administrative entities. ATMEN nodes can trigger new measurements, query ongoing passive measurements or historical measurements stored on remote nodes, and coordinate the responses to make local decisions. ATMEN reduces wasted measurements by judiciously reusing measurements along three axes: spatial, temporal, and application.
       We describe the use of ATMEN for key Web applications such as performance based ranking of popular Web sites and availability of DNS servers on which most Web transactions are dependent. The evaluation of ATMEN is done using multiple network monitoring entities called Gigascopes installed across the USA, measurement data of a popular network application involving millions of users distributed across the Internet, and scores of clients to aid in gathering measurement information upon demand. Our results show that such a system can be built in a scalable fashion.
    Keywords: DNS availability, measurement, measurement infrastructure, performance-based ranking, reuse, triggered measurements
    On the lack of typical behavior in the global Web traffic network BIBAKFull-Text 510-518
      Mark Meiss; Filippo Menczer; Alessandro Vespignani
    We offer the first large-scale analysis of Web traffic based on network flow data. Using data collected on the Internet2 network, we constructed a weighted bipartite client-server host graph containing more than 18 × 10^6 vertices and 68 × 10^6 edges valued by relative traffic flows. When considered as a traffic map of the World-Wide Web, the generated graph provides valuable information on the statistical patterns that characterize the global information flow on the Web. Statistical analysis shows that client-server connections and traffic flows exhibit heavy-tailed probability distributions lacking any typical scale. In particular, the absence of an intrinsic average in some of the distributions implies the absence of a prototypical scale appropriate for server design, Web-centric network design, or traffic modeling. The inspection of the amount of traffic handled by clients and servers and their number of connections highlights non-trivial correlations between information flow and patterns of connectivity as well as the presence of anomalous statistical patterns related to the behavior of users on the Web. The results presented here may impact considerably the modeling, scalability analysis, and behavioral study of Web applications.
    Keywords: Web usage, degree, network flows, power laws, scale-free networks, strength, traffic statistics
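    To make the "no typical scale" claim concrete, the standard maximum-likelihood (Hill-type) estimator for a power-law tail exponent is sketched below on synthetic data; this is a textbook estimator, not necessarily the analysis performed in the paper.

      import math
      import random

      def power_law_exponent(samples, x_min):
          """MLE of alpha for a tail P(x) ~ x^(-alpha), using samples with x >= x_min."""
          tail = [x for x in samples if x >= x_min]
          return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

      # Synthetic heavy-tailed sample (Pareto, alpha = 2.5) via inverse-transform sampling.
      random.seed(0)
      alpha_true = 2.5
      samples = [(1 - random.random()) ** (-1.0 / (alpha_true - 1.0)) for _ in range(50_000)]
      print(round(power_law_exponent(samples, x_min=1.0), 2))   # close to 2.5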
    Analysis of multimedia workloads with implications for internet streaming BIBAFull-Text 519-528
      Lei Guo; Songqing Chen; Zhen Xiao; Xiaodong Zhang
    In this paper, we study the media workload collected from a large number of commercial Web sites hosted by a major ISP and that collected from a large group of home users connected to the Internet via a well-known cable company. Some of our key findings are: (1) Surprisingly, the majority of media contents are still delivered via downloading from Web servers. (2) A substantial percentage of media downloading connections are aborted before completion due to the long waiting time. (3) A hybrid approach, pseudo streaming, is used by clients to imitate real streaming. (4) The mismatch between the downloading rate and the client playback speed in pseudo streaming is common, which either causes frequent playback delays to the clients, or unnecessary traffic to the Internet. (5) Compared with streaming, downloading and pseudo streaming are neither bandwidth efficient nor performance effective. To address this problem, we propose the design of AutoStream, an innovative system that can provide additional previewing and streaming services automatically for media objects hosted on standard Web sites in server farms at the client's will.

    Keynote

    Real and the future of digital media BIBFull-Text 529
      Rob Glaser

    Service selection and metadata

    On optimal service selection BIBAKFull-Text 530-538
      P. A. Bonatti; P. Festa
    While many works have been devoted to service matchmaking and modeling nonfunctional properties, the problem of matching service requests to offers in an optimal way has not yet been extensively studied. In this paper we formalize three kinds of optimal service selection problems, based on different criteria. Then we study their complexity and implement solutions. We prove that one-time costs make the optimal selection problem computationally hard; in the absence of these costs the problem can be solved in polynomial time. We designed and implemented both exact and heuristic (suboptimal) algorithms for the hard case, and carried out a preliminary experimental evaluation with interesting results.
    Keywords: automatic service composition, nonfunctional properties, service matchmaking, service selection problem
    G-ToPSS: fast filtering of graph-based metadata BIBAKFull-Text 539-547
      Milenko Petrovic; Haifeng Liu; Hans-Arno Jacobsen
    RDF is increasingly being used to represent metadata. RDF Site Summary (RSS) is an application of RDF on the Web that has considerably grown in popularity. However, the way RSS systems operate today does not scale well. In this paper we introduce G-ToPSS, a scalable publish/subscribe system for selective information dissemination. G-ToPSS is particularly well suited for applications that deal with large-volume content distribution from diverse sources. RSS is an instance of the content distribution problem. G-ToPSS allows use of ontology as a way to provide additional information about the data. Furthermore, in this paper we show how G-ToPSS can support RDFS class taxonomies. We have implemented and experimentally evaluated G-ToPSS and we provide results in the paper demonstrating its scalability compared to alternatives.
    Keywords: RDF, content-based routing, graph matching, information dissemination, publish/subscribe
    Automating metadata generation: the simple indexing interface BIBAKFull-Text 548-556
      Kris Cardinaels; Michael Meire; Erik Duval
    In this paper, we focus on the development of a framework for automatic metadata generation. The first step towards this framework is the definition of an Application Programmer Interface (API), which we call the Simple Indexing Interface (SII). The second step is the definition of a framework for implementation of the SII. Both steps are presented in some detail in this paper. We also report on empirical evaluation of the metadata that the SII and supporting framework generated in a real-life context.
    Keywords: learning objects, metadata generation

    Link-based ranking

    PageRank as a function of the damping factor BIBAKFull-Text 557-566
      Paolo Boldi; Massimo Santini; Sebastiano Vigna
    PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection [21]. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O(t^k α^t) for the k-th derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.
    Keywords: PageRank, Web graph, approximation
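    The dependence on the damping factor is easy to probe with a plain power iteration parametrized by α; the tiny graph below is hypothetical, dangling nodes spread their mass uniformly (one common convention), and this generic implementation is not the paper's extended Power Method for derivatives.

      import numpy as np

      def pagerank(out_links, alpha, iters=100):
          """Power-iteration PageRank with damping factor alpha."""
          nodes = sorted(out_links)
          idx = {v: i for i, v in enumerate(nodes)}
          n = len(nodes)
          r = np.full(n, 1.0 / n)
          for _ in range(iters):
              nxt = np.full(n, (1.0 - alpha) / n)
              for u, succs in out_links.items():
                  if succs:
                      share = alpha * r[idx[u]] / len(succs)
                      for v in succs:
                          nxt[idx[v]] += share
                  else:                       # dangling node: spread rank uniformly
                      nxt += alpha * r[idx[u]] / n
              r = nxt
          return dict(zip(nodes, r))

      graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
      for alpha in (0.5, 0.85, 0.99):
          ranks = pagerank(graph, alpha)
          print(alpha, sorted(ranks, key=ranks.get, reverse=True))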
    Object-level ranking: bringing order to Web objects BIBAKFull-Text 567-574
      Zaiqing Nie; Yuanzhi Zhang; Ji-Rong Wen; Wei-Ying Ma
    In contrast with the current Web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable Web search at the object level. We collect Web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. The traditional PageRank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. This paper introduces PopRank, a domain-independent object-level link analysis model to rank the objects within a specific domain. Specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. Our experiments are done using 1 million CS papers, and the experimental results show that PopRank can achieve significantly better ranking results than naively applying PageRank on the object graph.
    Keywords: PageRank, PopRank, Web information retrieval, Web objects, link analysis
    A uniform approach to accelerated PageRank computation BIBAKFull-Text 575-582
      Frank McSherry
    In this note we consider a simple reformulation of the traditional power iteration algorithm for computing the stationary distribution of a Markov chain. Rather than communicate their current probability values to their neighbors at each step, nodes instead communicate only changes in probability value. This reformulation enables a large degree of flexibility in the manner in which nodes update their values, leading to an array of optimizations and features, including faster convergence, efficient incremental updating, and a robust distributed implementation.
       While the spirit of many of these optimizations appears in previous literature, we observe several cases where this unification simplifies previous work, removing technical complications and extending their range of applicability. We implement and measure the performance of several optimizations on a sizable (34M node) web subgraph, seeing significant composite performance gains, especially for the case of incremental recomputation after changes to the web graph.
    Keywords: PageRank, link analysis, random walks, web graph
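    The "communicate only changes" idea can be sketched as a push-style iteration in which each node holds an unpropagated residual and, when that residual grows large enough, folds it into its own value and pushes α·residual/outdegree to its successors. The graph is hypothetical, dangling nodes are ignored for brevity, and this illustrates the general scheme rather than the paper's exact algorithm.

      from collections import deque

      def push_pagerank(out_links, alpha=0.85, tol=1e-10):
          """PageRank computed by propagating only changes (residuals) between nodes."""
          n = len(out_links)
          rank = {u: 0.0 for u in out_links}
          residual = {u: (1.0 - alpha) / n for u in out_links}   # unpropagated change
          active, queued = deque(out_links), set(out_links)
          while active:
              u = active.popleft()
              queued.discard(u)
              change = residual[u]
              if change < tol:
                  continue
              residual[u] = 0.0
              rank[u] += change
              succs = out_links[u]
              if not succs:
                  continue
              push = alpha * change / len(succs)
              for v in succs:
                  residual[v] += push
                  if residual[v] >= tol and v not in queued:
                      active.append(v)
                      queued.add(v)
          return rank

      print(push_pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))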

    Improving the browsing experience

    Information search and re-access strategies of experienced web users BIBAKFull-Text 583-592
      Anne Aula; Natalie Jhaveri; Mika Käki
    Experienced web users have strategies for information search and re-access that are not directly supported by web browsers or search engines. We studied how prevalent these strategies are and whether even experienced users have problems with searching and re-accessing information. With this aim, we conducted a survey with 236 experienced web users. The results showed that this group frequently uses key strategies (e.g., using several browser windows in parallel) that they find important, whereas some of the strategies that have been suggested in previous studies are clearly less important for them (e.g., including URLs on a webpage). In some aspects, such as query formulation, this group resembles less experienced web users. For instance, we found that most of the respondents had misconceptions about how their search engine handles queries, as well as other problems with information search and re-access. In addition to presenting the prevalence of the strategies and rationales for their use, we present concrete design solutions and ideas for making the key strategies available to less experienced users as well.
    Keywords: experienced web users, information re-access, questionnaire study, web search
    Browsing fatigue in handhelds: semantic bookmarking spells relief BIBAKFull-Text 593-602
      Saikat Mukherjee; I. V. Ramakrishnan
    Focused Web browsing activities such as periodically looking up headline news, weather reports, etc., which require only selective fragments of particular Web pages, can be made more efficient for users of limited-display-size handheld mobile devices by delivering only the target fragments. Semantic bookmarks provide a robust conceptual framework for recording and retrieving such targeted content not only from the specific pages used in creating the bookmarks but also from any user-specified page with similar content semantics. This paper describes a technique for realizing semantic bookmarks by coupling machine learning with Web page segmentation to create a statistical model of the bookmarked content. These models are used to identify and retrieve the bookmarked content from Web pages that share a common content domain. In contrast to ontology-based approaches where semantic bookmarks are limited to available concepts in the ontology, the learning-based approach allows users to bookmark ad-hoc personalized semantic concepts to effectively target content that fits the limited display of handhelds. User evaluation measuring the effectiveness of a prototype implementation of learning-based semantic bookmarking at reducing browsing fatigue in handhelds is provided.
    Keywords: Web page partitioning, handheld device content adaptation, semantic bookmarking
    WebPod: persistent Web browsing sessions with pocketable storage devices BIBAKFull-Text 603-612
      Shaya Potter; Jason Nieh
    We present WebPod, a portable system that enables mobile users to use the same persistent, personalized web browsing session on any Internet-enabled device. No matter what computer is being used, WebPod provides a consistent browsing session, maintaining all of a user's plugins, bookmarks, browser web content, open browser windows, and browser configuration options and preferences. This is achieved by leveraging rapid improvements in capacity, cost, and size of portable storage devices. WebPod provides a virtualization and checkpoint/restart mechanism that decouples the browsing environment from the host, enabling web browsing sessions to be suspended to portable storage, carried around, and resumed from the storage device on another computer. WebPod virtualization also isolates web browsing sessions from the host, protecting the browsing privacy of the user and preventing malicious web content from damaging the host. We have implemented a Linux WebPod prototype and demonstrate its ability to quickly suspend and resume web browsing sessions, enabling a seamless web browsing experience for mobile users as they move among computers.
    Keywords: checkpoint/restart, portable storage, process migration, virtualization, web browsing

    Semantic Web foundations

    Named graphs, provenance and trust BIBAKFull-Text 613-622
      Jeremy J. Carroll; Christian Bizer; Pat Hayes; Patrick Stickler
    The Semantic Web consists of many RDF graphs nameable by URIs. This paper extends the syntax and semantics of RDF to cover such Named Graphs. This enables RDF statements that describe graphs, which is beneficial in many Semantic Web application areas. As a case study, we explore the application area of Semantic Web publishing: Named Graphs allow publishers to communicate assertional intent, and to sign their graphs; information consumers can evaluate specific graphs using task-specific trust policies, and act on information from those Named Graphs that they accept. Graphs are trusted depending on: their content; information about the graph; and the task the user is performing. The extension of RDF to Named Graphs provides a formally defined framework to be a foundation for the Semantic Web trust layer.
    Keywords: RDF, provenance, semantic Web, trust
    OWL DL vs. OWL flight: conceptual modeling and reasoning for the semantic Web BIBAKFull-Text 623-632
      Jos de Bruijn; Rubén Lara; Axel Polleres; Dieter Fensel
    The Semantic Web languages RDFS and OWL have been around for some time now. However, the presence of these languages has not brought the breakthrough of the Semantic Web the creators of the languages had hoped for. OWL has a number of problems in the area of interoperability and usability in the context of many practical application scenarios which impede the connection to the Software Engineering and Database communities. In this paper we present OWL Flight, which is loosely based on OWL, but the semantics is grounded in Logic Programming rather than Description Logics, and it borrows the constraint-based modeling style common in databases. This results in different types of modeling primitives and enforces a different style of ontology modeling. We analyze the modeling paradigms of OWL DL and OWL Flight, as well as reasoning tasks supported by both languages. We argue that different applications on the Semantic Web require different styles of modeling and thus both types of languages are required for the Semantic Web.
    Keywords: description logics, logic programming, ontologies, semantic Web
    Debugging OWL ontologies BIBAKFull-Text 633-640
      Bijan Parsia; Evren Sirin; Aditya Kalyanpur
    As an increasingly large number of OWL ontologies become available on the Semantic Web and the descriptions in the ontologies become more complicated, finding the cause of errors becomes an extremely hard task even for experts. Existing ontology development environments provide some limited support, in conjunction with a reasoner, for detecting and diagnosing errors in OWL ontologies. Typically these are restricted to the mere detection of, for example, unsatisfiable concepts. We have integrated a number of simple debugging cues generated from our description logic reasoner, Pellet, in our hypertextual ontology development environment, Swoop. These cues, in conjunction with extensive undo/redo and Annotea based collaboration support in Swoop, significantly improve the OWL debugging experience, and point the way to more general improvements in the presentation of an ontology to new users.
    Keywords: OWL, explanation, ontology engineering, semantic Web

    Link-based similarity

    Scaling link-based similarity search BIBAKFull-Text 641-650
      Dániel Fogaras; Balázs Rácz
    To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices is numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provides more adequate information for similarity search than single-step neighborhoods.
    Keywords: fingerprint, link-analysis, scalability, similarity search
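    One concrete instance of the fingerprint framework is easy to sketch: min-hash fingerprints of each vertex's in-neighborhood, precomputed once, let the Jaccard coefficient of two neighborhoods be estimated at query time from the fingerprints alone. This single-step sketch uses a made-up graph; the paper's multi-step Jaccard, SimRank, and PSimRank fingerprints are constructed differently.

      import random

      def minhash_fingerprints(in_neighbors, num_hashes=64, seed=0):
          """Precompute a min-hash fingerprint of every vertex's in-neighborhood."""
          rng = random.Random(seed)
          universe = set(in_neighbors)
          for nbrs in in_neighbors.values():
              universe |= set(nbrs)
          nodes = sorted(universe)
          perms = []
          for _ in range(num_hashes):
              shuffled = nodes[:]
              rng.shuffle(shuffled)
              perms.append({v: i for i, v in enumerate(shuffled)})
          return {v: tuple(min((p[u] for u in nbrs), default=-1) for p in perms)
                  for v, nbrs in in_neighbors.items()}

      def estimated_jaccard(fp_a, fp_b):
          """Fraction of agreeing fingerprint coordinates estimates the Jaccard coefficient."""
          hits = sum(1 for x, y in zip(fp_a, fp_b) if x == y and x != -1)
          return hits / len(fp_a)

      fp = minhash_fingerprints({"u": {"a", "b", "c"}, "v": {"b", "c", "d"}, "w": {"x"}})
      print(estimated_jaccard(fp["u"], fp["v"]))   # near 0.5 (true Jaccard is 2/4)
      print(estimated_jaccard(fp["u"], fp["w"]))   # 0.0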
    LSH forest: self-tuning indexes for similarity search BIBAKFull-Text 651-660
      Mayank Bawa; Tyson Condie; Prasanna Ganesan
    We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
    Keywords: peer-to-peer (P2P), similarity indexes
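    A single "tree" of an LSH-Forest-style index can be sketched in memory with random-hyperplane bit labels, a sorted label list standing in for the logical prefix tree, and candidates ranked by the length of their common prefix with the query label. The vectors, bit count, and ranking rule are simplifications; the real index uses variable-length labels, several trees, and disk or peer-to-peer layouts.

      import numpy as np

      def build_tree(vectors, num_bits=16, seed=0):
          """Assign each point a bit label from random hyperplanes; keep labels sorted."""
          rng = np.random.default_rng(seed)
          dim = len(next(iter(vectors.values())))
          planes = rng.standard_normal((num_bits, dim))
          def label(x):
              return tuple(int(v >= 0) for v in planes @ np.asarray(x, dtype=float))
          entries = sorted((label(x), key) for key, x in vectors.items())
          return planes, entries

      def query(tree, q, m=3):
          """Return the m stored keys whose labels share the longest prefix with q's label."""
          planes, entries = tree
          ql = tuple(int(v >= 0) for v in planes @ np.asarray(q, dtype=float))
          def prefix_len(lab):
              n = 0
              for x, y in zip(lab, ql):
                  if x != y:
                      break
                  n += 1
              return n
          return [key for _, key in sorted(entries, key=lambda e: -prefix_len(e[0]))[:m]]

      points = {"p1": [1.0, 0.1], "p2": [0.9, 0.2], "p3": [-1.0, 0.5]}
      print(query(build_tree(points), [1.0, 0.0], m=2))   # p1 and p2 rank ahead of p3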
    Partitioning of Web graphs by community topology BIBAKFull-Text 661-669
      Hidehiko Ino; Mineichi Kudo; Atsuyoshi Nakamura
    We introduce a stricter Web community definition to overcome boundary ambiguity of a Web community defined by Flake, Lawrence and Giles [2], and consider the problem of finding communities that satisfy our definition. We discuss how to find such communities and the hardness of this problem.
       We also propose Web page partitioning by equivalence relation defined using the class of communities of our definition. Though the problem of efficiently finding all communities of our definition is NP-complete, we propose an efficient method of finding a subclass of communities among the sets partitioned by each of n-1 cuts represented by a Gomory-Hu tree [10], and partitioning a Web graph by equivalence relation defined using the subclass.
       According to our preliminary experiments, partitioning by our method divided the pages retrieved by keyword search into several different categories to some extent.
    Keywords: Web community, graph partitioning, maximum flow algorithm

    XML parsing and stylesheets

    Incremental maintenance for materialized XPath/XSLT views BIBAKFull-Text 671-681
      Makoto Onizuka; Fong Yee Chan; Ryusuke Michigami; Takashi Honishi
    This paper proposes an incremental maintenance algorithm that efficiently updates the materialized XPath/XSLT views defined using XPath expressions in XP([],*,//,vars). The algorithm consists of two processes. 1) The dynamic execution flow of an XSLT program is stored as an XT (XML Transformation) tree during the full transformation. 2) In response to a source XML data update, the impacted portions of the XT-tree are identified and maintained by partially re-evaluating the XSLT program. This paper discusses the XPath/XSLT features of incremental view maintenance for subtree insertion/deletion and applies them to the maintenance algorithm. Experiments show that the incremental maintenance algorithm outperforms full XML transformation algorithms by factors of up to 500.
    Keywords: XML, XPath, XSLT, materialized view, view maintenance
    Compiling XSLT 2.0 into XQuery 1.0 BIBAKFull-Text 682-691
      Achille Fokoue; Kristoffer Rose; Jérôme Siméon; Lionel Villard
    As XQuery is gathering momentum as the standard query language for XML, there is a growing interest in using it as an integral part of the XML application development infrastructure. In that context, one question which is often raised is how well XQuery interoperates with other XML languages, and notably with XSLT. XQuery 1.0 [16] and XSLT 2.0 [7] share a lot in common: they share XPath 2.0 as a common sub-language and have the same expressiveness. However, they are based on fairly different programming paradigms. While XSLT has adopted a highly declarative template based approach, XQuery relies on a simpler, and more operational, functional approach.
       In this paper, we present an approach to compile XSLT 2.0 into XQuery 1.0, and a working implementation of that approach. The compilation rules explain how XSLT's template-based approach can be implemented using the functional approach of XQuery and underpin the tight connection between the two languages. The resulting compiler can be used to migrate an XSLT code base to XQuery, or to enable the use of XQuery runtimes (e.g., as will soon be provided by most relational database management systems) for XSLT users. We also identify a number of areas where compatibility between the two languages could be improved. Finally, we show experiments on actual XSLT stylesheets, demonstrating the applicability of the approach in practice.
    Keywords: Web services, XML, XQuery, XSLT
    An adaptive, fast, and safe XML parser based on byte sequences memorization BIBAKFull-Text 692-701
      Toshiro Takase; Hisashi Miyashita; Toyotaro Suzumura; Michiaki Tatsubori
    XML (Extensible Markup Language) processing can incur significant runtime overhead in XML-based infrastructural middleware such as Web service application servers. This paper proposes a novel mechanism for efficiently processing similar XML documents. Given a new XML document as a byte sequence, the XML parser proposed in this paper normally avoids syntactic analysis but simply matches the document with previously processed ones, reusing those results. Our parser is adaptive since it partially parses and then remembers XML document fragments that it has not met before. Moreover, it processes safely since its partial parsing correctly checks the well-formedness of documents. Our implementation of the proposed parser complies with the JSR 63 standard of the Java API for XML Processing (JAXP) 1.1 specification. We evaluated the performance of our parser, Deltarser, with messages using Google Web services. Compared to Piccolo (and Apache Xerces), it parses 35% (106%) faster in a server-side use-case scenario, and 73% (126%) faster in a client-side use-case scenario.
    Keywords: SAX, XML parsers, automata

    Schemas and semantics

    CaTTS: calendar types and constraints for Web applications BIBAKFull-Text 702-711
      François Bry; Frank-André Ries; Stephanie Spranger
    Data referring to cultural calendars such as the widespread Gregorian dates but also dates after the Chinese, Hebrew, or Islamic calendars as well as data referring to professional calendars like fiscal years or teaching terms are omnipresent on the Web. Formalisms such as XML Schema have acknowledged this by offering a rather extensive set of Gregorian dates and times as basic data types. This article introduces CaTTS, the Calendar and Time Type System. CaTTS goes far beyond predefined date and time types after the Gregorian calendar as supported by XML Schema. CaTTS first makes it possible to declaratively specify more or less complex cultural or professional calendars, including specificities such as leap seconds, leap years, and time zones. CaTTS further offers a tool for static type checking of data typed after calendar(s) defined in CaTTS. CaTTS finally offers a language for declaratively expressing, and a solver for efficiently solving, temporal constraints referring to calendar(s) expressed in CaTTS. CaTTS complements data modeling and reasoning methods designed for generic Semantic Web applications such as RDF or OWL with methods specific to the particular application domain of calendars and time.
    Keywords: Web reasoning, calendars, time, types
    Expressiveness of XSDs: from practice to theory, there and back again BIBAKFull-Text 712-721
      Geert Jan Bex; Wim Martens; Frank Neven; Thomas Schwentick
    On an abstract level, XML Schema increases the limited expressive power of Document Type Definitions (DTDs) by extending them with a recursive typing mechanism. However, an investigation of the XML Schema Definitions (XSDs) occurring in practice reveals that the vast majority of them are structurally equivalent to DTDs. This might be due to the complexity of the XML Schema specification and the difficulty of understanding the effect of constraints on typing and validation of schemas. To shed some light on the actual expressive power of XSDs this paper studies the impact of the Element Declarations Consistent (EDC) and the Unique Particle Attribution (UPA) rule. An equivalent formalism based on contextual patterns rather than on recursive types is proposed which might serve as a light-weight front end for XML Schema. Finally, the effect of EDC and UPA on the way XML documents can be typed is discussed. It is argued that a cleaner, more robust, stronger but equally efficient class is obtained by replacing EDC and UPA with the notion of 1-pass preorder typing: schemas that allow the type of an element of a streaming document to be determined when its opening tag is met. This notion can be defined in terms of restrained competition regular expressions and there is again an equivalent syntactical formalism based on contextual patterns.
    Keywords: XML schema, expressiveness, formal model
    WEESA: Web engineering for semantic Web applications BIBAKFull-Text 722-729
      Gerald Reif; Harald Gall; Mehdi Jazayeri
    The success of the Semantic Web crucially depends on the existence of Web pages that provide machine-understandable meta-data. This meta-data is typically added in the semantic annotation process, which is currently not part of the Web engineering process. Web engineering, however, proposes methodologies to design, implement and maintain Web applications but lacks the generation of meta-data. In this paper we introduce a technique to extend existing Web engineering methodologies to develop semantically annotated Web pages. The novelty of this approach is the definition of a mapping from XML Schema to ontologies, called WEESA, that can be used to automatically generate RDF meta-data from XML content documents. We further show how we integrated the WEESA mapping into an Apache Cocoon transformer to easily extend XML-based Web applications to semantically annotated Web applications.
    Keywords: Web engineering, ontology, semantic Web, semantic annotation

    Architecture and implementation of Web sites

    A multi-threaded PIPELINED Web server architecture for SMP/SoC machines BIBAKFull-Text 730-739
      Gyu Sang Choi; Jin-Ha Kim; Deniz Ersoz; Chita R. Das
    Design of high performance Web servers has become a recent research thrust to meet the increasing demand of network-based services. In this paper, we propose a new Web server architecture, called multi-threaded PIPELINED Web server, suitable for Symmetric Multi-Processor (SMP) or System-on-Chip (SoC) architectures. The proposed PIPELINED model consists of multiple thread pools, where each thread pool consists of five basic threads and two helper threads. The main advantages of the proposed model are global information sharing by the threads, minimal synchronization overhead due to the smaller number of threads, and non-blocking I/O operations, made possible by the helper threads.
       We have conducted an in-depth performance analysis of the proposed server model along with four prior Web server models (Multi-Process (MP), Multi-Thread (MT), Single-Process Event-Driven (SPED) and Asynchronous Multi-Process Event-Driven (AMPED)) via simulation using six Web server workloads. The experiments are conducted to investigate the impact of various factors such as the memory size, disk speed and number of clients. The simulation results indicate that the proposed PIPELINED Web server architecture shows the best performance across all system and workload parameters compared to the MP, MT, SPED and AMPED models. Although the MT and AMPED models show competitive performance with a smaller number of processors, the advantage of the PIPELINED model becomes obvious as the number of processors or clients in an SMP/SoC machine increases. The MP model shows the worst performance in most of the cases. The results indicate that the proposed server architecture can be used in future large-scale SMP/SoC machines to boost system performance.
    Keywords: asynchronous multi-process event-driven, multi-process, multi-thread, single event-driven process, symmetric multi-processor, system-on-chip
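    The pipelined organization can be illustrated with a toy in-process version: each stage owns a queue and a small pool of worker threads, and a request flows through the stages in order. The stages, worker counts, and request handling below are hypothetical, and unlike the proposed server there is no real socket or disk I/O (which the paper delegates to helper threads).

      import queue
      import threading

      def make_stage(work, out_q=None, workers=2):
          """Start a small thread pool that applies `work` to items from this stage's queue."""
          in_q = queue.Queue()
          def loop():
              while True:
                  item = in_q.get()
                  result = work(item)
                  if out_q is not None:
                      out_q.put(result)
                  in_q.task_done()
          for _ in range(workers):
              threading.Thread(target=loop, daemon=True).start()
          return in_q

      # Three toy stages standing in for the read/parse/handle/send threads of the pipeline.
      send_q = make_stage(lambda resp: print("sent:", resp))
      handle_q = make_stage(lambda req: f"<html>{req['path']}</html>", out_q=send_q)
      parse_q = make_stage(lambda raw: {"path": raw.split()[1]}, out_q=handle_q)

      for raw in ["GET /index.html HTTP/1.0", "GET /news.html HTTP/1.0"]:
          parse_q.put(raw)
      for stage_q in (parse_q, handle_q, send_q):
          stage_q.join()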
    Cataclysm: policing extreme overloads in internet applications BIBAKFull-Text 740-749
      Bhuvan Urgaonkar; Prashant Shenoy
    In this paper we present the Cataclysm server platform for handling extreme overloads in hosted Internet applications. The primary contribution of our work is to develop a low overhead, highly scalable admission control technique for Internet applications. Cataclysm provides several desirable features, such as guarantees on response time by conducting accurate size-based admission control, revenue maximization at multiple time-scales via preferential admission of important requests and dynamic capacity provisioning, and the ability to be operational even under extreme overloads. Cataclysm can transparently trade-off the accuracy of its decision making with the intensity of the workload allowing it to handle incoming rates of several tens of thousands of requests/second. We implement a prototype Cataclysm hosting platform on a Linux cluster and demonstrate the benefits of our integrated approach using a variety of workloads.
    Keywords: internet application, overload, sentry
    Design for verification for asynchronously communicating Web services BIBAKFull-Text 750-759
      Aysu Betin-Can; Tevfik Bultan; Xiang Fu
    We present a design for verification approach to developing reliable web services. We focus on composite web services which consist of asynchronously communicating peers. Our goal is to automatically verify properties of interactions among such peers. We propose a design pattern that eases the development of such web services and enables a modular, assume-guarantee style verification strategy. In the proposed design pattern, each peer is associated with a behavioral interface description which specifies how that peer will interact with other peers. Using these peer interfaces we automatically generate BPEL specifications to publish for interoperability. Assuming that the participating peers behave according to their interfaces, we verify safety and liveness properties about the global behavior of the composite web service during behavior verification. During interface verification, we check that each peer implementation conforms to its interface. Using the modularity in the proposed design pattern, we are able to perform the interface verification of each peer and the behavior verification as separate steps. Our experiments show that, using this modular approach, one can automatically and efficiently verify web service implementations.
    Keywords: BPEL, asynchronous communication, composite web services, design patterns