
ACM Transactions on The Web 3

Editors: Helen Ashman; Arun Iyengar
Dates: 2009
Volume: 3
Publisher: ACM
Standard No: ISSN 1559-1131; EISSN 1559-114X
Papers: 14
Links: Journal Home Page | ACM Digital Library | Table of Contents
  1. TWEB 2009-01 Volume 3 Issue 1
  2. TWEB 2009-04 Volume 3 Issue 2
  3. TWEB 2009-06 Volume 3 Issue 3
  4. TWEB 2009-09 Volume 3 Issue 4

TWEB 2009-01 Volume 3 Issue 1

Methods for extracting place semantics from Flickr tags BIBAFull-Text 1
  Tye Rattenbury; Mor Naaman
We describe an approach for extracting semantics for tags, unstructured text labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place semantics for tags that are assigned to photos on Flickr, a popular photo-sharing Web site that supports location (latitude/longitude) metadata for photos. We propose the adaptation of two baseline methods, inspired by well-known burst-analysis techniques, for the task; we also describe two novel methods, TagMaps and scale-structure identification. We evaluate the methods on a subset of Flickr data. We show that our scale-structure identification method outperforms existing techniques and that a hybrid approach generates further improvements (achieving 85% precision at 81% recall). The approach and methods described in this work can be used in other domains such as geo-annotated Web pages, where text terms can be extracted and associated with usage patterns.
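As a rough illustration of how place semantics might be read off a tag's usage pattern, the following Python sketch flags tags whose geotagged uses cluster tightly in space. It is a simplified spatial-concentration heuristic, not the authors' TagMaps or scale-structure identification methods, and the `photos` input of (tag, latitude, longitude) triples and both thresholds are assumptions.

```python
import math
from collections import defaultdict

def centroid(points):
    """Mean latitude/longitude of a set of (lat, lng) points."""
    lat = sum(p[0] for p in points) / len(points)
    lng = sum(p[1] for p in points) / len(points)
    return lat, lng

def dispersion_km(points):
    """Mean great-circle distance (km) from each point to the centroid."""
    clat, clng = centroid(points)
    R = 6371.0
    total = 0.0
    for lat, lng in points:
        dlat = math.radians(lat - clat)
        dlng = math.radians(lng - clng)
        a = (math.sin(dlat / 2) ** 2 +
             math.cos(math.radians(clat)) * math.cos(math.radians(lat)) *
             math.sin(dlng / 2) ** 2)
        total += 2 * R * math.asin(math.sqrt(a))
    return total / len(points)

def candidate_place_tags(photos, max_dispersion_km=10.0, min_uses=20):
    """photos: iterable of (tag, lat, lng) triples; returns tags whose
    usage is spatially concentrated enough to suggest place semantics."""
    by_tag = defaultdict(list)
    for tag, lat, lng in photos:
        by_tag[tag].append((lat, lng))
    return [t for t, pts in by_tag.items()
            if len(pts) >= min_uses and dispersion_km(pts) <= max_dispersion_km]
```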
Protecting browsers from DNS rebinding attacks BIBAFull-Text 2
  Collin Jackson; Adam Barth; Andrew Bortz; Weidong Shao; Dan Boneh
DNS rebinding attacks subvert the same-origin policy of browsers, converting them into open network proxies. Using DNS rebinding, an attacker can circumvent organizational and personal firewalls, send spam email, and defraud pay-per-click advertisers. We evaluate the cost effectiveness of mounting DNS rebinding attacks, finding that an attacker requires less than $100 to hijack 100,000 IP addresses. We analyze defenses to DNS rebinding attacks, including improvements to the classic "DNS pinning," and recommend changes to browser plug-ins, firewalls, and Web servers. Our defenses have been adopted by plug-in vendors and by a number of open-source firewall implementations.
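One firewall-side defense of the kind the paper recommends, refusing to let external hostnames resolve to internal addresses, can be sketched as follows; the blocked ranges and the `safe_resolve` helper are illustrative, not the paper's implementation.

```python
import ipaddress
import socket

# Ranges an external hostname should never legitimately resolve to
# (private, loopback, link-local); the exact policy here is illustrative.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "127.0.0.0/8", "169.254.0.0/16",
)]

def safe_resolve(hostname):
    """Resolve a hostname and reject any answer pointing into an internal
    range, blocking the rebinding step of a DNS rebinding attack."""
    addresses = {info[4][0].split("%")[0]          # drop IPv6 zone index
                 for info in socket.getaddrinfo(hostname, None)}
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in BLOCKED_NETWORKS):
            raise ValueError(f"{hostname} resolves to internal address {addr}")
    return addresses
```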
Do not crawl in the DUST: Different URLs with similar text BIBAFull-Text 3
  Ziv Bar-Yossef; Idit Keidar; Uri Schonfeld
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
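A toy version of the rule-mining idea, discovering token substitutions that map URLs in a log onto other URLs in the same log, might look like the sketch below; DustBuster's actual support and likelihood computations are considerably more involved, and the segment-based tokenization here is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def candidate_dust_rules(urls, min_support=3):
    """Toy rule miner: find token substitutions alpha -> beta such that
    replacing alpha with beta in many URLs yields another URL in the log.
    Tokens here are simply '/'-delimited path segments."""
    urls = set(urls)
    contexts = defaultdict(set)          # (prefix, suffix) -> {token}
    for url in urls:
        parts = url.split("/")
        for i, token in enumerate(parts):
            prefix = "/".join(parts[:i])
            suffix = "/".join(parts[i + 1:])
            contexts[(prefix, suffix)].add(token)

    support = defaultdict(int)           # (alpha, beta) -> #shared contexts
    for tokens in contexts.values():
        for a, b in combinations(sorted(tokens), 2):
            support[(a, b)] += 1
    return [(a, b, s) for (a, b), s in support.items() if s >= min_support]

# Example: "story.html" and "story.htm" appearing under many shared
# prefixes/suffixes would surface as a likely DUST rule.
```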
Browsing on small displays by transforming Web pages into hierarchically structured subpages BIBAFull-Text 4
  Xiangye Xiao; Qiong Luo; Dan Hong; Hongbo Fu; Xing Xie; Wei-Ying Ma
We propose a new Web page transformation method to facilitate Web browsing on handheld devices such as Personal Digital Assistants (PDAs). In our approach, an original Web page that does not fit on the screen is transformed into a set of subpages, each of which fits on the screen. This transformation is done by slicing the original page into page blocks iteratively, taking several factors into account: the size of the screen, the size of each page block, the number of blocks in each transformed page, the depth of the tree hierarchy that the transformed pages form, and the semantic coherence between blocks. We call the tree hierarchy of the transformed pages an SP-tree. In an SP-tree, an internal node consists of a textually enhanced thumbnail image with hyperlinks, and a leaf node is a block extracted from a subpage of the original Web page. We adaptively adjust the fanout and the height of the SP-tree so that each thumbnail image is clear enough for users to read, while the number of clicks needed to reach a leaf page remains small. Through this transformation algorithm, we preserve the contextual information in the original Web page and reduce scrolling. We have implemented this transformation module on a proxy server and have conducted usability studies on its performance. Our system achieved shorter task completion times than transformations from the Opera browser in nine of ten tasks. The average improvement on familiar pages was 44%; on unfamiliar pages it was 37%. Subjective responses were positive.
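A bare-bones sketch of the slicing idea, assuming a simple `Block` tree annotated with rendered heights, is shown below; the fanout/height tuning, thumbnail generation, and semantic-coherence grouping described in the paper are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Block:
    """A page block with its rendered height (pixels) and child blocks."""
    height: int
    children: List["Block"] = field(default_factory=list)

# A leaf is a Block that fits the screen; an internal node is a list of
# child subtrees, which the real system renders as a clickable,
# textually enhanced thumbnail index page.
SPNode = Union[Block, list]

def build_sp_tree(block: Block, screen_height: int) -> SPNode:
    """Recursively slice a block that overflows the screen into subpages."""
    if block.height <= screen_height or not block.children:
        return block            # fits (or is atomic): one leaf subpage
    return [build_sp_tree(child, screen_height) for child in block.children]
```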

TWEB 2009-04 Volume 3 Issue 2

Classifying search queries using the Web as a source of knowledge BIBAFull-Text 5
  Evgeniy Gabrilovich; Andrei Broder; Marcus Fontoura; Amruta Joshi; Vanja Josifovski; Lance Riedel; Tong Zhang
We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
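The pseudo-relevance-feedback step can be sketched as follows, where `search` and `classify_document` stand in for a retrieval backend and a document classifier; both are assumed, hypothetical components, not the paper's system.

```python
from collections import Counter

def classify_query(query, search, classify_document, k=40):
    """Classify a (possibly rare) query by classifying its top-k Web search
    results and letting the results vote on the query's classes."""
    results = search(query, k)                        # -> list of page texts
    votes = Counter()
    for text in results:
        for label, score in classify_document(text):  # -> [(class, weight)]
            votes[label] += score
    total = sum(votes.values()) or 1.0
    return [(label, weight / total) for label, weight in votes.most_common(5)]
```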
A large-scale empirical study of P3P privacy policies: Stated actions vs. legal obligations BIBAFull-Text 6
  Ian Reay; Scott Dick; James Miller
Numerous studies over the past ten years have shown that concern for personal privacy is a major impediment to the growth of e-commerce. These concerns are so serious that most, if not all, consumer watchdog groups have called for some form of privacy protection for Internet users. In response, many nations around the world, including all European Union nations, Canada, Japan, and Australia, have enacted national legislation establishing mandatory safeguards for personal privacy. However, recent evidence indicates that Web sites might not be adhering to the requirements of this legislation. The goal of this study is to examine the posted privacy policies of Web sites and compare these statements to the legal mandates under which the Web sites operate. We harvested all available P3P (Platform for Privacy Preferences) documents from the 100,000 most popular Web sites (over 3,000 full policies, and another 3,000 compact policies). This allows us to undertake an automated analysis of adherence to legal mandates on the Web sites that most affect the average Internet user. Our findings show that Web sites generally do not even claim to follow all the privacy-protection mandates in their legal jurisdiction (we do not examine actual practice, only posted policies). Furthermore, this general statement appears to hold for every jurisdiction with privacy laws and a significant number of P3P policies, including European Union nations, Canada, Australia, and Web sites in the US Safe Harbor program.
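A minimal fragment of such an automated audit, assuming one looks only at the compact-policy tokens in the P3P HTTP header, could look like the sketch below; the flagged-token set is illustrative and is not the study's actual mapping from tokens to legal mandates.

```python
def audit_compact_policy(p3p_header, flagged_tokens=frozenset({"IND"})):
    """Split a P3P compact policy (e.g. the header value 'CP="NOI DSP COR IND"')
    into its tokens and report any that a hypothetical jurisdictional rule set
    flags. IND, the 'retained indefinitely' token, is used as the example here
    because indefinite retention conflicts with retention-limitation rules in
    several data-protection laws."""
    value = p3p_header.upper().replace("CP=", "").replace('"', "")
    return sorted(set(value.split()) & flagged_tokens)

# audit_compact_policy('CP="NOI DSP COR IND"') -> ['IND']
```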
Extraction and classification of dense implicit communities in the Web graph BIBAFull-Text 7
  Yon Dourisboure; Filippo Geraci; Marco Pellegrini
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is a growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching and classifying communities in the Web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs of the Web graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm on Web graphs built on three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density). It is still about 80% even for communities of twenty nodes with density over 50% of the arcs present. At the lower extreme, the algorithm catches 35% of dense communities made of ten nodes. We also develop some sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group by representative keywords.
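For context, a classical baseline for finding a dense core in a graph is greedy peeling (Charikar's 2-approximation for the densest-subgraph problem); the sketch below is that standard algorithm, not the authors' new scalable method.

```python
import heapq

def greedy_densest_subgraph(adj):
    """Greedy peeling: repeatedly remove the minimum-degree vertex and keep
    the intermediate subgraph of highest average degree (edges / nodes).
    adj: undirected graph as {node: set(neighbors)} with comparable node ids."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)
    alive = set(adj)
    edges = sum(degree.values()) // 2
    best_density, best_size = 0.0, len(alive)
    removed_order = []
    while alive:
        density = edges / len(alive)
        if density > best_density:
            best_density, best_size = density, len(alive)
        # pop the current minimum-degree live vertex (lazy deletion of stale entries)
        while True:
            d, u = heapq.heappop(heap)
            if u in alive and d == degree[u]:
                break
        alive.remove(u)
        removed_order.append(u)
        edges -= degree[u]
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                heapq.heappush(heap, (degree[v], v))
    # the best subgraph is the vertex set that was still alive at the best step
    best_nodes = set(adj) - set(removed_order[:len(adj) - best_size])
    return best_nodes, best_density
```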

TWEB 2009-06 Volume 3 Issue 3

IRLbot: Scaling to 6 billion pages and beyond BIBAFull-Text 8
  Hsin-Tsang Lee; Derek Leonard; Xiaoming Wang; Dmitri Loguinov
This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
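The per-host budgeting idea, capping how many URLs any one host may contribute to the frontier so that spam farms and script-generated loops cannot crowd out the rest of the crawl, can be illustrated with the toy class below; IRLbot's real budget enforcement and disk-based URL de-duplication are far more elaborate, and the in-memory structures here are assumptions.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class BudgetedFrontier:
    """Toy crawl frontier: admits at most `per_host_budget` URLs per host."""
    def __init__(self, per_host_budget=10_000):
        self.per_host_budget = per_host_budget
        self.admitted = defaultdict(int)   # host -> URLs admitted so far
        self.seen = set()                  # de-duplication (in memory, unlike IRLbot)
        self.queue = deque()

    def add(self, url):
        host = urlparse(url).netloc.lower()
        if url in self.seen or self.admitted[host] >= self.per_host_budget:
            return False
        self.seen.add(url)
        self.admitted[host] += 1
        self.queue.append(url)
        return True

    def next(self):
        return self.queue.popleft() if self.queue else None
```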
Cookies: A deployment study and the testing implications BIBAFull-Text 9
  Andrew F. Tappenden; James Miller
The results of an extensive investigation of cookie deployment amongst 100,000 Internet sites are presented. Cookie deployment is found to be approaching universal levels, and hence there is an associated need for relevant Web and software engineering processes, specifically testing strategies that actively consider cookies. The semi-automated investigation demonstrates that over two-thirds of the sites studied deploy cookies. The investigation specifically examines the use of first-party, third-party, sessional, and persistent cookies within Web-based applications, identifying the presence of a P3P policy and dynamic Web technologies as major predictors of cookie usage. The results are juxtaposed with the lack of testing strategies present in the literature. A number of real-world examples, including two case studies, are presented, further accentuating the need for comprehensive testing strategies for Web-based applications. The use of antirandom test case generation is explored with respect to the testing issues discussed. Finally, a number of seeding vectors are presented, providing a basis for testing cookies within Web-based applications.
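Antirandom test generation chooses each new test vector to be maximally distant from those already selected; a small exhaustive sketch follows, practical only for short binary vectors, with the first vector playing the role of a seeding vector. The all-zeros seed and Hamming distance are illustrative choices, not the article's specific construction.

```python
from itertools import product

def antirandom_sequence(n_bits, n_tests):
    """Generate binary test vectors, each maximizing its total Hamming
    distance to all previously selected vectors (ties broken arbitrarily)."""
    n_tests = min(n_tests, 2 ** n_bits)
    chosen = [(0,) * n_bits]                  # seeding vector
    candidates = list(product((0, 1), repeat=n_bits))
    while len(chosen) < n_tests:
        def total_distance(v):
            return sum(sum(a != b for a, b in zip(v, c)) for c in chosen)
        best = max((v for v in candidates if v not in chosen), key=total_distance)
        chosen.append(best)
    return chosen

# antirandom_sequence(3, 2) -> [(0, 0, 0), (1, 1, 1)]
```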
A framework for QoS-based Web service contracting BIBAFull-Text 10
  Marco Comuzzi; Barbara Pernici
The extensive adoption of Web service-based applications in dynamic business scenarios, such as on-demand computing or highly reconfigurable virtual enterprises, calls for methods and tools for the management of Web service nonfunctional aspects, such as Quality of Service (QoS). Concerning contracts on Web service QoS, the literature has mostly focused on contract definition and on mechanisms for contract enactment, such as monitoring the satisfaction of negotiated QoS guarantees. In this context, this article proposes a framework for the automation of Web service contract specification and establishment. An extensible model for defining both domain-dependent and domain-independent Web service QoS dimensions and a method for the automation of the contract establishment phase are proposed. We describe a matchmaking algorithm for the ranking of functionally equivalent services, which orders services on the basis of their ability to fulfill the service requestor's requirements while keeping the price below a specified budget. We also provide an algorithm for the configuration of the negotiable part of the QoS Service-Level Agreement (SLA), which is used to configure the agreement with the top-ranked service identified in the matchmaking phase. Experimental results show that, from a utility theory perspective, the contract establishment phase leads to efficient outcomes. We envision two advanced application scenarios for the Web service contracting framework proposed in this article. First, it can be used to enhance Web services' self-healing properties in reaction to QoS-related service failures; second, it can be exploited in process optimization for the online reconfiguration of the QoS SLAs of candidate Web services.
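A simplified view of the matchmaking step, ranking affordable services by a weighted utility over min-max-normalized QoS values, is sketched below; the service and weight structures are assumed, and the article's matchmaking and SLA-configuration algorithms are richer than this.

```python
def rank_services(services, weights, budget):
    """services: list of dicts with QoS dimensions plus a 'price' key.
    weights: {dimension: weight} supplied by the requestor (higher-is-better
    dimensions assumed; lower-is-better ones such as response time would be
    inverted before normalization). Services over budget are dropped."""
    affordable = [s for s in services if s["price"] <= budget]
    if not affordable:
        return []

    def normalizer(dim):
        vals = [s[dim] for s in affordable]
        lo, hi = min(vals), max(vals)
        return (lambda v: 1.0) if hi == lo else (lambda v: (v - lo) / (hi - lo))

    norms = {dim: normalizer(dim) for dim in weights}

    def utility(s):
        return sum(w * norms[dim](s[dim]) for dim, w in weights.items())

    return sorted(affordable, key=utility, reverse=True)

# Example: rank_services(candidates, {"availability": 0.6, "throughput": 0.4}, budget=50)
```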
Unified publication and discovery of semantic Web services BIBAFull-Text 11
  Thomi Pilioura; Aphrodite Tsalgatidou
The challenge of publishing and discovering Web services has recently received a lot of attention. Various solutions to this problem have been proposed which, apart from their offered advantages, suffer from the following disadvantages: (i) most of them are syntax-based, leading to poor precision and recall; (ii) they are not scalable to large numbers of services; and (iii) they are incompatible with one another, making service publication and discovery cumbersome. This article presents the principles, the functionality, and the design of PYRAMID-S, which addresses these disadvantages by providing a scalable framework for unified publication and discovery of semantically enhanced services over heterogeneous registries. PYRAMID-S uses a hybrid peer-to-peer topology to organize Web service registries based on domains. In such a topology, each registry retains its autonomy, meaning that it can use the publication and discovery mechanisms as well as the ontology of its choice. The viability of this approach is demonstrated through the implementation and experimental analysis of a prototype.

TWEB 2009-09 Volume 3 Issue 4

Trust and nuanced profile similarity in online social networks BIBAFull-Text 12
  Jennifer Golbeck
Online social networks, where users maintain lists of friends and express their preferences for items like movies, music, or books, are very popular. The Web-based nature of this information makes it ideal for use in a variety of intelligent systems that can take advantage of the users' social and personal data. For those systems to be effective, however, it is important to understand the relationship between social and personal preferences. In this work we investigate features of profile similarity and how those relate to the way users determine trust. Through a controlled study, we isolate several profile features beyond overall similarity that affect how much subjects trust hypothetical users. We then use data from FilmTrust, a real social network where users rate movies, and show that the profile features discovered in the experiment allow us to more accurately predict trust than when using only overall similarity. In this article, we present these experimental results and discuss the potential implications for using trust in user interfaces.
Search-as-a-service: Outsourced search over outsourced storage BIBAFull-Text 13
  Aameek Singh; Mudhakar Srivatsa; Ling Liu
With the fast-paced growth of digital data and exploding storage management costs, enterprises are looking for new ways to effectively manage their data. One such cost-effective paradigm is the cloud storage model, also referred to as Storage-as-a-Service, in which enterprises outsource their storage to a storage service provider (SSP) by storing data (usually encrypted) at a remote SSP-managed site and accessing it over a high-speed network. Along with the storage capacity used, the SSP often charges clients for the amount of data accessed from the SSP site. Thus, it is in the interest of the client enterprise to download only relevant content, which makes search over outsourced storage an important capability. Searching over encrypted outsourced storage, however, is a complex challenge. First, each enterprise has different access privileges for different users, and this access control needs to be preserved during search (for example, ensuring that a user cannot search through data that the filesystem permissions make inaccessible to that user). Second, the search mechanism has to preserve confidentiality from the SSP, and indices cannot be stored in plain text.
   In this article, we present a new filesystem search technique that integrates access control and indexing/search mechanisms into a unified framework to support access control aware search. Our approach performs indexing within the trusted enterprise domain and uses a novel access control barrel (ACB) primitive to encapsulate access control within these indices. The indices are then systematically encrypted and shipped to the SSP for hosting. Unlike existing enterprise search techniques, our approach is resilient to various common attacks that leak private information. Additionally, to the best of our knowledge, our approach is a first such technique that allows search indices to be hosted at the SSP site, thus effectively providing search-as-a-service. This does not require the client enterprise to fully trust the SSP for data confidentiality. We describe the architecture and implementation of our approach and a detailed experimental analysis comparing with other approaches.
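In the same spirit (though not the article's ACB construction), a per-access-group inverted index whose terms are replaced by keyed hashes before being shipped to the SSP can be sketched as follows; real deployments would also need posting-list padding, result encryption, and key management, and all names here are illustrative.

```python
import hashlib
import hmac
from collections import defaultdict

def build_encrypted_indices(documents, key):
    """documents: iterable of (doc_id, access_group, terms).
    key: bytes secret held within the enterprise.
    Builds one inverted index per access group (a rough stand-in for
    access-control-aware indexing) and replaces each term by an HMAC token,
    so the hosted index does not reveal plaintext terms to the SSP."""
    indices = defaultdict(lambda: defaultdict(set))
    for doc_id, group, terms in documents:
        for term in terms:
            token = hmac.new(key, term.encode(), hashlib.sha256).hexdigest()
            indices[group][token].add(doc_id)
    return {g: dict(ix) for g, ix in indices.items()}

def query(indices, group, term, key):
    """Look up a term for a user whose permissions map to `group`."""
    token = hmac.new(key, term.encode(), hashlib.sha256).hexdigest()
    return indices.get(group, {}).get(token, set())
```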
Emergence of consensus and shared vocabularies in collaborative tagging systems BIBAFull-Text 14
  Valentin Robu; Harry Halpin; Hana Shepherd
This article uses data from the social bookmarking site del.icio.us to empirically examine the dynamics of collaborative tagging systems and to study how coherent categorization schemes emerge from unsupervised tagging by individual users.
   First, we study the formation of stable distributions in tagging systems, seen as an implicit form of "consensus" reached by the users of the system around the tags that best describe a resource. We show that final tag frequencies for most resources converge to power law distributions and we propose an empirical method to examine the dynamics of the convergence process, based on the Kullback-Leibler divergence measure. The convergence analysis is performed for both the most utilized tags at the top of tag distributions and the so-called long tail.
   Second, we study the information structures that emerge from collaborative tagging, namely tag correlation (or folksonomy) graphs. We show how community-based network techniques can be used to extract simple tag vocabularies from the tag correlation graphs by partitioning them into subsets of related tags. Furthermore, we also show, for a specialized domain, that shared vocabularies produced by collaborative tagging are richer than the vocabularies which can be extracted from large-scale query logs provided by a major search engine.
   Although the empirical analysis presented in this article is based on a set of tagging data obtained from del.icio.us, the methods developed are general, and the conclusions should be applicable across other websites that employ tagging.
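The convergence measurement can be illustrated with a short sketch that computes the Kullback-Leibler divergence between a resource's tag distribution at successive snapshots, with values approaching zero indicating stabilization; the smoothing constant and the snapshot size of 100 tags are illustrative choices, not the article's parameters.

```python
import math
from collections import Counter

def kl_divergence(p, q, smoothing=1e-9):
    """KL(P || Q) over the union of the two tag vocabularies, with light
    smoothing so tags unseen in one snapshot do not produce infinities."""
    vocab = set(p) | set(q)
    ps, qs = sum(p.values()), sum(q.values())
    total = 0.0
    for tag in vocab:
        pi = (p.get(tag, 0) + smoothing) / (ps + smoothing * len(vocab))
        qi = (q.get(tag, 0) + smoothing) / (qs + smoothing * len(vocab))
        total += pi * math.log(pi / qi)
    return total

def convergence_trace(tag_events, step=100):
    """tag_events: chronological list of tags applied to one resource.
    Returns the KL divergence between the tag distributions at successive
    snapshots taken every `step` tagging events."""
    trace, prev = [], None
    for i in range(step, len(tag_events) + 1, step):
        current = Counter(tag_events[:i])
        if prev is not None:
            trace.append(kl_divergence(current, prev))
        prev = current
    return trace
```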