
ACM Transactions on The Web 2

Editors: Helen Ashman; Arun Iyengar
Standard No: ISSN 1559-1131; EISSN 1559-114X
Links: Journal Home Page | ACM Digital Library | Table of Contents
  1. TWEB 2008-02 Volume 2 Issue 1
  2. TWEB 2008-04 Volume 2 Issue 2
  3. TWEB 2008-07 Volume 2 Issue 3
  4. TWEB 2008-10 Volume 2 Issue 4

TWEB 2008-02 Volume 2 Issue 1

Introduction to special section on adversarial issues in Web search BIBFull-Text 1
  Marc Najork; Brian D. Davison
Link analysis for Web spam detection BIBAFull-Text 2
  Luca Becchetti; Carlos Castillo; Debora Donato; Ricardo Baeza-Yates; Stefano Leonardi
We propose link-based techniques for automatic detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. Web spam is widespread and difficult to combat, mostly because the sheer size of the Web renders many algorithms infeasible in practice.
   We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
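The evaluation pipeline described above, link-derived features feeding classifiers scored by tenfold cross-validation, can be sketched with a deliberately tiny one-feature classifier. The feature values and the midpoint-threshold rule below are illustrative assumptions, not the paper's actual features or learner.

```python
import random

def midpoint_threshold(xs, ys):
    # Train: place the threshold halfway between the mean feature value
    # of the spam (True) and nonspam (False) pages.
    spam = [x for x, y in zip(xs, ys) if y]
    ham = [x for x, y in zip(xs, ys) if not y]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def tenfold_accuracy(xs, ys, k=10, seed=0):
    # k-fold cross-validation of the one-feature rule
    # "feature > threshold => spam".
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        t = midpoint_threshold([xs[i] for i in train],
                               [ys[i] for i in train])
        correct = sum((xs[i] > t) == ys[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

With one well-separated feature the cross-validated accuracy is perfect; the paper's classifiers combine many link features computed over the full Web graph.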
Tracking Web spam with HTML style similarities BIBAFull-Text 3
  Tanguy Urvoy; Emmanuel Chauveau; Pascal Filoche; Thomas Lavergne
Automatically generated content is ubiquitous on the Web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs, and other sites edited using web authoring software), as are less legitimate spamdexing attempts (e.g., link farms and faked directories).
   Those pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more related to stylometry.
   In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.
   We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
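A minimal sketch of the LSH clustering idea above: represent each page by n-grams over its HTML tag sequence (a stylometric feature that ignores textual content), compute a minhash signature, and group pages whose signatures collide in at least one band. The shingle size, band layout, and hash choice here are illustrative assumptions, not the paper's configuration.

```python
import hashlib
import re
from collections import defaultdict

def tag_shingles(html, n=3):
    # Stylometric features: n-grams over the sequence of tag names,
    # ignoring the page's textual content entirely.
    tags = re.findall(r"</?([a-zA-Z][a-zA-Z0-9]*)", html)
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def minhash(shingles, num_hashes=32):
    # One min value per hash function; md5 keyed by the hash index
    # keeps the signature deterministic across runs.
    return tuple(
        min(int(hashlib.md5(f"{h}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for h in range(num_hashes))

def lsh_buckets(pages, bands=8, rows=4):
    # Pages landing in the same bucket for any band are style candidates
    # for the same generating template or script.
    buckets = defaultdict(set)
    for name, html in pages.items():
        sig = minhash(tag_shingles(html), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return [group for group in buckets.values() if len(group) > 1]
```

Two pages generated from the same template share their tag shingles and therefore collide in every band, even when their visible text is completely different.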
Detecting splogs via temporal dynamics using self-similarity analysis BIBAFull-Text 4
  Yu-Ru Lin; Hari Sundaram; Yun Chi; Junichi Tatemura; Belle L. Tseng
This article addresses the problem of spam blog (splog) detection using temporal and structural regularity of content, post time and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs represent popular online media, and splogs not only degrade the quality of search engine results, but also waste network resources. The splog detection problem is made difficult due to the lack of stable content descriptors.
   We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study the blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlation between attributes and posts, depending on the type of blogs (normal blogs and splogs). (c) We propose two types of novel temporal features to capture the splog temporal characteristics. In our splog detector, these novel features are combined with content based features. We extract a content based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using proposed features on real world datasets, with appreciable results (90% accuracy).
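The self-similarity matrix at the heart of idea (a) can be sketched directly: compare every pair of posts with the histogram intersection measure. The toy histograms below are assumptions for illustration; the article builds such matrices over time, content, and link attributes.

```python
def histogram_intersection(h1, h2):
    # Similarity of two equally binned, L1-normalized histograms:
    # the sum of bin-wise minima, 1.0 for identical histograms.
    return sum(min(a, b) for a, b in zip(h1, h2))

def self_similarity_matrix(histograms):
    # S[i][j] compares posts i and j; regular, block-like structure in S
    # is the kind of visual signature that separates splogs (machine-
    # generated regularity) from normal blogs.
    n = len(histograms)
    return [[histogram_intersection(histograms[i], histograms[j])
             for j in range(n)] for i in range(n)]
```

The matrix is symmetric with a unit diagonal, and repeated near-identical posts, typical of splogs, show up as bright off-diagonal blocks.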
Not quite the average: An empirical study of Web use BIBAFull-Text 5
  Harald Weinreich; Hartmut Obendorf; Eelco Herder; Matthias Mayer
In the past decade, the World Wide Web has been subject to dramatic changes. Web sites have evolved from static information resources to dynamic and interactive applications that are used for a broad scope of activities on a daily basis. To examine the consequences of these changes on user behavior, we conducted a long-term client-side Web usage study with twenty-five participants. This report presents results of this study and compares the user behavior with previous long-term browser usage studies conducted seven to thirteen years earlier. Based on the empirical data and the interview results, various implications for the interface design of browsers and Web sites are discussed.
   A major finding is the decreasing prominence of backtracking in Web navigation. This can largely be attributed to the increasing importance of dynamic, service-oriented Web sites. Users do not navigate on these sites searching for information, but rather interact with an online application to complete certain tasks. Furthermore, the usage of multiple windows and tabs has partly replaced back button usage, posing new challenges for user orientation and backtracking. We found that Web browsing is a rapid activity even for pages with substantial content, which calls for page designs that allow for cursory reading. Click maps provide additional information on how users interact with the Web on page level. Finally, substantial differences were observed between users, and characteristic usage patterns for different types of Web sites emphasize the need for more adaptive and customizable Web browsers.
Framework for Web service query algebra and optimization BIBAFull-Text 6
  Qi Yu; Athman Bouguettaya
We present a query algebra that supports optimized access of Web services through service-oriented queries. The service query algebra is defined based on a formal service model that provides a high-level abstraction of Web services across an application domain. The algebra defines a set of algebraic operators. Algebraic service queries can be formulated using these operators. This allows users to query their desired services based on both functionality and quality. We provide the implementation of each algebraic operator. This enables the generation of Service Execution Plans (SEPs) that can be used by users to directly access services. We present an optimization algorithm by extending the Dynamic Programming (DP) approach to efficiently select the SEPs with the best user-desired quality. The experimental study validates the proposed algorithm by demonstrating significant performance improvement compared with the traditional DP approach.
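The optimizer described above selects Service Execution Plans by dynamic programming. A toy DP over a linear chain of algebraic operators, with per-service costs and a cost for piping one service's output into the next, illustrates the table-filling structure; the operator chain, cost model, and names are assumptions, not the paper's algebra.

```python
def best_plan(stages, hop_cost):
    # stages[i]: list of (service, cost) candidates for the i-th operator;
    # hop_cost(a, b): cost of feeding service a's output into service b.
    # best[s] = (cheapest cost of a partial plan ending in s, that plan)
    best = {s: (c, [s]) for s, c in stages[0]}
    for stage in stages[1:]:
        nxt = {}
        for s, c in stage:
            prev, (prev_cost, plan) = min(
                best.items(),
                key=lambda kv: kv[1][0] + hop_cost(kv[0], s))
            nxt[s] = (prev_cost + hop_cost(prev, s) + c, plan + [s])
        best = nxt
    return min(best.values())
```

Each stage reuses the optimal partial plans of the previous stage, which is the defining DP property the paper's quality-aware extension builds on.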
Scalable semantic analytics on social networks for addressing the problem of conflict of interest detection BIBAFull-Text 7
  Boanerges Aleman-Meza; Meenakshi Nagarajan; Li Ding; Amit Sheth; I. Budak Arpinar; Anupam Joshi; Tim Finin
In this article, we demonstrate the applicability of semantic techniques for detection of Conflict of Interest (COI). We explain the common challenges involved in building scalable Semantic Web applications, in particular those addressing connecting-the-dots problems. We describe in detail the challenges involved in two important aspects of building Semantic Web applications, namely, data acquisition and entity disambiguation (or reference reconciliation). We build upon our previous work where we integrated the collaborative network of a subset of DBLP researchers with persons in a Friend-of-a-Friend social network (FOAF). Our method finds the connections between people, measures collaboration strength, and includes heuristics that use friendship/affiliation information to provide an estimate of potential COI in a peer-review scenario. Evaluations are presented by measuring what could have been the COI between accepted papers in various conference tracks and their respective program committee members. The experimental results demonstrate that scalability can be achieved by using a dataset of over 3 million entities (all bibliographic data from DBLP and a large collection of FOAF documents).
Adaptive quality of service management for enterprise services BIBAFull-Text 8
  Daniel Gmach; Stefan Krompass; Andreas Scholz; Martin Wimmer; Alfons Kemper
In the past, enterprise resource planning systems were designed as monolithic software systems running on centralized mainframes. Today, these systems are (re-)designed as a repository of enterprise services that are distributed throughout the available computing infrastructure. These service oriented architectures (SOAs) require advanced automatic and adaptive management concepts in order to achieve a high quality of service level in terms of, for example, availability, responsiveness, and throughput. The adaptive management has to allocate service instances to computing resources, adapt the resource allocation to unforeseen load fluctuations, and intelligently schedule individual requests to guarantee negotiated service level agreements (SLAs). Our AutoGlobe platform provides such a comprehensive adaptive service management comprising
    -- static service-to-server allocation based on automatically detected service utilization patterns,
    -- adaptive service management based on a fuzzy controller that remedies exceptional situations by automatically initiating, for example, service migration or service replication (scale-out), and
    -- adaptive scheduling of individual service requests that prioritizes requests depending on the current degree of service level conformance.
   All three complementary control components are described in detail, and their effectiveness is analyzed by means of realistic business application scenarios.
Discovering global network communities based on local centralities BIBAFull-Text 9
  Bo Yang; Jiming Liu
One of the central problems in studying and understanding complex networks, such as online social networks or the World Wide Web, is to discover hidden, well-defined topological structures, whether defined physically (e.g., by interactions or hyperlinks) or logically (e.g., by profiles or semantics). From a practical point of view, a good example of such structures would be so-called network communities. Earlier studies have introduced various formulations as well as methods for the problem of identifying or extracting communities. While each of them has pros and cons as far as effectiveness and efficiency are concerned, almost none of them has explicitly dealt with the potential relationship between the global topological property of a network and the local property of individual nodes. In order to study this problem, this paper presents a new algorithm, called ICS, which aims to discover natural network communities by inferring from the local information of nodes inherently hidden in networks, based on a new centrality measure, clustering centrality, which generalizes eigenvector centrality. As compared with existing methods, our method runs efficiently with a good clustering performance. Additionally, it is insensitive to its built-in parameters and prior knowledge.
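For contrast with the clustering centrality that ICS introduces, plain eigenvector centrality, the measure it generalizes, is just power iteration on the adjacency matrix. This sketch is the textbook baseline, not the paper's algorithm.

```python
def eigenvector_centrality(adj, iters=100):
    # Power iteration on A + I; the +I shift keeps the iteration from
    # oscillating on bipartite graphs. adj[i][j] = 1 if i and j are linked.
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        top = max(y)
        x = [v / top for v in y]
    return x
```

On a star graph the hub receives the highest score and the leaves tie, matching the intuition that a node is central when its neighbors are central.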

TWEB 2008-04 Volume 2 Issue 2

Introduction to special issue on service oriented computing (SOC) BIBFull-Text 10
  Schahram Dustdar; Bernd J. Krämer
Automatic annotation of Web services based on workflow definitions BIBAFull-Text 11
  Khalid Belhajjame; Suzanne M. Embury; Norman W. Paton; Robert Stevens; Carole A. Goble
Semantic annotations of web services can support the effective and efficient discovery of services, and guide their composition into workflows. At present, however, the practical utility of such annotations is limited by the small number of service annotations available for general use. Manual annotation of services is a time-consuming and thus expensive task, so some means are required by which services can be automatically (or semi-automatically) annotated. In this paper, we show how information can be inferred about the semantics of operation parameters based on their connections to other (annotated) operation parameters within tried-and-tested workflows. Because the data links in the workflows do not necessarily contain every possible connection of compatible parameters, we can infer only constraints on the semantics of parameters. We show that despite their imprecise nature these so-called loose annotations are still of value in supporting the manual annotation task, inspecting workflows, and discovering services. We also show that derived annotations for already annotated parameters are useful. By comparing existing and newly derived annotations of operation parameters, we can support the detection of errors in existing annotations, in the ontology used for annotation, and in workflows. The derivation mechanism has been implemented, and its practical applicability for inferring new annotations has been established through an experimental evaluation. The usefulness of the derived annotations is also demonstrated.
Correctness-aware high-level functional matching approaches for semantic Web services BIBAFull-Text 12
  Islam Elgedawy; Zahir Tari; James A. Thom
Service matching approaches trade precision for recall, creating the need for users to choose the correct services, which obviously is a major obstacle for automating the service discovery and aggregation processes. Our approach to overcoming this problem is to eliminate the appearance of false positives by returning only the correct services. As different users have different semantics for what is correct, we argue that the correctness of the matching results must be determined according to the achievement of users' goals: only services achieving users' goals are considered correct. To determine such correctness, we argue that the matching process should be based primarily on the high-level functional specifications (namely goals, achievement contexts, and external behaviors). In this article, we propose models, data structures, algorithms, and theorems required to correctly match such specifications. We propose a model called G+, to capture such specifications, for both services and users, in a machine-understandable format. We propose a data structure, called a Concepts Substitutability Graph (CSG), to capture the substitution semantics of application domain concepts in a context-based manner, in order to determine the semantic-preserving mapping transformations required to match different G+ models. We also propose a behavior matching approach that is able to match states in an m-to-n manner, such that behavior models with different numbers of state transitions can be matched. Finally, we show how services are matched and aggregated according to their G+ models. Results of supporting experiments demonstrate the advantages of the proposed service matching approaches.
Supporting the dynamic evolution of Web service protocols in service-oriented architectures BIBAFull-Text 13
  Seung Hwan Ryu; Fabio Casati; Halvard Skogsrud; Boualem Benatallah; Régis Saint-Paul
In service-oriented architectures, everything is a service and everyone is a service provider. Web services (or simply services) are loosely coupled software components that are published, discovered, and invoked across the Web. As the use of Web services grows, in order to correctly interact with them, it is important to understand the business protocols that provide clients with the information on how to interact with services. In dynamic Web service environments, service providers need to constantly adapt their business protocols to reflect the restrictions and requirements introduced by new applications, new business strategies, and new laws, or to fix problems found in the protocol definition. However, the effective management of such protocol evolution raises critical problems: one of the most critical issues is how to handle instances running under the old protocol when it has been changed. Simple solutions, such as aborting them or allowing them to continue to run according to the old protocol, can be considered, but they are inapplicable for many reasons (for example, the loss of work already done and the critical nature of the work). In this article, we present a framework that supports service managers in managing business protocol evolution by providing several features, such as protocol change impact analyses that automatically determine which ongoing instances can be migrated to the new version of the protocol, and data mining techniques that infer interaction patterns used to classify ongoing instances as migratable to the new protocol. To support the protocol evolution process, we have also developed database-backed GUI tools on top of our existing system. The proposed approach and tools can help service managers in managing the evolution of ongoing instances when the business protocols of services with which they are interacting have changed.
An environment for flexible advanced compensations of Web service transactions BIBAFull-Text 14
  Michael Schäfer; Peter Dolog; Wolfgang Nejdl
Business-to-business integration is increasingly performed using Web service environments, which are now provided by major players in the technology market. Those environments are based on open specifications for transaction coordination. When a failure in such an environment occurs, a compensation can be initiated to recover from the failure. However, current environments have only limited capabilities for compensations, and are usually based on backward recovery. In this article, we introduce an environment to deal with advanced compensations based on forward recovery principles. We extend the existing Web service transaction coordination architecture and infrastructure in order to support flexible compensation operations. We use a contract-based approach, which allows the specification of permitted compensations at runtime. We introduce abstract service and adapter components, which allow us to separate the compensation logic from the coordination logic. In this way, we can easily plug in or plug out different compensation strategies based on a specification language defined on top of basic compensation activities and complex compensation types. Experiments with our approach and environment show that such an approach to compensation is feasible and beneficial. Additionally, we introduce a cost-benefit model to evaluate the proposed environment based on net value analysis. The evaluation shows in which circumstances the environment is economical.

TWEB 2008-07 Volume 2 Issue 3

Mitigating application-level denial of service attacks on Web servers: A client-transparent approach BIBAFull-Text 15
  Mudhakar Srivatsa; Arun Iyengar; Jian Yin; Ling Liu
Recently, we have seen increasing numbers of denial of service (DoS) attacks against online services and Web applications either for extortion reasons or for impairing and even disabling the competition. These DoS attacks have increasingly targeted the application level. Application-level DoS attacks emulate the same request syntax and network-level traffic characteristics as those of legitimate clients, thereby making the attacks much harder to detect and counter. Moreover, such attacks often target bottleneck resources such as disk bandwidth, database bandwidth, and CPU resources. In this article, we propose handling DoS attacks by using a twofold mechanism. First, we perform admission control to limit the number of concurrent clients served by the online service. Admission control is based on port hiding that renders the online service invisible to unauthorized clients by hiding the port number on which the service accepts incoming requests. Second, we perform congestion control on admitted clients to allocate more resources to good clients. Congestion control is achieved by adaptively setting a client's priority level in response to the client's requests in a way that can incorporate application-level semantics. We present a detailed evaluation of the proposed solution using two sample applications: Apache HTTPD and the TPCW benchmark (running on Apache Tomcat and IBM DB2). Our experiments show that the proposed solution incurs low performance overhead and is resilient to DoS attacks.
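The abstract does not spell out how port hiding is realized, but one plausible sketch (entirely our assumption) derives the listening port from a keyed hash over the client identity and the current time window, so that only authorized clients can compute where the service accepts requests.

```python
import hashlib
import hmac

def hidden_port(secret, client_id, time_window, lo=20000, hi=60000):
    # Only clients holding the shared secret can recompute the port the
    # service currently listens on for them; the port moves every window,
    # keeping the service invisible to unauthorized clients.
    msg = f"{client_id}:{time_window}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return lo + int.from_bytes(digest[:4], "big") % (hi - lo)
```

The derivation is deterministic for a given (secret, client, window) triple, which is what lets client and server agree on the port without extra communication.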
Leveraging popular destinations to enhance Web search interaction BIBAFull-Text 16
  Ryen W. White; Mikhail Bilenko; Silviu Cucerzan
This article presents a novel Web search interaction feature that for a given query provides links to Web sites frequently visited by other users with similar information needs. These popular destinations complement traditional search results, allowing direct navigation to authoritative resources for the query topic. Destinations are identified using the history of the search and browsing behavior of many users over an extended time period, and their collective behavior provides a basis for computing source authority. They are drawn from the end of users' postquery browse trails where users may cease searching once they find relevant information. We describe a user study that compared the suggestion of destinations with the previously proposed suggestion of related queries as well as with traditional, unaided Web search. Results show that search enhanced by query suggestions outperforms other systems in terms of subject perceptions and search effectiveness for fact-finding search tasks. However, search enhanced by destination suggestions performs best for exploratory tasks with its best performance obtained from mining past user behavior at query-level granularity. We discuss the implications of these and other findings from our study for the design of search systems that utilize user behavior, in particular, user browse trails and popular destinations.
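The core data transformation above, mining destinations from the ends of postquery browse trails, is easy to sketch. The trail representation below is an assumption for illustration; the article mines large-scale behavioral logs and computes source authority from collective behavior.

```python
from collections import Counter

def popular_destinations(trails, query, top=3):
    # trails: (query, [page1, page2, ...]) postquery browse trails.
    # The destination is the trail's final page: the point at which the
    # user stopped searching, presumably having found relevant content.
    ends = Counter(pages[-1] for q, pages in trails
                   if q == query and pages)
    return [url for url, _ in ends.most_common(top)]
```

Aggregating trail endpoints across many users is what turns individual browsing behavior into query-level destination suggestions.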
Models and framework for supporting runtime decisions in Web-based systems BIBAFull-Text 17
  Mauro Andreolini; Sara Casolari; Michele Colajanni
Efficient management of distributed Web-based systems requires several mechanisms that decide on request dispatching, load balancing, admission control, and request redirection. The algorithms behind these mechanisms typically make fast decisions on the basis of the load conditions of the system resources. The architectural complexity and workloads characterizing most Web-based services make it extremely difficult to deduce a representative view of a resource load from collected measures that show extreme variability even at different time scales. Hence, any decision based on instantaneous or average views of the system load may lead to useless or even wrong actions. As an alternative, we propose a two-phase strategy that first aims to obtain a representative view of the load trend from measured system values and then applies this representation to support runtime decision systems. We consider two classical problems behind decisions: how to detect significant and nontransient load changes of a system resource and how to predict its future load behavior. The two-phase strategy is based on stochastic functions that are characterized by a computational complexity that is compatible with runtime decisions. We describe, test, and tune the two-phase strategy by considering as a first example a multitier Web-based system that is subject to different classes of realistic and synthetic workloads. Also, we integrate the proposed strategy into a framework that we validate by applying it to support runtime decisions in a cluster Web system and in a locally distributed Network Intrusion Detection System.
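The two-phase idea, smooth raw measures into a trend, then decide on the trend, can be sketched with exponential smoothing and a windowed change test. These are stand-ins chosen for brevity; the article's stochastic functions and thresholds differ.

```python
def ewma_trend(samples, alpha=0.1):
    # Phase one: smooth raw, highly variable load measures into a
    # representative load trend.
    trend = [samples[0]]
    for s in samples[1:]:
        trend.append(alpha * s + (1 - alpha) * trend[-1])
    return trend

def load_change_detected(trend, window=5, delta=0.2):
    # Phase two: flag a significant, nontransient change only when the
    # *trend* (never an instantaneous sample) moves by more than delta
    # across the last `window` points.
    return len(trend) >= window and abs(trend[-1] - trend[-window]) > delta
```

A single load spike barely moves the smoothed trend, so it is ignored, while a sustained load step shifts the trend past the threshold within a few samples.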

TWEB 2008-10 Volume 2 Issue 4

Introduction to special issue on query log analysis: Technology and ethics BIBFull-Text 18
  Einat Amitay; Andrei Broder
A survey of query log privacy-enhancing techniques from a policy perspective BIBAFull-Text 19
  Alissa Cooper
As popular search engines face the sometimes conflicting interests of protecting privacy while retaining query logs for a variety of uses, numerous technical measures have been suggested to both enhance privacy and preserve at least a portion of the utility of query logs. This article seeks to assess seven of these techniques against three sets of criteria: (1) how well the technique protects privacy, (2) how well the technique preserves the utility of the query logs, and (3) how well the technique might be implemented as a user control. A user control is defined as a mechanism that allows individual Internet users to choose to have the technique applied to their own query logs.
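To make the flavor of such techniques concrete, here is one common scrubbing combination, a salted-hash pseudonym for the user plus IP truncation, offered purely as an illustration of the privacy/utility trade-off, not as one of the article's seven techniques verbatim.

```python
import hashlib

def scrub(entry, salt):
    # entry: (user_id, ip, query). Pseudonymize the user with a salted
    # hash and coarsen the IP to its /24 prefix. The query text is kept
    # intact, preserving most of the log's analytical utility while
    # weakening the link back to an individual.
    user, ip, query = entry
    pseudonym = hashlib.sha256(salt + user.encode()).hexdigest()[:16]
    coarse_ip = ".".join(ip.split(".")[:3]) + ".0"
    return (pseudonym, coarse_ip, query)
```

Because the pseudonym is stable per user (for a fixed salt), session-level analyses still work, which is precisely why such logs can remain re-identifiable; rotating or discarding the salt trades utility back for privacy.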
Design trade-offs for search engine caching BIBAFull-Text 20
  Ricardo Baeza-Yates; Aristides Gionis; Flavio P. Junqueira; Vanessa Murdock; Vassilis Plachouras; Fabrizio Silvestri
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
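A minimal greedy sketch in the spirit of frequency/size-based static caching of posting lists: rank terms by query frequency per unit of posting-list size and fill the cache in that order. The statistics and capacity here are toy assumptions; the paper's algorithm and evaluation are considerably richer.

```python
def static_posting_cache(stats, capacity):
    # stats: {term: (query_frequency, posting_list_size)}. Rank terms by
    # the frequency-to-size ratio (expected hits per unit of cache space)
    # and fill the cache greedily until capacity is exhausted.
    order = sorted(stats, key=lambda t: stats[t][0] / stats[t][1],
                   reverse=True)
    cached, used = set(), 0
    for term in order:
        size = stats[term][1]
        if used + size <= capacity:
            cached.add(term)
            used += size
    return cached
```

Note how a popular but huge posting list can lose its slot to several small, moderately popular lists, the trade-off that makes posting-list caching outperform answer caching in hit rate.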
Learning about the world through long-term query logs BIBAFull-Text 21
  Matthew Richardson
In this article, we demonstrate the value of long-term query logs. Most work on query logs to date considers only short-term (within-session) query information. In contrast, we show that long-term query logs can be used to learn about the world we live in. There are many applications of this that lead not only to improving the search engine for its users, but also potentially to advances in other disciplines such as medicine, sociology, economics, and more. In this article, we will show how long-term query logs can be used for these purposes, and that their potential is severely reduced if the logs are limited to short time horizons. We show that query effects are long-lasting, provide valuable information, and might be used to automatically make medical discoveries, build concept hierarchies, and generally learn about the sociological behavior of users. We believe these applications are only the beginning of what can be done with the information contained in long-term query logs, and see this work as a step toward unlocking their potential.
Combating spam in tagging systems: An evaluation BIBAFull-Text 22
  Georgia Koutrika; Frans Adjie Effendi; Zoltán Gyöngyi; Paul Heymann; Hector Garcia-Molina
Tagging systems allow users to interactively annotate a pool of shared resources using descriptive strings called tags. Tags are used to guide users to interesting resources and help them build communities that share their expertise and resources. As tagging systems are gaining in popularity, they become more susceptible to tag spam: misleading tags that are generated in order to increase the visibility of some resources or simply to confuse users. Our goal is to understand this problem better. In particular, we are interested in answers to questions such as: How many malicious users can a tagging system tolerate before results significantly degrade? What types of tagging systems are more vulnerable to malicious attacks? What would be the effort and the impact of employing a trusted moderator to find bad postings? Can a system automatically protect itself from spam, for instance, by exploiting user tag patterns? In a quest for answers to these questions, we introduce a framework for modeling tagging systems and user tagging behavior. We also describe a method for ranking documents matching a tag based on taggers' reliability. Using our framework, we study the behavior of existing approaches under malicious attacks and the impact of a moderator and our ranking method.
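The coincidence-based intuition behind ranking by taggers' reliability can be sketched as follows: a posting counts as supported when at least one other user applied the same tag to the same document. This toy score is our simplification; the article's model and ranking method are richer.

```python
from collections import defaultdict

def tagger_reliability(postings):
    # postings: set of (user, doc, tag) triples. A user's reliability is
    # the fraction of their postings that at least one other user
    # independently repeats; spammers' idiosyncratic tags find no support.
    support = defaultdict(set)
    for user, doc, tag in postings:
        support[(doc, tag)].add(user)
    per_user = defaultdict(list)
    for user, doc, tag in postings:
        per_user[user].append(len(support[(doc, tag)]) > 1)
    return {u: sum(hits) / len(hits) for u, hits in per_user.items()}
```

Documents can then be ranked for a tag by weighting each matching posting with its tagger's reliability, which damps the influence of spam postings.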