
ACM Transactions on The Web 4

Editors: Helen Ashman; Arun Iyengar
Dates: 2010
Volume: 4
Publisher: ACM
Standard No: ISSN 1559-1131; EISSN 1559-114X
Papers: 17
Links: Journal Home Page | ACM Digital Library | Table of Contents
  1. TWEB 2010-01 Volume 4 Issue 1
  2. TWEB 2010-04 Volume 4 Issue 2
  3. TWEB 2010-07 Volume 4 Issue 3
  4. TWEB 2010-09 Volume 4 Issue 4

TWEB 2010-01 Volume 4 Issue 1

Understanding transportation modes based on GPS data for web applications BIBAFull-Text 1
  Yu Zheng; Yukun Chen; Quannan Li; Xing Xie; Wei-Ying Ma
User mobility has given rise to a variety of Web applications, in which the global positioning system (GPS) plays many important roles in bridging between these applications and end users. As a kind of human behavior, transportation modes, such as walking and driving, can provide pervasive computing systems with more contextual information and enrich a user's mobility with informative knowledge. In this article, we report on an approach based on supervised learning to automatically infer users' transportation modes, including driving, walking, taking a bus and riding a bike, from raw GPS logs. Our approach consists of three parts: a change-point-based segmentation method, an inference model, and a graph-based post-processing algorithm. First, we propose a change-point-based segmentation method to partition each GPS trajectory into separate segments of different transportation modes. Second, from each segment, we identify a set of sophisticated features that are not affected by differing traffic conditions (e.g., a person's direction when in a car is constrained more by the road than by any change in traffic conditions). These features are then fed to a generative inference model to classify the segments into different modes. Third, we conduct graph-based post-processing to further improve the inference performance. This post-processing algorithm considers both the commonsense constraints of the real world and typical user behaviors based on locations in a probabilistic manner. Our method has three advantages over related work. (1) Our approach can effectively segment trajectories containing multiple transportation modes. (2) Our work mines location constraints from user-generated GPS logs and is independent of additional sensor data and map information such as road networks and bus stops. (3) The model learned from the data of some users can be applied to infer GPS data from others. Using the GPS logs collected by 65 people over a period of 10 months, we evaluated our approach via a set of experiments. Based on the change-point-based segmentation method and a Decision-Tree-based inference model, we achieved a prediction accuracy greater than 71 percent; the graph-based post-processing algorithm further improved performance by 4 percent.
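To make the change-point idea concrete (this is an illustrative sketch, not the authors' implementation), the following Python fragment cuts a GPS trajectory wherever the point-to-point speed crosses a walking threshold; the `Point` layout and the 2.5 m/s threshold are assumptions made for the example.

```python
# Hypothetical sketch of change-point-based trajectory segmentation:
# split a GPS log into segments wherever the motion switches between
# "walk-like" (slow) and "non-walk" (fast). Thresholds and data layout
# are illustrative assumptions, not values from the paper.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt
from typing import List

@dataclass
class Point:
    lat: float
    lon: float
    t: float  # seconds since start of log

def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points in meters."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))

def segment_by_speed(points: List[Point], walk_max_ms: float = 2.5) -> List[List[Point]]:
    """Cut the trajectory at every change point where the speed class flips."""
    segments, current = [], [points[0]]
    prev_is_walk = None
    for prev, curr in zip(points, points[1:]):
        dt = max(curr.t - prev.t, 1e-6)
        is_walk = haversine_m(prev, curr) / dt <= walk_max_ms
        if prev_is_walk is not None and is_walk != prev_is_walk:
            segments.append(current)          # change point: close the segment
            current = [prev]
        current.append(curr)
        prev_is_walk = is_walk
    segments.append(current)
    return segments
```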
A distributed service-oriented architecture for business process execution BIBAFull-Text 2
  Guoli Li; Vinod Muthusamy; Hans-Arno Jacobsen
The Business Process Execution Language (BPEL) standardizes the development of composite enterprise applications that make use of software components exposed as Web services. BPEL processes are currently executed by a centralized orchestration engine, in which issues such as scalability, platform heterogeneity, and division across administrative domains can be difficult to manage. We propose a distributed agent-based orchestration engine in which several lightweight agents each execute a portion of the original business process and collaborate to execute the complete process. The complete set of standard BPEL activities is supported, and the transformations of several BPEL activities to the agent-based architecture are described. Evaluations of an implementation of this architecture demonstrate that agent-based execution scales better than a non-distributed approach, with at least 70% and 120% improvements in process execution time and throughput, respectively, even with a large number of concurrent process instances. In addition, the distributed architecture successfully executes large processes that are shown to be infeasible to execute with a non-distributed engine.
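As a toy illustration of the partition-and-hand-off idea (not the paper's engine, and far simpler than real BPEL semantics), the sketch below splits a linear process across two "agent" threads that pass the process instance over queues; activity names and the partitioning are invented.

```python
# Toy illustration of distributed orchestration: a linear process is split
# into partitions, each executed by a lightweight "agent" thread that hands
# the process instance to the next agent over a queue. Real BPEL activities,
# fault handling, and agent placement are far richer than this sketch.
import queue, threading

def make_activity(name):
    def activity(instance):
        instance["trace"].append(name)   # stand-in for invoke/assign/receive
        return instance
    return activity

process = [make_activity(n) for n in ("receiveOrder", "checkCredit",
                                      "shipGoods", "sendInvoice")]
partitions = [process[:2], process[2:]]             # two agents, two activities each

links = [queue.Queue() for _ in range(len(partitions) + 1)]

def agent(activities, inbox, outbox):
    instance = inbox.get()
    for act in activities:
        instance = act(instance)
    outbox.put(instance)

threads = [threading.Thread(target=agent, args=(p, links[i], links[i + 1]))
           for i, p in enumerate(partitions)]
for t in threads:
    t.start()
links[0].put({"trace": []})                         # start a process instance
print(links[-1].get()["trace"])                     # all four activities ran in order
for t in threads:
    t.join()
```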
Declarative specification and verification of service choreographies BIBAFull-Text 3
  Marco Montali; Maja Pesic; Wil M. P. van der Aalst; Federico Chesani; Paola Mello; Sergio Storari
Service-oriented computing, an emerging paradigm for architecting and implementing business collaborations within and across organizational boundaries, is currently of interest to both software vendors and scientists. While the technologies for implementing and interconnecting basic services are reaching a good level of maturity, modeling service interaction from a global viewpoint, that is, representing service choreographies, is still an open challenge. The main problem is that, although declarativeness has been identified as a key feature, several proposed approaches specify choreographies by focusing on procedural aspects, leading to over-constrained and over-specified models.
   To overcome these limits, we propose to adopt DecSerFlow, a truly declarative language, to model choreographies. Thanks to its declarative nature, DecSerFlow semantics can be given in terms of logic-based languages. In particular, we present how DecSerFlow can be mapped onto Linear Temporal Logic and onto Abductive Logic Programming. We show how the mappings onto both formalisms can be concretely exploited to address the enactment of DecSerFlow models, to enrich its expressiveness and to perform a variety of different verification tasks. We illustrate the advantages of using a declarative language in conjunction with logic-based semantics by applying our approach to a running example.
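For illustration only, here are the standard LTL readings of two common DecSerFlow/Declare constraint templates; the article itself develops the full mapping (and the Abductive Logic Programming one), so take these as representative examples rather than its complete semantics.

```latex
% Two representative DecSerFlow constraint templates and their standard
% LTL readings (illustration only; the article defines the full mapping):
%   existence(a):   activity a eventually occurs
%   response(a,b):  every occurrence of a is eventually followed by b
\mathit{existence}(a)\;\equiv\;\Diamond\, a
\qquad
\mathit{response}(a,b)\;\equiv\;\Box\,(a \rightarrow \Diamond\, b)
```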

TWEB 2010-04 Volume 4 Issue 2

Ads-portal domains: Identification and measurements BIBAFull-Text 4
  Mishari Almishari; Xiaowei Yang
An ads-portal domain refers to a Web domain that shows only advertisements, served by a third-party advertisement syndication service, in the form of an ads listing. We develop a machine-learning-based classifier that identifies ads-portal domains with 96% accuracy. We use this classifier to measure the prevalence of ads-portal domains on the Internet. Surprisingly, 28.3%/25% of the (two-level) *.com/*.net web domains are ads-portal domains. Also, 41%/39.8% of *.com/*.net ads-portal domains are typos of well-known domains, also known as typo-squatting domains. In addition, we use the classifier along with DNS trace files to estimate how often Internet users visit ads-portal domains. It turns out that 5% of the two-level *.com, *.net, *.org, *.biz and *.info web domains on the traces are ads-portal domains, and 50% of these accessed ads-portal domains are typos. These numbers show that ads-portal domains and typo-squatting ads-portal domains are prevalent on the Internet and successful in attracting many visits. Our classifier represents a step towards better categorization of Web documents. It can also help search engine ranking algorithms, aid in identifying Web spam that redirects to ads-portal domains, and be used to discourage access to typo-squatting ads-portal domains.
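As a hedged sketch of the kind of classifier the abstract describes (the paper's actual feature set differs), one could train a decision tree on simple page-level features; the three features and the toy training data below are assumptions invented for the example.

```python
# Illustrative-only classifier for "ads-portal" vs. regular domains.
# The features and training rows are made up for this sketch.
from sklearn.tree import DecisionTreeClassifier

# feature vector: [fraction of links that are syndicated ads,
#                  kilobytes of visible text,
#                  1 if a known parking template is detected else 0]
X_train = [
    [0.95, 1.0, 1],   # ads-portal examples
    [0.90, 0.5, 1],
    [0.10, 40.0, 0],  # regular-site examples
    [0.05, 120.0, 0],
]
y_train = [1, 1, 0, 0]  # 1 = ads-portal domain, 0 = regular domain

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print(clf.predict([[0.88, 0.8, 1]]))  # -> [1], classified as ads-portal
```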
Reporting incentives and biases in online review forums BIBAFull-Text 5
  Radu Jurca; Florent Garcin; Arjun Talwar; Boi Faltings
Online reviews have become increasingly popular as a way to judge the quality of various products and services. However, recent work demonstrates that the absence of reporting incentives leads to a biased set of reviews that may not reflect the true quality. In this paper, we investigate the underlying factors that influence users when reporting feedback. In particular, we study both reporting incentives and reporting biases observed in a widely used review forum, the Tripadvisor Web site. We consider three sources of information: first, the numerical ratings left by the user for different aspects of quality; second, the textual comment accompanying a review; third, the patterns in the time sequence of reports. We first show that groups of users who discuss a certain feature at length are more likely to agree in their ratings. Second, we show that users are more motivated to give feedback when they perceive a greater risk involved in a transaction. Third, a user's rating partly reflects the difference between true quality and prior expectation of quality, as inferred from previous reviews. Finally, we observe that, because of these biases, the mean and the median of review scores can differ strongly. We speculate that the median may be a better way to summarize the ratings.
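The mean-versus-median point is easy to see on a skewed sample of reports; the ratings below are made up purely to illustrate the effect.

```python
# Made-up 1-5 star ratings illustrating why a few extreme (or biased)
# reports pull the mean away from the median.
from statistics import mean, median

ratings = [5, 5, 5, 5, 4, 2, 1, 1]
print(mean(ratings))    # 3.5 -- sensitive to the extreme reports
print(median(ratings))  # 4.5 -- closer to what the typical reviewer said
```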
Optimal distance bounds for fast search on compressed time-series query logs BIBAFull-Text 6
  Michail Vlachos; Suleyman S. Kozat; Philip S. Yu
Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it).
   Due to the sheer volume of time-series Weblogs, manipulating the logs in compressed form is a pressing necessity for fast data processing and compact storage. Here, we show how to compute lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on a careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method compared to previous compression/search techniques. The presented method results in a 10-30% improvement in distance estimation, which in turn leads to a 25-80% improvement in search performance.
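To show what a distance bound on compressed series looks like, here is a simplified sketch in which both sequences keep the same top Fourier positions plus their discarded energy; the paper's contribution is the harder, optimal case where each sequence keeps its own best coefficients, which this sketch does not reproduce.

```python
# Simplified sketch of distance bounding on compressed time series.
# Both sequences keep the SAME top-c Fourier positions here; the paper
# derives tighter, optimal bounds for arbitrary per-sequence coefficients.
import numpy as np

def compress(x, positions):
    """Keep the chosen DFT coefficients plus the discarded energy."""
    X = np.fft.fft(x, norm="ortho")            # orthonormal, so energy is preserved
    kept = X[positions]
    discarded_energy = np.sum(np.abs(X) ** 2) - np.sum(np.abs(kept) ** 2)
    return kept, discarded_energy

def distance_bounds(kept_x, ex, kept_y, ey):
    """Lower/upper bounds on the Euclidean distance of the original series."""
    d_kept_sq = np.sum(np.abs(kept_x - kept_y) ** 2)
    lower = np.sqrt(d_kept_sq + (np.sqrt(ex) - np.sqrt(ey)) ** 2)
    upper = np.sqrt(d_kept_sq + (np.sqrt(ex) + np.sqrt(ey)) ** 2)
    return lower, upper

rng = np.random.default_rng(0)
x, y = rng.standard_normal(256), rng.standard_normal(256)
pos = np.argsort(-np.abs(np.fft.fft(x, norm="ortho")))[:16]   # top coefficients of x
lo, hi = distance_bounds(*compress(x, pos), *compress(y, pos))
print(lo <= np.linalg.norm(x - y) <= hi)       # True: the true distance is bracketed
```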
Engineering rich internet applications with a model-driven approach BIBAFull-Text 7
  Piero Fraternali; Sara Comai; Alessandro Bozzon; Giovanni Toffetti Carughi
Rich Internet Applications (RIAs) have introduced powerful novel functionalities into the Web architecture, borrowed from client-server and desktop applications. The resulting platforms allow designers to improve the user's experience by exploiting client-side data and computation, bidirectional client-server communication, synchronous and asynchronous events, and rich interface widgets. However, the rapid evolution of RIA technologies challenges the Model-Driven Development methodologies that have been successfully applied in the past decade to traditional Web solutions. This paper illustrates an evolutionary approach for incorporating a wealth of RIA features into an existing Web engineering methodology and notation. The experience demonstrates that it is possible to model RIA application requirements at a high level using a platform-independent notation, and to generate the client-side and server-side code automatically. The resulting approach is evaluated in terms of expressive power, ease of use, and implementability.

TWEB 2010-07 Volume 4 Issue 3

A large-scale study on map search logs BIBAFull-Text 8
  Xiangye Xiao; Qiong Luo; Zhisheng Li; Xing Xie; Wei-Ying Ma
Map search engines, such as Google Maps, Yahoo! Maps, and Microsoft Live Maps, allow users to explicitly specify a target geographic location, either in keywords or on the map, and to search for businesses, people, and other information about that location. In this article, we report a first study on a million-entry map search log. We identify three key attributes of a map search record -- the keyword query, the target location, and the user location -- and examine the characteristics of these three dimensions separately as well as the associations between them. Comparing our results with those previously reported on logs of general search engines and mobile search engines, including those for geographic queries, we discover the following unique features of map search: (1) People use longer queries and modify queries more frequently in a session than in general search and mobile search, and view fewer result pages per query than in general search; (2) The popular query topics in map search are different from those in general search and mobile search; (3) The target locations in a session change within 50 kilometers for almost 80% of the sessions; (4) Queries, search target locations, and user locations (both at the city level) all follow a power-law distribution; (5) One third of queries are issued for target locations within 50 kilometers of the user locations; (6) The distribution of a query over target locations appears to follow the geographic location of the queried entity.
Modeling web quality using a probabilistic approach: An empirical validation BIBAFull-Text 9
  Ghazwa Malak; Houari Sahraoui; Linda Badri; Mourad Badri
Web-based applications are software systems that continuously evolve to meet users' needs and to adapt to new technologies. Assuring their quality is therefore a difficult but essential task. In fact, a large number of factors can affect their quality. Considering these factors and their interactions involves managing the uncertainty and subjectivity inherent in this kind of application. In this article, we present a probabilistic approach for building Web quality models and the associated assessment method. The proposed approach is based on Bayesian Networks. A model is built following a four-step process consisting of collecting quality characteristics, refining them, building a model structure, and deriving the model parameters.
   The feasibility of the approach is illustrated on the important quality characteristic of Navigability design. To validate the produced model, we conducted an experimental study with 20 subjects and 40 web pages. The results show that the scores produced by the model are strongly correlated with navigability as perceived and experienced by the users.
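To give a feel for the probabilistic machinery, here is a toy Bayesian-network fragment in the spirit of the approach; the two parent factors, the conditional probability table, and all numbers are invented for illustration and do not come from the paper's model.

```python
# Toy Bayesian-network fragment: two binary design factors influence a
# binary "navigability" node. Structure and probabilities are invented.
P_menu_ok = 0.7          # P(menu structure is adequate)
P_links_ok = 0.8         # P(link labels are clear)

# Conditional probability table: P(navigable | menu_ok, links_ok)
P_nav = {(True, True): 0.95, (True, False): 0.6,
         (False, True): 0.5,  (False, False): 0.1}

# Marginal P(navigable) by exact enumeration over the parents.
p_navigable = sum(
    P_nav[(m, l)]
    * (P_menu_ok if m else 1 - P_menu_ok)
    * (P_links_ok if l else 1 - P_links_ok)
    for m in (True, False) for l in (True, False)
)
print(round(p_navigable, 3))  # 0.742
```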
Privacy-preserving query log mining for business confidentiality protection BIBAFull-Text 10
  Barbara Poblete; Myra Spiliopoulou; Ricardo Baeza-Yates
We introduce the concern of confidentiality protection of business information in the publication of search engine query logs and derived data. We study business confidentiality as the protection of nonpublic data of institutions, such as companies, and of people in the public eye. In particular, we relate this concern to the involuntary exposure of confidential Web site information, and we transfer this problem into the field of privacy-preserving data mining. We characterize the possible adversaries interested in disclosing Web site confidential data and the attack strategies that they could use. These attacks are based on different vulnerabilities found in query logs, for which we present several anonymization heuristics to prevent them. We perform an experimental evaluation to estimate the remaining utility of the log after the application of our anonymization techniques. Our experimental results show that a query log can be anonymized against these specific attacks while retaining a significant volume of useful data.
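One generic anonymization heuristic of the flavor discussed here (illustrative only, not necessarily one of the paper's specific heuristics) is to publish a query only if enough distinct users issued it, so rare queries that could expose a single site's confidential traffic are suppressed:

```python
# Frequency-threshold anonymization sketch: keep a query only if at least
# k distinct users issued it. Illustrative heuristic, not the paper's.
from collections import defaultdict

def anonymize(log, k=5):
    """log: iterable of (user_id, query) pairs; returns the retained pairs."""
    users_per_query = defaultdict(set)
    for user_id, query in log:
        users_per_query[query].add(user_id)
    frequent = {q for q, users in users_per_query.items() if len(users) >= k}
    return [(u, q) for u, q in log if q in frequent]

sample_log = [("u1", "weather"), ("u2", "weather"), ("u3", "weather"),
              ("u4", "weather"), ("u5", "weather"), ("u6", "acme intranet login")]
print(anonymize(sample_log, k=5))  # the rare, potentially revealing query is dropped
```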
Exploring XML web collections with DescribeX BIBAFull-Text 11
  Mariano P. Consens; Renée J. Miller; Flavio Rizzolo; Alejandro A. Vaisman
As Web applications mature and evolve, the nature of the semistructured data that drives these applications also changes. An important trend is the need for increased flexibility in the structure of Web documents. Hence, applications cannot rely solely on schemas to provide the complex knowledge needed to visualize, use, query and manage documents. Even when XML Web documents are valid with regard to a schema, the actual structure of such documents may exhibit significant variations across collections for several reasons: the schema may be very lax (e.g., RSS feeds), the schema may be large and different subsets of it may be used in different documents (e.g., industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). For these reasons, many applications that incorporate XPath queries to process a large Web document collection require an understanding of the actual structure present in the collection, and not just the schema.
   To support modern Web applications, we introduce DescribeX, a powerful framework that is capable of describing complex XML summaries of Web collections. DescribeX supports the construction of heterogeneous summaries that can be declaratively defined and refined by means of axis path regular expressions (AxPREs). AxPREs provide the flexibility necessary for declaratively defining complex mappings between instance nodes (in the documents) and summary nodes. These mappings are capable of expressing order and cardinality, among other properties, which can significantly help in understanding the structure of large collections of XML documents and enhance the performance of Web applications over these collections. DescribeX captures most summary proposals in the literature by providing (for the first time) a common declarative definition for them. Experimental results demonstrate the scalability of DescribeX summary operations (summary creation, as well as refinement and stabilization, two key enablers for tailoring summaries) on multi-gigabyte Web collections.
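A much-simplified structural summary in this spirit partitions element nodes by their incoming label path (roughly a p* summary in DescribeX terms); real AxPRE-defined summaries are far more expressive (order, cardinality, arbitrary axes), and the tiny RSS fragment below is made up for the example.

```python
# Much-simplified structural summary: partition element nodes by their
# root-to-node label path and count the nodes in each partition.
from collections import Counter
import xml.etree.ElementTree as ET

def label_path_summary(xml_text):
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, path):
        path = path + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return counts

feed = """<rss><channel><title>t</title>
          <item><title>a</title></item>
          <item><title>b</title><enclosure/></item>
          </channel></rss>"""
for path, n in sorted(label_path_summary(feed).items()):
    print(n, path)   # e.g. 2 /rss/channel/item, 1 /rss/channel/item/enclosure
```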
Discovery of latent subcommunities in a blog's readership BIBAFull-Text 12
  Brett Adams; Dinh Phung; Svetha Venkatesh
The blogosphere has grown to be a mainstream forum of social interaction as well as a commercially attractive source of information and influence. Tools are needed to better understand how the communities that adhere to individual blogs are constituted, in order to facilitate new personal, socially-focused browsing paradigms and to understand how blog content is consumed, which is of interest to blog authors, big media, and search. We present a novel approach to blog subcommunity characterization: we model individual blog readers using mixtures of Ngram Topic over Time (NTOT), an extension to the LDA family that jointly models phrases and time, and cluster the readers with a number of similarity measures using Affinity Propagation. We experiment with two datasets: a small set of blogs whose authors provide feedback, and a set of popular, highly commented blogs, which provide indicators of algorithm scalability and interpretability without prior knowledge of a given blog. The results offer useful insight to the blog authors about their commenting community, and are observed to offer an integrated perspective on the topics of discussion and the members engaged in those discussions for unfamiliar blogs. Our approach also holds promise as a component of solutions to related problems, such as online entity resolution and role discovery.
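The clustering step can be sketched with scikit-learn's Affinity Propagation on a precomputed similarity matrix; the 4x4 similarities below are invented stand-ins for the NTOT-based measures used in the article.

```python
# Clustering blog readers from a precomputed similarity matrix with
# Affinity Propagation. The similarity values are invented for this sketch.
import numpy as np
from sklearn.cluster import AffinityPropagation

# readers 0,1 comment on similar topics; readers 2,3 form another group
similarity = np.array([
    [ 0.0,  0.9, -0.8, -0.7],
    [ 0.9,  0.0, -0.9, -0.8],
    [-0.8, -0.9,  0.0,  0.8],
    [-0.7, -0.8,  0.8,  0.0],
])

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(similarity)
print(labels)   # two subcommunities, e.g. [0 0 1 1]
```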

TWEB 2010-09 Volume 4 Issue 4

AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications BIBAFull-Text 13
  Emre Kiciman; Benjamin Livshits
The rise of the software-as-a-service paradigm has led to the development of a new breed of sophisticated, interactive applications often called Web 2.0. While Web applications have become larger and more complex, Web application developers today have little visibility into the end-to-end behavior of their systems. This article presents AjaxScope, a dynamic instrumentation platform that enables cross-user monitoring and just-in-time control of Web application behavior on end-user desktops. AjaxScope is a proxy that performs on-the-fly parsing and instrumentation of JavaScript code as it is sent to users' browsers. AjaxScope provides facilities for distributed and adaptive instrumentation in order to reduce the client-side overhead, while giving fine-grained visibility into the code-level behavior of Web applications. We present a variety of policies demonstrating the power of AjaxScope, ranging from simple error reporting and performance profiling to more complex memory leak detection and optimization analyses. We also apply our prototype to analyze the behavior of over 90 Web 2.0 applications and sites that use significant amounts of JavaScript.
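A toy stand-in for the simplest policy mentioned above (client-side error reporting) is shown below: a proxy hook that prepends a reporting stub to JavaScript responses. AjaxScope itself parses and rewrites the code; this sketch only injects a snippet, and the beacon URL is a made-up placeholder.

```python
# Toy proxy-side instrumentation hook: prepend an error-reporting stub to
# every JavaScript response. The beacon URL is a hypothetical placeholder.
ERROR_REPORTING_STUB = """
(function () {
  var prev = window.onerror;
  window.onerror = function (msg, url, line) {
    try { new Image().src = "/beacon?m=" + encodeURIComponent(msg) +
                            "&u=" + encodeURIComponent(url) + "&l=" + line; } catch (e) {}
    return prev ? prev.apply(this, arguments) : false;
  };
})();
"""

def instrument_js(response_body: str, content_type: str) -> str:
    """Called by the proxy for each response; instruments JavaScript only."""
    if "javascript" in content_type:
        return ERROR_REPORTING_STUB + response_body
    return response_body

print(instrument_js("console.log('app');", "application/javascript")[:40])
```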
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data BIBAFull-Text 14
  Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
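One small ingredient of this setting can be shown directly: from a positive sample of content sequences, compute the smallest k such that no symbol occurs more than k times in any single word, which bounds the k-ORE candidates worth considering. (Learning the k-OREs and the MDL-based selection are the paper's contribution; the sample below is invented.)

```python
# From a positive sample, find the smallest k such that every alphabet
# symbol occurs at most k times in each word -- the k of "k-ORE".
from collections import Counter

def min_k(sample):
    return max(max(Counter(word).values()) for word in sample)

# element sequences observed under some parent element, as symbol strings
sample = ["abc", "abbc", "acb"]
print(min_k(sample))   # 2 -> consider deterministic 2-ORE candidates
```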
Mining Historic Query Trails to Label Long and Rare Search Engine Queries BIBAFull-Text 15
  Peter Bailey; Ryen W. White; Han Liu; Giridhar Kumaran
Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user's search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users' search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present the comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially-similar queries as "documents") outperforms the others. We find that it is possible to correctly predict the top label better than one in five times, even when no past query trail exactly matches the long and rare query. We show that these labels can be used to reorder top-ranked search results leading to a significant improvement in retrieval performance over baselines that do not utilize query labeling, but instead rank results using content-matching or click-through logs. The outcomes of our research have implications for search providers attempting to provide users with highly-relevant search results for long queries.
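In the spirit of the "queries as documents" idea mentioned above, here is a minimal BM25-style matcher that scores a long, rare query against historic labeled queries and inherits the label of the best match; the historic data, labels, and parameter values are invented for this sketch and are not the article's.

```python
# Minimal BM25-style matcher: historic labeled queries act as "documents",
# and a long, rare query inherits the label of its best-scoring match.
import math
from collections import Counter

historic = [  # (query that has click/trail data, ODP-style label) -- made up
    ("cheap flights to rome italy", "Recreation/Travel"),
    ("python list comprehension examples", "Computers/Programming"),
    ("symptoms of seasonal flu in children", "Health/Conditions"),
]

docs = [q.split() for q, _ in historic]
avgdl = sum(len(d) for d in docs) / len(docs)
df = Counter(t for d in docs for t in set(d))
k1, b, N = 1.2, 0.75, len(docs)

def bm25(query_terms, doc):
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

long_rare_query = "very cheap last minute flights rome with small children".split()
best = max(range(N), key=lambda i: bm25(long_rare_query, docs[i]))
print(historic[best][1])   # -> Recreation/Travel
```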
Fast and Compact Web Graph Representations BIBAFull-Text 16
  Francisco Claude; Gonzalo Navarro
Compressed graph representations, in particular for Web graphs, have become an attractive research topic because of their applications in the manipulation of huge graphs in main memory. The state of the art is well represented by the WebGraph project, where advantage is taken of several particular properties of Web graphs to offer a trade-off between space and access time. In this paper we show that the same properties can be exploited with a different and elegant technique that builds on grammar-based compression. In particular, we focus on Re-Pair and on Ziv-Lempel compression, which, although they cannot reach the best compression ratios of WebGraph, achieve much faster navigation of the graph when both are tuned to use the same space. Moreover, the technique adapts well to run on secondary memory and in distributed scenarios. As a byproduct, we introduce an approximate Re-Pair version that works efficiently with severely limited main memory.
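A textbook Re-Pair pass over a small symbol sequence (for instance, a flattened adjacency list) looks like the sketch below: repeatedly replace the most frequent adjacent pair with a fresh nonterminal. The paper's contribution, an approximate Re-Pair that runs in severely limited memory over huge Web graphs, is not reflected in this sketch.

```python
# Textbook Re-Pair: repeatedly replace the most frequent adjacent pair
# with a fresh nonterminal symbol, recording the replacement as a rule.
from collections import Counter

def repair(seq, first_nonterminal=1000):
    rules, next_sym = {}, first_nonterminal
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)          # replace the pair
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

compressed, grammar = repair([1, 2, 3, 1, 2, 3, 1, 2, 4])
print(compressed, grammar)  # e.g. [1001, 1001, 1000, 4] {1000: (1, 2), 1001: (1000, 3)}
```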
Relating Reputation and Money in Online Markets BIBAFull-Text 17
  Ashwin Swaminathan; Renan G. Cattelan; Ydo Wexler; Cherian V. Mathew; Darko Kirovski
Reputation in online economic systems is typically quantified using counters that specify positive and negative feedback from past transactions and/or some form of transaction network analysis that aims to quantify the likelihood that a network user will commit a fraudulent transaction. These approaches can mislead honest users in numerous ways. We take a radically different approach with the goal of guaranteeing to a buyer that a fraudulent seller cannot disappear from the system with a profit after a set of fabricated transactions that total a certain monetary limit. Even in the case of stolen identity, such an adversary cannot produce illegal profit unless a buyer decides to pay over the suggested limit.