
ACM Transactions on The Web 5

Editors: Helen Ashman; Arun Iyengar
Standard No: ISSN 1559-1131; EISSN 1559-114X
Links: Journal Home Page | ACM Digital Library | Table of Contents
  1. TWEB 2011-02 Volume 5 Issue 1
  2. TWEB 2011-05 Volume 5 Issue 2
  3. TWEB 2011-07 Volume 5 Issue 3
  4. TWEB 2011-10 Volume 5 Issue 4

TWEB 2011-02 Volume 5 Issue 1

Introduction to special issue on recommender systems BIBFull-Text 1
  John Riedl; Barry Smyth
Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems BIBAFull-Text 2
  Fidel Cacheda; Víctor Carneiro; Diego Fernández; Vreixo Formoso
The technique of collaborative filtering is especially successful in generating personalized recommendations. More than a decade of research has resulted in numerous algorithms, although no comparison of the different strategies has been made. In fact, a universally accepted way of evaluating a collaborative filtering algorithm does not exist yet. In this work, we compare different techniques found in the literature, and we study the characteristics of each one, highlighting their principal strengths and weaknesses. Several experiments have been performed, using the most popular metrics and algorithms. Moreover, two new metrics designed to measure the precision on good items have been proposed.
   The results have revealed the weaknesses of many algorithms in extracting information from user profiles especially under sparsity conditions. We have also confirmed the good results of SVD-based techniques already reported by other authors. As an alternative, we present a new approach based on the interpretation of the tendencies or differences between users and items. Despite its extraordinary simplicity, in our experiments, it obtained noticeably better results than more complex algorithms. In fact, in the cases analyzed, its results are at least equivalent to those of the best approaches studied. Under sparsity conditions, there is more than a 20% improvement in accuracy over the traditional user-based algorithms, while maintaining over 90% coverage. Moreover, it is much more efficient computationally than any other algorithm, making it especially adequate for large amounts of data.
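The tendency-based approach described above can be illustrated with a simple bias baseline, sketched below under the assumption that "tendencies" are the average deviations of each user and item from the global mean rating; this is an illustration of the general idea, not the authors' exact algorithm:

```python
# Sketch of a tendency/bias baseline for rating prediction.
# Assumption: "tendencies" are modeled as user and item deviations
# from the global mean rating (illustrative, not the paper's method).
from collections import defaultdict

def fit_biases(ratings):
    """ratings: list of (user, item, value) triples."""
    mu = sum(v for _, _, v in ratings) / len(ratings)
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for u, i, v in ratings:
        user_sum[u] += v - mu; user_cnt[u] += 1
        item_sum[i] += v - mu; item_cnt[i] += 1
    user_bias = {u: user_sum[u] / user_cnt[u] for u in user_sum}
    item_bias = {i: item_sum[i] / item_cnt[i] for i in item_sum}
    return mu, user_bias, item_bias

def predict(mu, user_bias, item_bias, u, i):
    # Unseen users/items fall back to a zero deviation, which keeps
    # coverage high even under sparsity.
    return mu + user_bias.get(u, 0.0) + item_bias.get(i, 0.0)
```

The one-pass fit and constant-time prediction reflect why such a simple scheme scales so much better than neighborhood-based algorithms.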
Using external aggregate ratings for improving individual recommendations BIBAFull-Text 3
  Akhmed Umyarov; Alexander Tuzhilin
This article describes an approach for incorporating externally specified aggregate ratings information into certain types of recommender systems, including two types of collaborative filtering and a hierarchical linear regression model. First, we present a framework for incorporating aggregate rating information and apply this framework to the aforementioned individual rating models. Then we formally show that this additional aggregate rating information provides more accurate recommendations of individual items to individual users. Further, we experimentally confirm this theoretical finding by demonstrating on several datasets that the aggregate rating information indeed leads to better predictions of unknown ratings. We also propose scalable methods for incorporating this aggregate information and test our approaches on large datasets. Finally, we demonstrate that the aggregate rating information can also be used as a solution to the cold start problem of recommender systems.
Automatic tag recommendation algorithms for social recommender systems BIBAFull-Text 4
  Yang Song; Lu Zhang; C. Lee Giles
The emergence of Web 2.0 and the consequent success of social network Web sites such as Del.icio.us and Flickr introduce us to a new concept called social bookmarking, or tagging. Tagging is the action of connecting a relevant user-defined keyword to a document, image, or video, which helps users better organize and share their collections of interesting content. With the rapid growth of Web 2.0, tagged data is becoming more and more abundant on social network Web sites. An interesting problem is how to automate the process of making tag recommendations to users when a new resource becomes available.
   In this article, we address the issue of tag recommendation from a machine learning perspective. From our empirical observation of two large-scale datasets, we first argue that the user-centered approach for tag recommendation is not very effective in practice. Consequently, we propose two novel document-centered approaches that are capable of making effective and efficient tag recommendations in real scenarios. The first, graph-based, method represents the tagged data in two bipartite graphs, (document, tag) and (document, word), then finds document topics by leveraging graph partitioning algorithms. The second, prototype-based, method aims at finding the most representative documents within the data collections and advocates a sparse multiclass Gaussian process classifier for efficient document classification. For both methods, tags are ranked within each topic cluster/class by a novel ranking method. Recommendations are performed by first classifying a new document into one or more topic clusters/classes, and then selecting the most relevant tags from those clusters/classes as machine-recommended tags.
   Experiments on real-world data from Del.icio.us, CiteULike, and BibSonomy examine the quality of tag recommendation as well as the efficiency of our recommendation algorithms. The results suggest that our document-centered models can substantially improve the performance of tag recommendations when compared to the user-centered methods, as well as to LDA topic models and SVM classifiers.
Recommending friends and locations based on individual location history BIBAFull-Text 5
  Yu Zheng; Lizhu Zhang; Zhengxin Ma; Xing Xie; Wei-Ying Ma
The increasing availability of location-acquisition technologies (GPS, GSM networks, etc.) enables people to log their location histories with spatio-temporal data. Such real-world location histories imply, to some extent, users' interests in places, and bring us opportunities to understand the correlation between users and locations. In this article, we move in this direction and report on a personalized friend and location recommender for geographical information systems (GIS) on the Web. First, in this recommender system, a particular individual's visits to a geospatial region in the real world are used as their implicit ratings on that region. Second, we measure the similarity between users in terms of their location histories and recommend to each user a group of potential friends in a GIS community. Third, we estimate an individual's interests in a set of unvisited regions by involving his/her location history and those of other users. Some unvisited locations that might match their tastes can be recommended to the individual. A framework, referred to as hierarchical-graph-based similarity measurement (HGSM), is proposed to uniformly model each individual's location history and effectively measure the similarity among users. In this framework, we take into account three factors: 1) the sequence property of people's outdoor movements, 2) the visited popularity of a geospatial region, and 3) the hierarchical property of geographic spaces. Further, we incorporated a content-based method into a user-based collaborative filtering algorithm, which uses HGSM as the user similarity measure, to estimate the rating of a user on an item. We evaluated this recommender system on GPS data collected by 75 subjects over a period of 1 year in the real world. As a result, HGSM outperforms related similarity measures, namely the similarity-by-count, cosine similarity, and Pearson similarity measures.
   Moreover, beyond the item-based CF method and random recommendations, our system provides users with more attractive locations and a better recommendation experience.
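The cosine similarity baseline that the evaluation compares HGSM against can be sketched as follows, treating each user's location history as a vector of per-region visit counts (the region-count representation is an illustrative assumption; HGSM itself additionally models sequence and hierarchy):

```python
# Cosine similarity between two users' region-visit histories.
# visits_*: dict mapping region id -> visit count for one user.
import math

def cosine_similarity(visits_a, visits_b):
    common = set(visits_a) & set(visits_b)
    dot = sum(visits_a[r] * visits_b[r] for r in common)
    norm_a = math.sqrt(sum(c * c for c in visits_a.values()))
    norm_b = math.sqrt(sum(c * c for c in visits_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Such flat measures ignore the order in which regions were visited, which is one of the factors HGSM adds on top of this baseline.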

TWEB 2011-05 Volume 5 Issue 2

Topic Distillation with Query-Dependent Link Connections and Page Characteristics BIBAFull-Text 6
  Mingfang Wu; Falk Scholer; Andrew Turpin
Searchers on the Web often aim to find key resources about a topic. Finding such results is called topic distillation. Previous research has shown that the use of sources of evidence such as page indegree and URL structure can significantly improve search performance on interconnected collections such as the Web, beyond the use of simple term distribution statistics. This article presents a new approach to improve topic distillation by exploring the use of external sources of evidence: link structure, including query dependent indegree and outdegree; and web page characteristics, such as the density of anchor links.
   Our experiments with the TREC .GOV collection, an 18GB crawl of the US .gov domain from 2002, show that using such evidence can significantly improve search effectiveness, with combinations of evidence leading to significant performance gains over both full-text and anchor-text baselines. Moreover, we demonstrate that, at a different scope level, both local query-dependent outdegree and query-dependent indegree outperformed their global query-independent counterparts; and at the same scope level, outdegree outperformed indegree. Adding query-dependent indegree or page characteristics to query-dependent outdegree yielded a small, but not significant, improvement.
Host-Based P2P Flow Identification and Use in Real-Time BIBAFull-Text 7
  John Hurley; Emi Garcia-Palacios; Sakir Sezer
Data identification and classification is a key task for any Internet Service Provider (ISP) or network administrator. As port fluctuation and encryption become more common in P2P applications wishing to avoid identification, new strategies must be developed to detect and classify their flows. This article introduces a method of separating P2P and standard web traffic that can be applied as part of an offline data analysis process, based on the activity of the hosts on the network. Heuristics are analyzed and a classification system proposed that focuses on classifying those "long" flows that transfer most of the bytes across a network. The accuracy of the system is then tested using real network traffic from a core Internet router showing misclassification rates as low as 0.54% of flows in some cases. We expand on this proposed strategy to investigate its relevance to real-time, early classification problems. New proposals are made and the results of real-time experiments are compared to those obtained in the offline analysis. It is shown that classification accuracies in the real-time strategy are similar to those achieved in offline analysis with a large portion of the total web and P2P flows correctly identified.
Characterizing Web-Based Video Sharing Workloads BIBAFull-Text 8
  Siddharth Mitra; Mayank Agrawal; Amit Yadav; Niklas Carlsson; Derek Eager; Anirban Mahanti
Video sharing services that allow ordinary Web users to upload video clips of their choice and watch video clips uploaded by others have recently become very popular. This article identifies invariants in video sharing workloads, through comparison of the workload characteristics of four popular video sharing services. Our traces contain metadata on approximately 1.8 million videos which together have been viewed approximately 6 billion times. Using these traces, we study the similarities and differences in use of several Web 2.0 features such as ratings, comments, and favorites, as well as users' propensity to upload content. In general, we find that active contribution, such as video uploading and rating of videos, is much less prevalent than passive use. While the number of videos per uploader is in general highly skewed, the fraction of multi-time uploaders is found to differ by a factor of two between two of the sites. The distributions of lifetime measures of video popularity are found to have heavy-tailed forms that are similar across the four sites. Finally, we consider implications for system design of the identified invariants. To gain further insight into caching in video sharing systems, and the relevance to caching of lifetime popularity measures, we gathered an additional dataset tracking views to a set of approximately 1.3 million videos from one of the services, over a twelve-week period. We find that lifetime popularity measures have some relevance for large cache (hot set) sizes (i.e., a hot set defined according to one of these measures is indeed relatively "hot"), but that this relevance substantially decreases as cache size decreases, owing to churn in video popularity.
Cost-Aware Strategies for Query Result Caching in Web Search Engines BIBAFull-Text 9
  Rifat Ozcan; Ismail Sengor Altingovde; Özgür Ulusoy
Search engines and large-scale IR systems need to cache query results for efficiency and scalability purposes. Static and dynamic caching techniques (as well as their combinations) are employed to effectively cache query results. In this study, we propose cost-aware strategies for static and dynamic caching setups. Our research is motivated by two key observations: (i) query processing costs may significantly vary among different queries, and (ii) the processing cost of a query is not proportional to its popularity (i.e., frequency in the previous logs). The first observation implies that cache misses have different, that is, nonuniform, costs in this context. The latter observation implies that typical caching policies, solely based on query popularity, cannot always minimize the total cost. Therefore, we propose to explicitly incorporate the query costs into the caching policies. Simulation results using two large Web crawl datasets and a real query log reveal that the proposed approach improves overall system performance in terms of the average query execution time.
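The core idea of cost-aware caching can be sketched as an eviction policy that scores entries by frequency times processing cost, so that popular but cheap results do not crowd out expensive ones. The class and parameter names below are illustrative assumptions, not the paper's exact policy:

```python
# Minimal sketch of a cost-aware query result cache: the eviction
# victim is the entry with the lowest frequency * cost score.
# Illustrative only; the paper evaluates several richer policies.
class CostAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # query -> (result, cost, frequency)

    def get(self, query):
        if query in self.entries:
            result, cost, freq = self.entries[query]
            self.entries[query] = (result, cost, freq + 1)
            return result
        return None  # cache miss

    def put(self, query, result, cost):
        if query not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry whose recomputation is cheapest overall.
            victim = min(self.entries,
                         key=lambda q: self.entries[q][2] * self.entries[q][1])
            del self.entries[victim]
        self.entries[query] = (result, cost, 1)
```

Under a pure popularity policy, a frequent one-term query and a rare multi-term conjunctive query are treated alike; weighting by cost keeps the expensive result cached even when it is asked less often.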
A Survey of Requirements Specification in Model-Driven Development of Web Applications BIBAFull-Text 10
  Pedro Valderas; Vicente Pelechano
Model-driven development has become more and more important in the last few years. In the context of web application development, many Web engineering methods that propose model-driven development processes have appeared. However, the earlier stages of these processes are seldom considered, and few of these methods rigorously face the problems of specifying web application requirements and translating them into the proper conceptual model. Yet it is widely recognized that requirements engineering activities are essential to obtaining quality software products.
   This article surveys model-driven Web engineering methods in a comparative study and analyzes the techniques proposed for specifying functional, data, and navigational requirements, as well as the mechanisms provided for automatically translating these requirements into conceptual models. Our main goal is to provide a critical view of the support these methods offer for handling web application requirements, in order to show their current limitations and strengths.
Designing and Implementing the OP and OP2 Web Browsers BIBAFull-Text 11
  Chris Grier; Shuo Tang; Samuel T. King
Current web browsers are plagued with vulnerabilities, providing hackers with easy access to computer systems via browser-based attacks. Browser security efforts that retrofit existing browsers have had limited success because the design of modern browsers is fundamentally flawed. To enable more secure web browsing, we design and implement a new browser, called the OP web browser, that attempts to improve the state-of-the-art in browser security. We combine operating system design principles with formal methods to design a more secure web browser by drawing on the expertise of both communities. Our design philosophy is to partition the browser into smaller subsystems and make all communication between subsystems simple and explicit. At the core of our design is a small browser kernel that manages the browser subsystems and interposes on all communications between them to enforce our new browser security features.
   To show the utility of our browser architecture, we design and implement three novel security features. First, we develop flexible security policies that allow us to include browser plugins within our security framework. Second, we use formal methods to prove useful security properties including user interface invariants and browser security policy. Third, we design and implement a browser-level information-flow tracking system to enable post-mortem analysis of browser-based attacks.
   In addition to presenting the OP browser architecture, we discuss the design and implementation of a second version of OP, OP2, that includes features from other secure web browser designs to improve on the overall security and performance of OP. To evaluate our design, we implemented OP2 and measured the performance, memory, and filesystem impact of browsing popular pages. We show that the additional security features in OP and OP2 introduce minimal overhead.

TWEB 2011-07 Volume 5 Issue 3

A Clustering-Driven LDAP Framework BIBAFull-Text 12
  Vassiliki Koutsonikola; Athena Vakali
LDAP directories have proliferated as the appropriate storage framework for various and heterogeneous data sources, operating under a wide range of applications and services. Due to the increased amount and heterogeneity of LDAP data, there is a need for appropriate data organization schemes. The LPAIR & LMERGE (LP-LM) algorithm, presented in this article, is a hierarchical agglomerative structure-based clustering algorithm which can be used for LDAP directory information tree definition. A thorough study of the algorithm's performance is provided, demonstrating its efficiency. Moreover, the Relative Link is proposed as an alternative merging criterion, since, as indicated by the experiments, it can result in more balanced clusters. Finally, the LP and LM Query Engine is presented, which, by exploiting the clustering-based LDAP data organization, enhances the LDAP server's performance.
ACConv -- An Access Control Model for Conversational Web Services BIBAFull-Text 13
  Federica Paci; Massimo Mecella; Mourad Ouzzani; Elisa Bertino
With organizations increasingly depending on Web services to build complex applications, security and privacy concerns including the protection of access control policies are becoming a serious issue. Ideally, service providers would like to make sure that clients have knowledge of only portions of the access control policy relevant to their interactions to the extent to which they are entrusted by the Web service and without restricting the client's choices in terms of which operations to execute. We propose ACConv, a novel model for access control in Web services that is suitable when interactions between the client and the Web service are conversational and long-running. The conversation-based access control model proposed in this article allows service providers to limit how much knowledge clients have about the credentials specified in their access policies. This is achieved while reducing the number of times credentials are asked from clients and minimizing the risk that clients drop out of a conversation with the Web service before reaching a final state due to the lack of necessary credentials. Clients are requested to provide credentials, and hence are entrusted with part of the Web service access control policies, only for some specific granted conversations which are decided based on: (1) a level of trust that the Web service provider has vis-à-vis the client, (2) the operation that the client is about to invoke, and (3) meaningful conversations which represent conversations that lead to a final state from the current one. We have implemented the proposed approach in a software prototype and conducted extensive experiments to show its effectiveness.
On Computing Deltas of RDF/S Knowledge Bases BIBAFull-Text 14
  Dimitris Zeginis; Yannis Tzitzikas; Vassilis Christophides
The ability to compute the differences that exist between two RDF/S Knowledge Bases (KB) is an important step to cope with the evolving nature of the Semantic Web (SW). In particular, RDF/S deltas can be employed to reduce the amount of data that need to be exchanged and managed over the network in order to build SW synchronization and versioning services. By considering deltas as sets of change operations, in this article we introduce various RDF/S differential functions which take into account inferred knowledge from an RDF/S knowledge base. We first study their correctness in transforming a source to a target RDF/S knowledge base in conjunction with the semantics of the employed change operations (i.e., with or without side-effects on inferred knowledge). Then we formally analyze desired properties of RDF/S deltas such as size minimality, semantic identity, redundancy elimination, reversibility, and composability, as well as identify those RDF/S differential functions that satisfy them. Subsequently, we experimentally evaluate the computing time and size of the produced deltas over real and synthetic RDF/S knowledge bases.
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification BIBAFull-Text 15
  Eda Baykan; Monika Henzinger; Ludmila Marian; Ingmar Weber
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed to over 90 while maintaining a typical level of recall between 30 and 40.
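The kind of feature generation such URL-only classifiers depend on can be sketched as splitting the URL on punctuation and expanding the resulting tokens into character n-grams, which helps recover words fused together in hostnames and paths. The function names and the choice of 4-grams here are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch of URL-based feature generation: tokenize on non-alphanumeric
# characters, then expand tokens into character n-grams.
import re

def url_tokens(url):
    """Split a URL into lowercase word tokens on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def char_ngrams(token, n=4):
    """All character n-grams of a token (the token itself if shorter)."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def url_features(url, n=4):
    feats = []
    for tok in url_tokens(url):
        feats.extend(char_ngrams(tok, n))
    return feats
```

For example, n-grams let a classifier connect the fused token "sportsnews" in a hostname to the topic cues "spor" and "news" even though no dictionary word appears intact.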
Building Mashups by Demonstration BIBAFull-Text 16
  Rattapoom Tuchinda; Craig A. Knoblock; Pedro Szekely
The latest generation of WWW tools and services enables Web users to generate applications that combine content from multiple sources. This type of Web application is referred to as a mashup. Many of the tools for constructing mashups rely on a widget paradigm, where users must select, customize, and connect widgets to build the desired application. While this approach does not require programming, the users must still understand programming concepts to successfully create a mashup. As a result, they are put off by the time, effort, and expertise needed to build a mashup. In this article, we describe our programming-by-demonstration approach to building mashups by example. Instead of requiring a user to select and customize a set of widgets, the user simply demonstrates the integration task by example. Our approach addresses the problems of extracting data from Web sources, cleaning and modeling the extracted data, and integrating the data across sources. We implemented these ideas in a system called Karma, and evaluated Karma on a set of 23 users. The results show that, compared to other mashup construction tools, Karma allows more of the users to successfully build mashups, and makes it possible to build these mashups significantly faster than with a widget-based approach.

TWEB 2011-10 Volume 5 Issue 4

A Practical Architecture for an Anycast CDN BIBAFull-Text 17
  Hussein A. Alzoubi; Seungjoon Lee; Michael Rabinovich; Oliver Spatscheck; Jacobus Van Der Merwe
IP Anycast has many attractive features for any service that involves the replication of multiple instances across the Internet. IP Anycast allows multiple instances of the same service to be "naturally" discovered, and requests for this service to be delivered to the closest instance. However, while briefly considered as an enabler for content delivery networks (CDNs) when they first emerged, IP Anycast was deemed infeasible in that environment. The main reasons for this decision were the lack of load awareness of IP Anycast and unwanted side effects of Internet routing changes on the IP Anycast mechanism.
   In this article we re-evaluate IP Anycast for CDNs by proposing a load-aware IP Anycast CDN architecture. Our architecture is prompted by recent developments in route control technology, as well as better understanding of the behavior of IP Anycast in operational settings. Our architecture makes use of route control mechanisms to take server and network load into account to realize load-aware Anycast. We show that the resulting redirection requirements can be formulated as a Generalized Assignment Problem and present practical algorithms that address these requirements while at the same time limiting connection disruptions that plague regular IP Anycast. We evaluate our algorithms through trace-based simulation using traces obtained from a production CDN network.
Efficient Search Engine Measurements BIBAFull-Text 18
  Ziv Bar-Yossef; Maxim Gurevich
We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so-called "document degrees". In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.
   We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.
   By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.
Characterizing Organizational Use of Web-Based Services: Methodology, Challenges, Observations, and Insights BIBAFull-Text 19
  Phillipa Gill; Martin Arlitt; Niklas Carlsson; Anirban Mahanti; Carey Williamson
Today's Web provides many different functionalities, including communication, entertainment, social networking, and information retrieval. In this article, we analyze traces of HTTP activity from a large enterprise and from a large university to identify and characterize Web-based service usage. Our work provides an initial methodology for the analysis of Web-based services. While it is nontrivial to identify the classes, instances, and providers for each transaction, our results show that most of the traffic comes from a small subset of providers, which can be classified manually. Furthermore, we assess both qualitatively and quantitatively how the Web has evolved over the past decade, and discuss the implications of these changes.
Camera Brand Congruence and Camera Model Propagation in the Flickr Social Graph BIBAFull-Text 20
  Adish Singla; Ingmar Weber
Given that my friends on Flickr use cameras of brand X, am I more likely to also use a camera of brand X? Given that one of these friends changes her brand, am I likely to do the same? Do new camera models pop up uniformly in the friendship graph? Or do early adopters then "convert" their friends? Which factors influence the conversion probability of a user? These are the kind of questions addressed in this work. Direct applications involve personalized advertising in social networks.
   For our study, we crawled a complete connected component of the Flickr friendship graph with a total of 67M edges and 3.9M users. 1.2M of these users had at least one public photograph with valid model metadata, which allowed us to assign camera brands and models to users and time slots. Similarly, we used, where provided in a user's profile, information about a user's geographic location and the groups joined on Flickr.
   Concerning brand congruence, our main findings are the following. First, a pair of friends on Flickr has a higher probability of being congruent, that is, using the same brand, compared to two random users (27% vs. 19%). Second, the degree of congruence goes up for pairs of friends (i) in the same country (29%), (ii) who both only have very few friends (30%), and (iii) with a very high cliqueness (38%). Third, given that a user changes her camera model between March-May 2007 and March-May 2008, high cliqueness friends are more likely than random users to do the same (54% vs. 48%). Fourth, users using high-end cameras are far more loyal to their brand than users using point-and-shoot cameras, with a probability of staying with the same brand of 60% vs. 33%, given that a new camera is bought. Fifth, these "expert" users' brand congruence reaches 66% for high cliqueness friends. All these differences are statistically significant at 1%.
   As for the propagation of new models in the friendship graph, we observe the following. First, the growth of connected components of users converted to a particular, new camera model differs distinctly from random growth. Second, the decline of dissemination of a particular model is close to random decline. This illustrates that users influence their friends to change to a particular new model, rather than from a particular old model. Third, having many converted friends increases the probability that the user converts as well. Here differences between friends from the same or from different countries are more pronounced for point-and-shoot than for digital single-lens reflex users. Fourth, there was again a distinct difference between arbitrary friends and high cliqueness friends in terms of prediction quality for conversion.
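The congruence statistic reported above is, at its core, the fraction of user pairs whose assigned brands match, computed over friend pairs and over random pairs. A minimal sketch, with hypothetical data and function names of my own choosing:

```python
# Fraction of user pairs whose assigned camera brands match.
# pairs: iterable of (user_a, user_b); brand_of: dict user -> brand.
# Pairs with a user lacking brand metadata are skipped, mirroring the
# study's restriction to users with valid model metadata.
def congruence(pairs, brand_of):
    matched = total = 0
    for a, b in pairs:
        if a in brand_of and b in brand_of:
            total += 1
            matched += brand_of[a] == brand_of[b]
    return matched / total if total else 0.0
```

Comparing this quantity over friendship edges against randomly sampled user pairs yields the 27% vs. 19% contrast reported in the abstract.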
A Specialized Search Assistant for Learning Objects BIBAFull-Text 21
  Cecilia Curlango-Rosas; Gregorio A. Ponce; Gabriel A. Lopez-Morteo
The Web holds a great quantity of material that can be used to enhance classroom instruction. However, it is not easy to retrieve this material with the search engines currently available. This study produced a specialized search assistant based on Google that significantly increases the number of instances in which teachers find the desired learning objects as compared to using this popular public search engine directly. Success in finding learning objects by study participants went from 80% using Google alone to 96% when using our search assistant in one scenario and, in another scenario, from a 40% success rate with Google alone to 66% with our assistant. This specialized search assistant implements features such as bilingual search and term suggestion which were requested by teacher participants to help improve their searches. Study participants evaluated the specialized search assistant and found it significantly easier to use and more useful than the popular search engine for the purpose of finding learning objects.