HCI Bibliography Home | HCI Journals | About TWEB | Journal Info | TWEB Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TWEB Tables of Contents: 0102030405060708

ACM Transactions on The Web 8

Editors:Marc Najork
Standard No:ISSN:1559-1131 EISSN:1559-114X
Links:Journal Home Page | ACM Digital Library | Table of Contents
  1. TWEB 2013-12 Volume 8 Issue 1
  2. TWEB 2014-03 Volume 8 Issue 2
  3. TWEB 2014-06 Volume 8 Issue 3
  4. TWEB 2014-10 Volume 8 Issue 4

TWEB 2013-12 Volume 8 Issue 1

UsageQoS: Estimating the QoS of Web Services through Online User Communities BIBAFull-Text 1
  Xiaodi Huang
Services are an indispensable component in cloud computing. Web services are particularly important. As an increasing number of Web services provides equivalent functions, one common issue faced by users is the selection of the most appropriate one based on quality. This article presents a conceptual framework that characterizes the quality of Web services, an algorithm that quantifies them, and a system architecture that ranks Web services by using the proposed algorithm. In particular, the algorithm, called UsageQoS that computes the scores of quality of service (QoS) of Web services within a community, makes use of the usage frequencies of Web services. The frequencies are defined as the numbers of times invoked by other services in a given time period. The UsageQoS algorithm is able to optionally take user ratings as its initial input. The proposed approach has been validated by extensively experimenting on several datasets, including two real datasets. The results of the experiments have demonstrated that our approach is capable of estimating QoS parameters of Web services, regardless of whether user ratings are available or not.
Form-Based Web Service Composition for Domain Experts BIBAFull-Text 2
  Ingo Weber; Hye-Young Paik; Boualem Benatallah
In many cases, it is not cost effective to automate business processes which affect a small number of people and/or change frequently. We present a novel approach for enabling domain experts to model and deploy such processes from their respective domain as Web service compositions. The approach builds on user-editable service, naming and representing Web services as forms. On this basis, the approach provides a visual composition language with a targeted restriction of control-flow expressivity, process simulation, automated process verification mechanisms, and code generation for executing orchestrations. A Web-based service composition prototype implements this approach, including a WS-BPEL code generator. A small lab user study with 14 participants showed promising results for the usability of the system, even for nontechnical domain experts.
Second Chance: A Hybrid Approach for Dynamic Result Caching and Prefetching in Search Engines BIBAFull-Text 3
  Rifat Ozcan; Ismail Sengor Altingovde; B. Barla Cambazoglu; Özgür Ulusoy
Web search engines are known to cache the results of previously issued queries. The stored results typically contain the document summaries and some data that is used to construct the final search result page returned to the user. An alternative strategy is to store in the cache only the result document IDs, which take much less space, allowing results of more queries to be cached. These two strategies lead to an interesting trade-off between the hit rate and the average query response latency. In this work, in order to exploit this trade-off, we propose a hybrid result caching strategy where a dynamic result cache is split into two sections: an HTML cache and a docID cache. Moreover, using a realistic cost model, we evaluate the performance of different result prefetching strategies for the proposed hybrid cache and the baseline HTML-only cache. Finally, we propose a machine learning approach to predict singleton queries, which occur only once in the query stream. We show that when the proposed hybrid result caching strategy is coupled with the singleton query predictor, the hit rate is further improved.
Efficient Time-Stamped Event Sequence Anonymization BIBAFull-Text 4
  Reza Sherkat; Jing Li; Nikos Mamoulis
With the rapid growth of applications which generate timestamped sequences (click streams, GPS trajectories, RFID sequences), sequence anonymization has become an important problem, in that should such data be published or shared. Existing trajectory anonymization techniques disregard the importance of time or the sensitivity of events. This article is the first, to our knowledge, thorough study on time-stamped event sequence anonymization. We propose a novel and tunable generalization framework tailored to event sequences. We generalize time stamps using time intervals and events using a taxonomy which models the domain semantics. We consider two scenarios: (i) sharing the data with a single receiver (the SSR setting), where the receiver's background knowledge is confined to a set of time stamps and time generalization suffices, and (ii) sharing the data with colluding receivers (the SCR setting), where time generalization should be combined with event generalization. For both cases, we propose appropriate anonymization methods that prevent both user identification and event prediction. To achieve computational efficiency and scalability, we propose optimization techniques for both cases using a utility-based index, compact summaries, fast to compute bounds for utility, and a novel taxonomy-aware distance function. Extensive experiments confirm the effectiveness of our approach compared with state of the art, in terms of information loss, range query distortion, and preserving temporal causality patterns. Furthermore, our experiments demonstrate efficiency and scalability on large-scale real and synthetic datasets.
Control-Flow Patterns for Decentralized RESTful Service Composition BIBAFull-Text 5
  Jesus Bellido; Rosa Alarcón; Cesare Pautasso
The REST architectural style has attracted a lot of interest from industry due to the nonfunctional properties it contributes to Web-based solutions. SOAP/WSDL-based services, on the other hand, provide tools and methodologies that allow the design and development of software supporting complex service arrangements, enabling complex business processes which make use of well-known control-flow patterns. It is not clear if and how such patterns should be modeled, considering RESTful Web services that comply with the statelessness, uniform interface and hypermedia constraints. In this article, we analyze a set of fundamental control-flow patterns in the context of stateless compositions of RESTful services. We propose a means of enabling their implementation using the HTTP protocol and discuss the impact of our design choices according to key REST architectural principles. We hope to shed new light on the design of basic building blocks for RESTful business processes.
Analyzing, Detecting, and Exploiting Sentiment in Web Queries BIBAFull-Text 6
  Sergiu Chelaru; Ismail Sengor Altingovde; Stefan Siersdorfer; Wolfgang Nejdl
The Web contains an increasing amount of biased and opinionated documents on politics, products, and polarizing events. In this article, we present an indepth analysis of Web search queries for controversial topics, focusing on query sentiment. To this end, we conduct extensive user assessments and discriminative term analyses, as well as a sentiment analysis using the SentiWordNet thesaurus, a lexical resource containing sentiment annotations. Furthermore, in order to detect the sentiment expressed in queries, we build different classifiers based on query texts, query result titles, and snippets. We demonstrate the virtue of query sentiment detection in two different use cases. First, we define a query recommendation scenario that employs sentiment detection of results to recommend additional queries for polarized queries issued by search engine users. The second application scenario is controversial topic discovery, where query sentiment classifiers are employed to discover previously unknown topics that trigger both highly positive and negative opinions among the users of a search engine. For both use cases, the results of our evaluations on real-world data are promising and show the viability and potential of query sentiment analysis in practical scenarios.

TWEB 2014-03 Volume 8 Issue 2

Analysis of Search and Browsing Behavior of Young Users on the Web BIBAFull-Text 7
  Sergio Duarte Torres; Ingmar Weber; Djoerd Hiemstra
The Internet is increasingly used by young children for all kinds of purposes. Nonetheless, there are not many resources especially designed for children on the Internet and most of the content online is designed for grown-up users. This situation is problematic if we consider the large differences between young users and adults since their topic interests, computer skills, and language capabilities evolve rapidly during childhood. There is little research aimed at exploring and measuring the difficulties that children encounter on the Internet when searching for information and browsing for content. In the first part of this work, we employed query logs from a commercial search engine to quantify the difficulties children of different ages encounter on the Internet and to characterize the topics that they search for. We employed query metrics (e.g., the fraction of queries posed in natural language), session metrics (e.g., the fraction of abandoned sessions), and click activity (e.g., the fraction of ad clicks). The search logs were also used to retrace stages of child development. Concretely, we looked for changes in interests (e.g., the distribution of topics searched) and language development (e.g., the readability of the content accessed and the vocabulary size).
   In the second part of this work, we employed toolbar logs from a commercial search engine to characterize the browsing behavior of young users, particularly to understand the activities on the Internet that trigger search. We quantified the proportion of browsing and search activity in the toolbar sessions and we estimated the likelihood of a user to carry out search on the Web vertical and multimedia verticals (i.e., videos and images) given that the previous event is another search event or a browsing event.
   We observed that these metrics clearly demonstrate an increased level of confusion and unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and characteristics of the users such as age and educational attainment.
   In terms of browsing behavior, children were found to start their activities on the Internet with a search engine (instead of directly browsing content) more often than adults. We also observed a significantly larger amount of browsing activity for the case of teenager users. Interestingly we also found that if children visit knowledge-related Web sites (i.e., information-dense pages such as Wikipedia articles), they subsequently do more Web searches than adults. Additionally, children and especially teenagers were found to have a greater tendency to engage in multimedia search, which calls to improve the aggregation of multimedia results into the current search result pages.
How to Improve Your Search Engine Ranking: Myths and Reality BIBAFull-Text 8
  Ao-Jan Su; Y. Charlie Hu; Aleksandar Kuzmanovic; Cheng-Kok Koh
Search engines have greatly influenced the way people access information on the Internet, as such engines provide the preferred entry point to billions of pages on the Web. Therefore, highly ranked Web pages generally have higher visibility to people and pushing the ranking higher has become the top priority for Web masters. As a matter of fact, Search Engine Optimization (SEO) has became a sizeable business that attempts to improve their clients' ranking. Still, the lack of ways to validate SEO's methods has created numerous myths and fallacies associated with ranking algorithms.
   In this article, we focus on two ranking algorithms, Google's and Bing's, and design, implement, and evaluate a ranking system to systematically validate assumptions others have made about these popular ranking algorithms. We demonstrate that linear learning models, coupled with a recursive partitioning ranking scheme, are capable of predicting ranking results with high accuracy. As an example, we manage to correctly predict 7 out of the top 10 pages for 78% of evaluated keywords. Moreover, for content-only ranking, our system can correctly predict 9 or more pages out of the top 10 ones for 77% of search terms. We show how our ranking system can be used to reveal the relative importance of ranking features in a search engine's ranking function, provide guidelines for SEOs and Web masters to optimize their Web pages, validate or disprove new ranking features, and evaluate search engine ranking results for possible ranking bias.
Leveraging Social Feedback to Verify Online Identity Claims BIBAFull-Text 9
  Michael Sirivianos; Kyungbaek Kim; Jian Wei Gan; Xiaowei Yang
Anonymity is one of the main virtues of the Internet, as it protects privacy and enables users to express opinions more freely. However, anonymity hinders the assessment of the veracity of assertions that online users make about their identity attributes, such as age or profession. We propose FaceTrust, a system that uses online social networks to provide lightweight identity credentials while preserving a user's anonymity. FaceTrust employs a "game with a purpose" design to elicit the opinions of the friends of a user about the user's self-claimed identity attributes, and uses attack-resistant trust inference to assign veracity scores to identity attribute assertions. FaceTrust provides credentials, which a user can use to corroborate his assertions. We evaluate our proposal using a live Facebook deployment and simulations on a crawled social graph. The results show that our veracity scores are strongly correlated with the ground truth, even when dishonest users make up a large fraction of the social network and employ the Sybil attack.
Efficient Multiview Maintenance under Insertion in Huge Social Networks BIBAFull-Text 10
  Andrea Pugliese; Matthias Bröcheler; V. S. Subrahmanian; Michael Ovelgönne
Applications to monitor various aspects of social networks are becoming increasingly popular. For instance, marketers want to look for semantic patterns relating to the content of tweets and Facebook posts relating to their products. Law enforcement agencies want to track behaviors involving potential criminals on the Internet by looking for certain patterns of behavior. Music companies want to track patterns of spread of illegal music. These applications allow multiple users to specify patterns of interest and monitor them in real time as new data gets added to the Web or to a social network. In this article we develop the concept of social network view servers in which all of these types of applications can be simultaneously monitored. The patterns of interest are expressed as views over an underlying graph or social network database. We show that a given set of views can be compiled in multiple possible ways to take advantage of common substructures and define the concept of an optimal merge. Though finding an optimal merge is shown to be NP-hard, we develop the AddView to find very good merges quickly. We develop a very fast MultiView algorithm that scalably and efficiently maintains multiple subgraph views when insertions are made to the social network database. We show that our algorithm is correct, study its complexity, and experimentally demonstrate that our algorithm can scalably handle updates to hundreds of views on 6 real-world social network databases with up to 540M edges.
Textual and Content-Based Search in Repositories of Web Application Models BIBAFull-Text 11
  Bojana Bislimovska; Alessandro Bozzon; Marco Brambilla; Piero Fraternali
Model-driven engineering relies on collections of models, which are the primary artifacts for software development. To enable knowledge sharing and reuse, models need to be managed within repositories, where they can be retrieved upon users' queries. This article examines two different techniques for indexing and searching model repositories, with a focus on Web development projects encoded in a domain-specific language. Keyword-based and content-based search (also known as query-by-example) are contrasted with respect to the architecture of the system, the processing of models and queries, and the way in which metamodel knowledge can be exploited to improve search. A thorough experimental evaluation is conducted to examine what parameter configurations lead to better accuracy and to offer an insight in what queries are addressed best by each system.
Neighbor Selection and Weighting in User-Based Collaborative Filtering: A Performance Prediction Approach BIBAFull-Text 12
  Alejandro Bellogín; Pablo Castells; Iván Cantador
User-based collaborative filtering systems suggest interesting items to a user relying on similar-minded people called neighbors. The selection and weighting of these neighbors characterize the different recommendation approaches. While standard strategies perform a neighbor selection based on user similarities, trust-aware recommendation algorithms rely on other aspects indicative of user trust and reliability. In this article we restate the trust-aware recommendation problem, generalizing it in terms of performance prediction techniques, whose goal is to predict the performance of an information retrieval system in response to a particular query. We investigate how to adopt the preceding generalization to define a unified framework where we conduct an objective analysis of the effectiveness (predictive power) of neighbor scoring functions. The proposed framework enables discriminating whether recommendation performance improvements are caused by the used neighbor scoring functions or by the ways these functions are used in the recommendation computation. We evaluated our approach with several state-of-the-art and novel neighbor scoring functions on three publicly available datasets. By empirically comparing four neighbor quality metrics and thirteen performance predictors, we found strong predictive power for some of the predictors with respect to certain metrics. This result was then validated by checking the final performance of recommendation strategies where predictors are used for selecting and/or weighting user neighbors. As a result, we have found that, by measuring the predictive power of neighbor performance predictors, we are able to anticipate which predictors are going to perform better in neighbor-scoring-powered versions of a user-based collaborative filtering algorithm.

TWEB 2014-06 Volume 8 Issue 3

Foundations of Trust and Distrust in Networks: Extended Structural Balance Theory BIBAFull-Text 13
  Yi Qian; Sibel Adali
Modeling trust in very large social networks is a hard problem due to the highly noisy nature of these networks that span trust relationships from many different contexts, based on judgments of reliability, dependability, and competence. Furthermore, relationships in these networks vary in their level of strength. In this article, we introduce a novel extension of structural balance theory as a foundational theory of trust and distrust in networks. Our theory preserves the distinctions between trust and distrust as suggested in the literature, but also incorporates the notion of relationship strength that can be expressed as either discrete categorical values, as pairwise comparisons, or as metric distances. Our model is novel, has sound social and psychological basis, and captures the classical balance theory as a special case. We then propose a convergence model, describing how an imbalanced network evolves towards new balance, and formulate the convergence problem of a social network as a Metric Multidimensional Scaling (MDS) optimization problem. Finally, we show how the convergence model can be used to predict edge signs in social networks and justify our theory through extensive experiments on real datasets.
Conceptual Development of Custom, Domain-Specific Mashup Platforms BIBAFull-Text 14
  Stefano Soi; Florian Daniel; Fabio Casati
Despite the common claim by mashup platforms that they enable end-users to develop their own software, in practice end-users still don't develop their own mashups, as the highly technical or inexistent user bases of today's mashup platforms testify. The key shortcoming of current platforms is their general-purpose nature, that privileges expressive power over intuitiveness. In our prior work, we have demonstrated that a domain-specific mashup approach, which privileges intuitiveness over expressive power, has much more potential to enable end-user development (EUD). The problem is that developing mashup platforms -- domain-specific or not -- is complex and time consuming. In addition, domain-specific mashup platforms by their very nature target only a small user basis, that is, the experts of the target domain, which makes their development not sustainable if it is not adequately supported and automated.
   With this article, we aim to make the development of custom, domain-specific mashup platforms cost-effective. We describe a mashup tool development kit (MDK) that is able to automatically generate a mashup platform (comprising custom mashup and component description languages and design-time and runtime environments) from a conceptual design and to provision it as a service. We equip the kit with a dedicated development methodology and demonstrate the applicability and viability of the approach with the help of two case studies.
Propagating Both Trust and Distrust with Target Differentiation for Combating Link-Based Web Spam BIBAFull-Text 15
  Xianchao Zhang; You Wang; Nan Mou; Wenxin Liang
Semi-automatic anti-spam algorithms propagate either trust through links from a good seed set (e.g., TrustRank) or distrust through inverse links from a bad seed set (e.g., Anti-TrustRank) to the entire Web. These kinds of algorithms have shown their powers in combating link-based Web spam since they integrate both human judgement and machine intelligence. Nevertheless, there is still much space for improvement. One issue of most existing trust/distrust propagation algorithms is that only trust or distrust is propagated and only a good seed set or a bad seed set is used. According to Wu et al. [2006a], a combined usage of both trust and distrust propagation can lead to better results, and an effective framework is needed to realize this insight. Another more serious issue of existing algorithms is that trust or distrust is propagated in nondifferential ways, that is, a page propagates its trust or distrust score uniformly to its neighbors, without considering whether each neighbor should be trusted or distrusted. Such kinds of blind propagating schemes are inconsistent with the original intention of trust/distrust propagation. However, it seems impossible to implement differential propagation if only trust or distrust is propagated. In this article, we take the view that each Web page has both a trustworthy side and an untrustworthy side, and we thusly assign two scores to each Web page: T-Rank, scoring the trustworthiness of the page, and D-Rank, scoring the untrustworthiness of the page. We then propose an integrated framework that propagates both trust and distrust. In the framework, the propagation of T-Rank/D-Rank is penalized by the target's current D-Rank/T-Rank. In other words, the propagation of T-Rank/D-Rank is decided by the target's current (generalized) probability of being trustworthy/untrustworthy; thus a page propagates more trust/distrust to a trustworthy/untrustworthy neighbor than to an untrustworthy/trustworthy neighbor. In this way, propagating both trust and distrust with target differentiation is implemented. We use T-Rank scores to realize spam demotion and D-Rank scores to accomplish spam detection. The proposed Trust-DistrustRank (TDR) algorithm regresses to TrustRank and Anti-TrustRank when the penalty factor is set to 1 and 0, respectively. Thus TDR could be seen as a combinatorial generalization of both TrustRank and Anti-TrustRank. TDR not only makes full use of both trust and distrust propagation, but also overcomes the disadvantages of both TrustRank and Anti-TrustRank. Experimental results on benchmark datasets show that TDR outperforms other semi-automatic anti-spam algorithms for both spam demotion and spam detection tasks under various criteria.
Incremental Text Indexing for Fast Disk-Based Search BIBAFull-Text 16
  Giorgos Margaritis; Stergios V. Anastasiadis
Real-time search requires to incrementally ingest content updates and almost immediately make them searchable while serving search queries at low latency. This is currently feasible for datasets of moderate size by fully maintaining the index in the main memory of multiple machines. Instead, disk-based methods for incremental index maintenance substantially increase search latency with the index fragmented across multiple disk locations. For the support of fast search over disk-based storage, we take a fresh look at incremental text indexing in the context of current architectural features. We introduce a greedy method called Selective Range Flush (SRF) to contiguously organize the index over disk blocks and dynamically update it at low cost. We show that SRF requires substantial experimental effort to tune specific parameters for performance efficiency. Subsequently, we propose the Unified Range Flush (URF) method, which is conceptually simpler than SRF, achieves similar or better performance with fewer parameters and less tuning, and is amenable to I/O complexity analysis. We implement interesting variations of the two methods in the Proteus prototype search engine that we developed and do extensive experiments with three different Web datasets of size up to 1TB. Across different systems, we show that our methods offer search latency that matches or reduces up to half the lowest achieved by existing disk-based methods. In comparison to an existing method of comparable search latency on the same system, our methods reduce by a factor of 2.0-2.4 the I/O part of build time and by 21-24% the total build time.
Analyzing and Mining Comments and Comment Ratings on the Social Web BIBAFull-Text 17
  Stefan Siersdorfer; Sergiu Chelaru; Jose San Pedro; Ismail Sengor Altingovde; Wolfgang Nejdl
An analysis of the social video sharing platform YouTube and the news aggregator Yahoo! News reveals the presence of vast amounts of community feedback through comments for published videos and news stories, as well as through metaratings for these comments. This article presents an in-depth study of commenting and comment rating behavior on a sample of more than 10 million user comments on YouTube and Yahoo! News. In this study, comment ratings are considered first-class citizens. Their dependencies with textual content, thread structure of comments, and associated content (e.g., videos and their metadata) are analyzed to obtain a comprehensive understanding of the community commenting behavior. Furthermore, this article explores the applicability of machine learning and data mining to detect acceptance of comments by the community, comments likely to trigger discussions, controversial and polarizing content, and users exhibiting offensive commenting behavior. Results from this study have potential application in guiding the design of community-oriented online discussion platforms.
Ten Years of Rich Internet Applications: A Systematic Mapping Study, and Beyond BIBAFull-Text 18
  Sven Casteleyn; Irene Garrig'os; Jose-Norberto Maz'on
BACKGROUND. The term Rich Internet Applications (RIAs) is generally associated with Web applications that provide the features and functionality of traditional desktop applications. Ten years after the introduction of the term, an ample amount of research has been carried out to study various aspects of RIAs. It has thus become essential to summarize this research and provide an adequate overview.
   OBJECTIVE. The objective of our study is to assemble, classify, and analyze all RIA research performed in the scientific community, thus providing a consolidated overview thereof, and to identify well-established topics, trends, and open research issues. Additionally, we provide a qualitative discussion of the most interesting findings. This work therefore serves as a reference work for beginning and established RIA researchers alike, as well as for industrial actors that need an introduction in the field, or seek pointers to (a specific subset of) the state-of-the-art.
   METHOD. A systematic mapping study is performed in order to identify all RIA-related publications, define a classification scheme, and categorize, analyze, and discuss the identified research according to it.
   RESULTS. Our source identification phase resulted in 133 relevant, peer-reviewed publications, published between 2002 and 2011 in a wide variety of venues. They were subsequently classified according to four facets: development activity, research topic, contribution type, and research type. Pie, stacked bar, and bubble charts were used to depict and analyze the results. A deeper analysis is provided for the most interesting and/or remarkable results.
   CONCLUSION. Analysis of the results shows that, although the RIA term was coined in 2002, the first RIA-related research appeared in 2004. From 2007 there was a significant increase in research activity, peaking in 2009 and decreasing to pre-2009 levels afterwards. All development phases are covered in the identified research, with emphasis on "design" (33%) and "implementation" (29%). The majority of research proposes a "method" (44%), followed by "model" (22%), "methodology" (18%), and "tools" (16%); no publications in the category "metrics" were found. The preponderant research topic is "models, methods and methodologies" (23%) and, to a lesser extent, "usability and accessibility" and "user interface" (11% each). On the other hand, the topic "localization, internationalization and multilinguality" received no attention at all, and topics such as "deep Web" (under 1%), "business processing", "usage analysis", "data management", "quality and metrics" (all under 2%), "semantics", and "performance" (slightly above 2%) received very little attention. Finally, there is a large majority of "solution proposals" (66%), few "evaluation research" (14%), and even fewer "validation" (6%), although the latter have been increasing in recent years.
A Model-Based Approach for Crawling Rich Internet Applications BIBAFull-Text 19
  Mustafa Emre Dincturk; Guy-Vincent Jourdan; Gregor V. Bochmann; Iosif Viorel Onut
New Web technologies, like AJAX, result in more responsive and interactive Web applications, sometimes called Rich Internet Applications (RIAs). Crawling techniques developed for traditional Web applications are not sufficient for crawling RIAs. The inability to crawl RIAs is a problem that needs to be addressed for at least making RIAs searchable and testable. We present a new methodology, called "model-based crawling", that can be used as a basis to design efficient crawling strategies for RIAs. We illustrate model-based crawling with a sample strategy, called the "hypercube strategy". The performances of our model-based crawling strategies are compared against existing standard crawling strategies, including breadth-first, depth-first, and a greedy strategy. Experimental results show that our model-based crawling approach is significantly more efficient than these standard strategies.

TWEB 2014-10 Volume 8 Issue 4

Merging Query Results From Local Search Engines for Georeferenced Objects BIBAFull-Text 20
  Eduard C. Dragut; Bhaskar Dasgupta; Brian P. Beirne; Ali Neyestani; Badr Atassi; Clement Yu; Weiyi Meng
The emergence of numerous online sources about local services presents a need for more automatic yet accurate data integration techniques. Local services are georeferenced objects and can be queried by their locations on a map, for instance, neighborhoods. Typical local service queries (e.g., "French Restaurant in The Loop") include not only information about "what" ("French Restaurant") a user is searching for (such as cuisine) but also "where" information, such as neighborhood ("The Loop"). In this article, we address three key problems: query translation, result merging and ranking. Most local search engines provide a (hierarchical) organization of (large) cities into neighborhoods. A neighborhood in one local search engine may correspond to sets of neighborhoods in other local search engines. These make the query translation challenging. To provide an integrated access to the query results returned by the local search engines, we need to combine the results into a single list of results.
   Our contributions include: (1) An integration algorithm for neighborhoods. (2) A very effective business listing resolution algorithm. (3) A ranking algorithm that takes into consideration the user criteria, user ratings and rankings. We have created a prototype system, Yumi, over local search engines in the restaurant domain. The restaurant domain is a representative case study for the local services. We conducted a comprehensive experimental study to evaluate Yumi. A prototype version of Yumi is available online.
Constructing and Comparing User Mobility Profiles BIBAFull-Text 21
  Xihui Chen; Jun Pang; Ran Xue
Nowadays, the accumulation of people's whereabouts due to location-based applications has made it possible to construct their mobility profiles. This access to users' mobility profiles subsequently brings benefits back to location-based applications. For instance, in on-line social networks, friends can be recommended not only based on the similarity between their registered information, for instance, hobbies and professions but also referring to the similarity between their mobility profiles.
   In this article, we propose a new approach to construct and compare users' mobility profiles. First, we improve and apply frequent sequential pattern mining technologies to extract the sequences of places that a user frequently visits and use them to model his mobility profile. Second, we present a new method to calculate the similarity between two users using their mobility profiles. More specifically, we identify the weaknesses of a similarity metric in the literature, and propose a new one which not only fixes the weaknesses but also provides more precise and effective similarity estimation. Third, we consider the semantics of spatio-temporal information contained in user mobility profiles and add them into the calculation of user similarity. It enables us to measure users' similarity from different perspectives. Two specific types of semantics are explored in this article: location semantics and temporal semantics. Last, we validate our approach by applying it to two real-life datasets collected by Microsoft Research Asia and Yonsei University, respectively. The results show that our approach outperforms the existing works from several aspects.
Sentiment-Focused Web Crawling BIBAFull-Text 22
  A. Gural Vural; B. Barla Cambazoglu; Pinar Karagoz
Sentiments and opinions expressed in Web pages towards objects, entities, and products constitute an important portion of the textual content available in the Web. In the last decade, the analysis of such content has gained importance due to its high potential for monetization. Despite the vast interest in sentiment analysis, somewhat surprisingly, the discovery of sentimental or opinionated Web content is mostly ignored. This work aims to fill this gap and addresses the problem of quickly discovering and fetching the sentimental content present in the Web. To this end, we design a sentiment-focused Web crawling framework. In particular, we propose different sentiment-focused Web crawling strategies that prioritize discovered URLs based on their predicted sentiment scores. Through simulations, these strategies are shown to achieve considerable performance improvement over general-purpose Web crawling strategies in discovery of sentimental Web content.
EXIP: A Framework for Embedded Web Development BIBAFull-Text 23
  Rumen Kyusakov; Pablo Puñal Pereira; Jens Eliasson; Jerker Delsing
Developing and deploying Web applications on networked embedded devices is often seen as a way to reduce the development cost and time to market for new target platforms. However, the size of the messages and the processing requirements of today's Web protocols, such as HTTP and XML, are challenging for the most resource-constrained class of devices that could also benefit from Web connectivity.
   New Web protocols using binary representations have been proposed for addressing this issue. Constrained Application Protocol (CoAP) reduces the bandwidth and processing requirements compared to HTTP while preserving the core concepts of the Web architecture. Similarly, Efficient XML Interchange (EXI) format has been standardized for reducing the size and processing time for XML structured information. Nevertheless, the adoption of these technologies is lagging behind due to lack of support from Web browsers and current Web development toolkits.
   Motivated by these problems, this article presents the design and implementation techniques for the EXIP framework for embedded Web development. The framework consists of a highly efficient EXI processor, a tool for EXI data binding based on templates, and a CoAP/EXI/XHTML Web page engine. A prototype implementation of the EXI processor is herein presented and evaluated. It can be applied to Web browsers or thin server platforms using XHTML and Web services for supporting human-machine interactions in the Internet of Things.
   This article contains four major results: (1) theoretical and practical evaluation of the use of binary protocols for embedded Web programming; (2) a novel method for generation of EXI grammars based on XML Schema definitions; (3) an algorithm for grammar concatenation that produces normalized EXI grammars directly, and hence reduces the number of iterations during grammar generation; (4) an algorithm for efficient representation of possible deviations from the XML schema.
Using Interaction Data to Explain Difficulty Navigating Online BIBAFull-Text 24
  Paul Thomas
A user's behaviour when browsing a Web site contains clues to that user's experience. It is possible to record some of these behaviours automatically, and extract signals that indicate a user is having trouble finding information. This allows for Web site analytics based on user experiences, not just page impressions.
   A series of experiments identified user browsing behaviours -- such as time taken and amount of scrolling up a page -- which predict navigation difficulty and which can be recorded with minimal or no changes to existing sites or browsers. In turn, patterns of page views correlate with these signals and these patterns can help Web authors understand where and why their sites are hard to navigate. A new software tool, "LATTE," automates this analysis and makes it available to Web authors in the context of the site itself.
Content Bias in Online Health Search BIBAFull-Text 25
  Ryen W. White; Ahmed Hassan
Search engines help people answer consequential questions. Biases in retrieved and indexed content (e.g., skew toward erroneous outcomes that represent deviations from reality), coupled with searchers' biases in how they examine and interpret search results, can lead people to incorrect answers. In this article, we seek to better understand biases in search and retrieval, and in particular those affecting the accuracy of content in search results, including the search engine index, features used for ranking, and the formulation of search queries. Focusing on the important domain of online health search, this research broadens previous work on biases in search to examine the role of search systems in contributing to biases. To assess bias, we focus on questions about medical interventions and employ reliable ground truth data from authoritative medical sources. In the course of our study, we utilize large-scale log analysis using data from a popular Web search engine, deep probes of result lists on that search engine, and crowdsourced human judgments of search result captions and landing pages. Our findings reveal bias in results, amplifying searchers' existing biases that appear evident in their search activity. We also highlight significant bias in indexed content and show that specific ranking signals and specific query terms support bias. Both of these can degrade result accuracy and increase skewness in search results. Our analysis has implications for bias mitigation strategies in online search systems, and we offer recommendations for search providers based on our findings.