ACM Transactions on the Web, Volume 7

Editors: Marc Najork
Standard No: ISSN 1559-1131; EISSN 1559-114X
  1. TWEB 2013-03 Volume 7 Issue 1
  2. TWEB 2013-05 Volume 7 Issue 2
  3. TWEB 2013-09 Volume 7 Issue 3
  4. TWEB 2013-10 Volume 7 Issue 4

TWEB 2013-03 Volume 7 Issue 1

Measuring the Visual Complexities of Web Pages (Article 1)
  Ou Wu; Weiming Hu; Lei Shi
Visual complexities (VisComs) of Web pages significantly affect user experience, and automatic evaluation can facilitate a large number of Web-based applications. The construction of a model for measuring the VisComs of Web pages requires the extraction of typical features and learning based on labeled Web pages. However, as far as the authors are aware, little headway has been made on measuring VisCom in Web mining and machine learning. The present article provides a new approach combining Web mining techniques and machine learning algorithms for measuring the VisComs of Web pages. The structure of a Web page is first analyzed, and the layout is then extracted. Using a Web page as a semistructured image, three classes of features are extracted to construct a feature vector. The feature vector is fed into a learned measuring function to calculate the VisCom of the page.
   In the proposed approach, the type of the measuring function and its learning depend on the quantification strategy for VisCom. Aside from using a category and a score to represent VisCom, as existing work does, this study presents a new strategy that uses a distribution to quantify the VisCom of a Web page. Empirical evaluation suggests the effectiveness of the proposed approach in terms of both features and learning algorithms.
Progress on Website Accessibility? (Article 2)
  Vicki L. Hanson; John T. Richards
Over 100 top-traffic and government websites from the United States and United Kingdom were examined for evidence of changes on accessibility indicators over the 14-year period from 1999 to 2012, the longest period studied to date. Automated analyses of WCAG 2.0 Level A Success Criteria found high percentages of violations overall. Unlike more circumscribed studies, however, these sites exhibited improvements over the years on a number of accessibility indicators, with government sites being less likely than topsites to have accessibility violations. Examination of the causes of success and failure suggests that improving accessibility may be due, in part, to changes in website technologies and coding practices rather than a focus on accessibility per se.
A Comprehensive Study of Techniques for URL-Based Web Page Language Classification (Article 3)
  Eda Baykan; Monika Henzinger; Ingmar Weber
Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time.
   We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-the-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers.
   We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers.
   We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract, and in the second, downloading pages of the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
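The country-code baseline and the n-gram features mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table `CCTLD_TO_LANGUAGE`, the set `GENERIC_TLDS`, the fallback of generic domains to English, and both function names are assumptions made only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from country-code TLDs to the five languages studied.
CCTLD_TO_LANGUAGE = {
    "de": "German", "at": "German",
    "fr": "French",
    "es": "Spanish", "mx": "Spanish", "ar": "Spanish",
    "it": "Italian",
    "uk": "English", "us": "English", "au": "English",
}
# Assumption for this sketch: generic TLDs are treated as English.
GENERIC_TLDS = {"com", "org", "net", "edu", "gov"}


def classify_by_cctld(url):
    """Guess a page's language from the top-level domain of its URL."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_LANGUAGE:
        return CCTLD_TO_LANGUAGE[tld]
    if tld in GENERIC_TLDS:
        return "English"
    return None  # unknown TLD: the baseline makes no prediction


def char_ngrams(url, n=3):
    """Character n-grams of a lowercased URL, usable as classifier features."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

A learned classifier would consume features such as `char_ngrams(url)` in place of the fixed lookup table, which is what lets it handle URLs whose top-level domain carries no language signal.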
HTML Automatic Table Layout (Article 4)
  Kim Marriott; Peter Moulder; Nathan Hurst
Automatic layout of tables is required in online applications because of the need to tailor the layout to the viewport width, choice of font, and dynamic content. However, if the table contains text, minimizing the height of the table for a fixed maximum width is NP-hard. Thus, more efficient heuristic algorithms are required. We evaluate the HTML table layout recommendation and find that while it generally produces quite compact layouts, it is brittle and can lead to quite uncompact layouts. We present an alternate heuristic algorithm. It uses a greedy strategy that starts from the widest reasonable layout and repeatedly chooses to narrow the column for which narrowing leads to the least increase in table height. The algorithm is simple, fast enough to be used in online applications, and gives significantly more compact layouts than HTML's recommended table layout algorithm.
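The greedy strategy described in this abstract can be sketched in Python under strong simplifying assumptions: widths are measured in characters, text may wrap at any character, and a column is narrowed one character at a time. The function names `table_height` and `greedy_layout` are hypothetical, and the sketch is not the authors' algorithm.

```python
import math


def table_height(cells, widths):
    """Total height in lines: each row is as tall as its tallest cell."""
    total = 0
    for row in cells:
        total += max(math.ceil(len(text) / w) if text else 1
                     for text, w in zip(row, widths))
    return total


def greedy_layout(cells, max_width):
    """Narrow columns one unit at a time, always picking the cheapest step."""
    n_cols = len(cells[0])
    # Widest reasonable layout: every cell fits on a single line.
    widths = [max(1, max(len(row[c]) for row in cells)) for c in range(n_cols)]
    while sum(widths) > max_width:
        best_col, best_height = None, None
        for c in range(n_cols):
            if widths[c] <= 1:
                continue  # cannot narrow this column any further
            trial = widths[:c] + [widths[c] - 1] + widths[c + 1:]
            h = table_height(cells, trial)
            # Strict comparison: ties go to the leftmost column.
            if best_height is None or h < best_height:
                best_col, best_height = c, h
        if best_col is None:
            break  # every column is already at minimum width
        widths[best_col] -= 1
    return widths
```

For example, `greedy_layout([["aaaa", "bb"], ["cc", "dddddddd"]], 8)` starts from widths `[4, 8]` and narrows until the total width is 8. Each step is locally cheapest, so the result is a heuristic layout rather than a guaranteed minimum-height one.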

TWEB 2013-05 Volume 7 Issue 2

A test-based security certification scheme for web services (Article 5)
  Marco Anisetti; Claudio A. Ardagna; Ernesto Damiani; Francesco Saonara
The Service-Oriented Architecture (SOA) paradigm is giving rise to a new generation of applications built by dynamically composing loosely coupled autonomous services. Clients (i.e., software agents acting on behalf of human users or service providers) implementing such complex applications typically search and integrate services on the basis of their functional requirements and of their trust in the service suppliers. A major issue in this scenario relates to the definition of an assurance technique allowing clients to select services on the basis of their nonfunctional requirements and increasing their confidence that the selected services will satisfy such requirements. In this article, we first present an assurance solution that focuses on security and supports a test-based security certification scheme for Web services. The certification scheme is driven by the security properties to be certified and relies upon a formal definition of the service model. The evidence supporting a certified property is computed using a model-based testing approach that, starting from the service model, automatically generates the test cases to be used in the service certification. We also define a set of indexes and metrics that evaluate the assurance level and the quality of the certification process. Finally, we present our evaluation toolkit and experimental results obtained applying our certification solution to a financial service implementing the Interactive Financial eXchange (IFX) standard.
Enhancing the trust-based recommendation process with explicit distrust (Article 6)
  Patricia Victor; Nele Verbiest; Chris Cornelis; Martine De Cock
When a Web application with a built-in recommender offers a social networking component which enables its users to form a trust network, it can generate more personalized recommendations by combining user ratings with information from the trust network. These are the so-called trust-enhanced recommendation systems. While research on the incorporation of trust for recommendations is thriving, the potential of explicitly stated distrust remains almost unexplored. In this article, we introduce a distrust-enhanced recommendation algorithm which has its roots in Golbeck's trust-based weighted mean. Through experiments on a set of reviews from Epinions.com, we show that our new algorithm outperforms its standard trust-only counterpart with respect to accuracy, thereby demonstrating the positive effect that explicit distrust can have on trust-based recommendations.
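As a rough illustration of the trust-based weighted mean this abstract builds on, the Python sketch below predicts a rating as the trust-weighted average of the ratings given by trusted users. The second function is only one hypothetical way to use explicit distrust (dropping distrusted raters) and should not be read as the algorithm proposed in the article. All names are assumptions.

```python
def trust_weighted_mean(ratings, trust):
    """Predict a rating as the average of known ratings weighted by trust.

    ratings: {rater: rating of the target item}
    trust:   {rater: trust the target user places in that rater, in (0, 1]}
    """
    raters = [u for u in ratings if trust.get(u, 0.0) > 0.0]
    total_trust = sum(trust[u] for u in raters)
    if total_trust == 0:
        return None  # no trusted rater has rated the item
    return sum(trust[u] * ratings[u] for u in raters) / total_trust


def distrust_filtered_mean(ratings, trust, distrusted):
    """Hypothetical variant: drop explicitly distrusted raters first."""
    kept = {u: r for u, r in ratings.items() if u not in distrusted}
    return trust_weighted_mean(kept, trust)
```

With ratings `{"ann": 5, "bob": 1, "cy": 4}` and trust `{"ann": 1.0, "bob": 0.5, "cy": 0.5}`, the plain weighted mean is 3.75. Filtering out a distrusted `"bob"` raises the prediction, which shows how distrust information can change a recommendation.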
A measurement study of insecure JavaScript practices on the web (Article 7)
  Chuan Yue; Haining Wang
JavaScript is an interpreted programming language most often used for enhancing webpage interactivity and functionality. It has powerful capabilities to interact with webpage documents and browser windows; however, it has also opened the door for many browser-based security attacks. Insecure engineering practices of using JavaScript may not directly lead to security breaches, but they can create new attack vectors and greatly increase the risks of browser-based attacks. In this article, we present the first measurement study on insecure practices of using JavaScript on the Web. Our focus is on the insecure practices of JavaScript inclusion and dynamic generation, and we examine their severity and nature on 6,805 unique websites. Our measurement results reveal that insecure JavaScript practices are common at various websites: (1) at least 66.4% of the measured websites manifest the insecure practices of including JavaScript files from external domains into the top-level documents of their webpages; (2) over 44.4% of the measured websites use the dangerous eval() function to dynamically generate and execute JavaScript code on their webpages; and (3) in JavaScript dynamic generation, using the document.write() method and the innerHTML property is much more popular than using the relatively secure technique of creating script elements via DOM methods. Our analysis indicates that safe alternatives to these insecure practices exist in common cases and ought to be adopted by website developers and administrators for reducing potential security risks.
Understanding query interfaces by statistical parsing (Article 8)
  Weifeng Su; Hejun Wu; Yafei Li; Jing Zhao; Frederick H. Lochovsky; Hongmin Cai; Tianqiang Huang
Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.
A language for end-user web augmentation: Caring for producers and consumers alike (Article 9)
  Oscar Díaz; Cristóbal Arellano; Maider Azanza
Web augmentation is to the Web what augmented reality is to the physical world: layering relevant content/layout/navigation over the existing Web to customize the user experience. This is achieved through JavaScript (JS) using browser weavers (e.g., Greasemonkey). To date, over 43 million downloads of Greasemonkey scripts attest to the vitality of this movement. However, Web augmentation is hindered by being programming intensive and prone to malware. This prevents end-users from participating as both producers and consumers of scripts: producers need to know JS, consumers need to trust JS. This article aims at promoting end-user participation in both roles. The vision is for end-users to prosume (the act of simultaneously caring for producing and consuming) scripts as easily as they currently prosume their pictures or videos. Encouraging production requires more "natural" and abstract constructs. Promoting consumption calls for augmentation scripts to be easier to understand, share, and trust. To this end, we explore the use of Domain-Specific Languages (DSLs) by introducing Sticklet. Sticklet is an internal DSL on JS, where JS generality is reduced for the sake of learnability and reliability. Specifically, Web augmentation is conceived as fixing in existing web sites (i.e., the wall) HTML fragments extracted from either other sites or Web services (i.e., the stickers). Sticklet targets hobby programmers as producers, and computer literates as consumers. From a producer perspective, benefits are threefold. As a restricted grammar on top of JS, Sticklet expressions are domain oriented and more declarative than their JS counterparts, hence speeding up development. As syntactically correct JS expressions, Sticklet scripts can be installed as traditional scripts and hence, programmers can continue using existing JS tools. As declarative expressions, they are easier to maintain, and amenable to optimization. From a consumer perspective, domain specificity brings understandability (due to declarativeness), reliability (due to built-in security), and "consumability" (i.e., installation/enactment/sharing of Sticklet expressions are tuned to the shortage of time and skills of the target audience). Preliminary evaluations indicate that 77% of the subjects were able to develop new Sticklet scripts in less than thirty minutes while 84% were able to consume these scripts in less than ten minutes. Sticklet is available to download as a Mozilla add-on.
Coordinating the web of services for a smart home (Article 10)
  Eirini Kaldeli; Ehsan Ullah Warriach; Alexander Lazovik; Marco Aiello
Domotics, concerned with the realization of intelligent home environments, is a novel field which can highly benefit from solutions inspired by service-oriented principles to enhance the convenience and security of modern home residents. In this work, we present an architecture for a smart home, starting from the lower device interconnectivity level up to the higher application layers that undertake the load of complex functionalities and provide a number of services to end-users. We claim that in order for smart homes to exhibit a genuinely intelligent behavior, the ability to compute compositions of individual devices automatically and dynamically is paramount. To this end, we incorporate into the architecture a composition component that employs artificial intelligence domain-independent planning to generate compositions at runtime, in a constantly evolving environment. We have implemented a fully working prototype that realizes such an architecture, and have evaluated it both in terms of performance as well as from the end-user point of view. The results of the evaluation show that the service-oriented architectural design and the support for dynamic compositions is quite efficient from the technical point of view, and that the system succeeds in satisfying the expectations and objectives of the users.
Assessing relevance and trust of the deep web sources and results based on inter-source agreement (Article 11)
  Raju Balakrishnan; Subbarao Kambhampati; Manishkumar Jha
Deep web search engines face the formidable challenge of retrieving high-quality results from the vast collection of searchable databases. Deep web search is a two-step process of selecting the high-quality sources and ranking the results from the selected sources. Though there are existing methods for both steps, they assess the relevance of the sources and the results using the query-result similarity. When applied to the deep web these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, the query-based relevance does not consider the importance of the results and sources. These two considerations are essential for the deep web and open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreements between these answers are helpful in assessing the importance and the trustworthiness of the sources and the results. For assessing source quality, we compute the agreement between the sources as the agreement of the answers returned. While computing the agreement, we also measure and compensate for the possible collusion between the sources. This adjusted agreement is modeled as a graph with sources at the vertices. On this agreement graph, a quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Further extending SourceRank to multidomain search, we propose a source ranking sensitive to the query domains. Multiple domain-specific rankings of a source are computed, and these ranks are combined for the final ranking. We perform extensive evaluations on online and hundreds of Google Base sources spanning across domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
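The idea of scoring a source by the stationary visit probability of a random walk on an agreement graph can be sketched with plain power iteration. The Python code below is a minimal illustration, not the SourceRank implementation: the matrix `agreement`, the function name, and the damping (teleport) step borrowed from PageRank-style smoothing are all assumptions.

```python
def stationary_distribution(agreement, damping=0.85, iterations=100):
    """Power iteration for a random walk on a weighted agreement graph.

    agreement[i][j] is the (hypothetical) agreement weight of the edge from
    source i to source j. The damping/teleport step is an assumption added
    so that the walk converges on any graph.
    """
    n = len(agreement)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(agreement):
            out_weight = sum(row)
            for j, w in enumerate(row):
                # A source with no outgoing agreement spreads its mass evenly.
                share = w / out_weight if out_weight > 0 else 1.0 / n
                new_rank[j] += damping * rank[i] * share
        rank = new_rank
    return rank
```

In a three-source graph where two sources agree strongly with each other and only weakly with the third, the third source ends up with the lowest score, matching the intuition that agreement from other sources signals quality.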

TWEB 2013-09 Volume 7 Issue 3

A feature-word-topic model for image annotation and retrieval (Article 12)
  Cam-Tu Nguyen; Natsuda Kaothanthong; Takeshi Tokuyama; Xuan-Hieu Phan
Image annotation is a process of finding appropriate semantic labels for images in order to obtain a more convenient way for indexing and searching images on the Web. This article proposes a novel method for image annotation based on combining feature-word distributions, which map from visual space to word space, and word-topic distributions, which form a structure to capture label relationships for annotation. We refer to this type of model as Feature-Word-Topic models. The introduction of topics allows us to efficiently take word associations, such as {ocean, fish, coral} or {desert, sand, cactus}, into account for image annotation. Unlike previous topic-based methods, we do not consider topics as joint distributions of words and visual features, but as distributions of words only. Feature-word distributions are utilized to define weights in computation of topic distributions for annotation. By doing so, topic models in text mining can be applied directly in our method. Our Feature-word-topic model, which exploits Gaussian Mixtures for feature-word distributions, and probabilistic Latent Semantic Analysis (pLSA) for word-topic distributions, shows that our method is able to obtain promising results in image annotation and retrieval.
Improving contextual advertising by adopting collaborative filtering (Article 13)
  Eloisa Vargiu; Alessandro Giuliani; Giuliano Armano
Contextual advertising can be viewed as an information filtering task aimed at selecting suitable ads to be suggested to the final "user", that is, the Web page in hand. Starting from this insight, in this article we propose a novel system, which adopts a collaborative filtering approach to perform contextual advertising. In particular, given a Web page, the system relies on collaborative filtering to classify the page content and to suggest suitable ads accordingly. Useful information is extracted from "inlinks", that is, similar pages that link to the Web page in hand. In so doing, collaborative filtering is used in a content-based setting, giving rise to a hybrid contextual advertising system. The implemented system was tested on about 15,000 Web pages extracted from the Open Directory Project. Comparative experiments with a content-based system have been performed. The corresponding results highlight that the proposed system performs better. A suitable case study is also provided to enable the reader to better understand how the system works and its effectiveness.
Virtual private social networks and a Facebook implementation (Article 14)
  Mauro Conti; Arbnor Hasani; Bruno Crispo
The popularity of Social Networking Sites (SNS) is growing rapidly, with the largest sites serving hundreds of millions of users and their private information. The privacy settings of these SNSs do not allow the user to avoid sharing some information (e.g., name and profile picture) with all the other users. Also, no matter the privacy settings, this information is always shared with the SNS (that could sell this information or be hacked). To mitigate these threats, we recently introduced the concept of Virtual Private Social Networks (VPSNs).
   In this work we propose the first complete architecture and implementation of VPSNs for Facebook. In particular, we address an important problem left unexplored in our previous research -- that is the automatic propagation of updated profiles to all the members of the same VPSN. Furthermore, we made an in-depth study on performance and implemented several optimizations to reduce the impact of VPSN on user experience.
   The proposed solution is lightweight, completely distributed, does not depend on the collaboration from Facebook, does not have a central point of failure, offers (with some limitations) the same functionality as Facebook, and apart from some simple settings, is almost transparent to the user. Through experiments, with an extended set of parameters, we have confirmed the feasibility of the proposal and have shown a very limited time-overhead experienced by the user while browsing Facebook pages.
A term-based inverted index partitioning model for efficient distributed query processing (Article 15)
  B. Barla Cambazoglu; Enver Kayaaslan; Simon Jonassen; Cevdet Aykanat
In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is either document-based or term-based partitioned. This choice is made depending on the properties of the underlying hardware infrastructure, query traffic distribution, and some performance and availability constraints. In query processing on retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead due to the transfer of large amounts of data from the index servers forms a major performance bottleneck, deteriorating the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from the past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.
The parallel path framework for entity discovery on the web (Article 16)
  Tim Weninger; Thomas J. Johnston; Jiawei Han
It has been a dream of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content in some sites, the schemas of these databases are rarely consistent across a domain. This makes the comparison and aggregation of information from different domains difficult. We aim to make an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., university department and faculty member home page) we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, office numbers). To do this, we propose a Web structure mining method which grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
Semantic content-based recommendation of software services using context (Article 17)
  Liwei Liu; Freddy Lecue; Nikolay Mehandjiev
The current proliferation of software services means users should be supported when selecting one service out of the many which meet their needs. Recommender Systems provide such support for selecting products and conventional services, yet their direct application to software services is not straightforward, because of the current scarcity of available user feedback, and the need to fine-tune software services to the context of intended use. In this article, we address these issues by proposing a semantic content-based recommendation approach that analyzes the context of intended service use to provide effective recommendations in conditions of scarce user feedback. The article ends with two experiments based on a realistic set of semantic services. The first experiment demonstrates how the proposed semantic content-based approach can produce effective recommendations using semantic reasoning over service specifications by comparing it with three other approaches. The second experiment demonstrates the effectiveness of the proposed context analysis mechanism by comparing the performance of both context-aware and plain versions of our semantic content-based approach, benchmarked against user-performed selection informed by context.

TWEB 2013-10 Volume 7 Issue 4

Understanding latent interactions in online social networks (Article 18)
  Jing Jiang; Christo Wilson; Xiao Wang; Wenpeng Sha; Peng Huang; Yafei Dai; Ben Y. Zhao
Popular online social networks (OSNs) like Facebook and Twitter are changing the way users communicate and interact with the Internet. A deep understanding of user interactions in OSNs can provide important insights into questions of human social behavior and into the design of social platforms and applications. However, recent studies have shown that a majority of user interactions on OSNs are latent interactions, that is, passive actions, such as profile browsing, that cannot be observed by traditional measurement techniques.
   In this article, we seek a deeper understanding of both active and latent user interactions in OSNs. For quantifiable data on latent user interactions, we perform a detailed measurement study on Renren, the largest OSN in China with more than 220 million users to date. All friendship links in Renren are public, allowing us to exhaustively crawl a connected graph component of 42 million users and 1.66 billion social links in 2009. Renren also keeps detailed, publicly viewable visitor logs for each user profile. We capture detailed histories of profile visits over a period of 90 days for users in the Peking University Renren network and use statistics of profile visits to study issues of user profile popularity, reciprocity of profile visits, and the impact of content updates on user popularity. We find that latent interactions are much more prevalent and frequent than active events, are nonreciprocal in nature, and that profile popularity is correlated with page views of content rather than with quantity of content updates. Finally, we construct latent interaction graphs as models of user browsing behavior and compare their structural properties, evolution, community structure, and mixing times against those of both active interaction graphs and social graphs.
A bottom-up, knowledge-aware approach to integrating and querying web data services (Article 19)
  Silvia Quarteroni; Marco Brambilla; Stefano Ceri
As a wealth of data services is becoming available on the Web, building and querying Web applications that effectively integrate their content is increasingly important. However, schema integration and ontology matching with the aim of registering data services often requires a knowledge-intensive, tedious, and error-prone manual process.
   We tackle this issue by presenting a bottom-up, semi-automatic service registration process that refers to an external knowledge base and uses simple text processing techniques in order to minimize and possibly avoid the contribution of domain experts in the annotation of data services. The first by-product of this process is a representation of the domain of data services as an entity-relationship diagram, whose entities are named after concepts of the external knowledge base matching service terminology rather than being manually created to accommodate an application-specific ontology. Second, a three-layer annotation of service semantics (service interfaces, access patterns, service marts) describing how services "play" with such domain elements is also automatically constructed at registration time. When evaluated against heterogeneous existing data services and with a synthetic service dataset constructed using Google Fusion Tables, the approach yields good results in terms of data representation accuracy.
   We subsequently demonstrate that natural language processing methods can be used to decompose and match simple queries to the data services represented in three layers according to the preceding methodology with satisfactory results. We show how semantic annotations are used at query time to convert the user's request into an executable logical query. Globally, our findings show that the proposed registration method is effective in creating a uniform semantic representation of data services, suitable for building Web applications and answering search queries.
Web browsing behavior analysis and interactive hypervideo (Article 20)
  Luis A. Leiva; Roberto Vivó
Processing data on any sort of user interaction is well known to be cumbersome and mostly time consuming. In order to assist researchers in easily inspecting fine-grained browsing data, current tools usually display user interactions as mouse cursor tracks, a video-like visualization scheme. However, to date, traditional online video inspection has not explored the full capabilities of hypermedia and interactive techniques. In response to this need, we have developed SMT2ε, a Web-based tracking system for analyzing browsing behavior using feature-rich hypervideo visualizations. We compare our system to related work in academia and the industry, showing that ours features unprecedented visualization capabilities. We also show that SMT2ε efficiently captures browsing data and is perceived by users to be both helpful and usable. A series of prediction experiments illustrate that raw cursor data are accessible and can be easily handled, providing evidence that the data can be used to construct and verify research hypotheses. Considering its limitations, it is our hope that SMT2ε will assist researchers, usability practitioners, and other professionals interested in understanding how users browse the Web.
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model BIBAFull-Text 21
  Lidong Bing; Wai Lam; Tak-Lam Wong
Web data record extraction aims at extracting a set of similar object records from a single webpage. These records have similar attributes or fields and are presented with a regular format in a coherent region of the page. To tackle this problem, most existing works analyze the DOM tree of an input page. One major limitation of these methods is that they lack a global view when detecting data records in an input page, which results in myopic decisions. Their brute-force search for various types of records also degrades flexibility and robustness. We propose a Structure-Knowledge-Oriented Global Analysis (Skoga) framework which can perform robust detection of different kinds of data records and record regions. The major component of the Skoga framework is a DOM structure-knowledge-driven detection model which can conduct a global analysis on the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge as well as statistical knowledge capturing different characteristics of data records and record regions, as exhibited in the DOM structure. The background knowledge encodes the semantics of labels indicating general constituents of data records and regions. The statistical knowledge is represented by some carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined using a development dataset via a parameter estimation algorithm based on a structured output support vector machine. An optimization method based on the divide-and-conquer principle is developed making use of the DOM structure knowledge to quantitatively infer and recognize appropriate records and regions for a page. Extensive experiments have been conducted on four datasets. The experimental results demonstrate that our framework achieves higher accuracy compared with state-of-the-art methods.
A vlHMM approach to context-aware search BIBAFull-Text 22
  Zhen Liao; Daxin Jiang; Jian Pei; Yalou Huang; Enhong Chen; Huanhuan Cao; Hang Li
Capturing the context of a user's query from the previous queries and clicks in the same session leads to a better understanding of the user's information need. A context-aware approach to document reranking, URL recommendation, and query suggestion may substantially improve users' search experience. In this article, we propose a general approach to context-aware search by learning a variable-length hidden Markov model (vlHMM) from search sessions extracted from log data. While the mathematical model is powerful, the huge amounts of log data present great challenges. We develop several distributed learning techniques to learn a very large vlHMM under the map-reduce framework. Moreover, we construct feature vectors for each state of the vlHMM model to handle users' novel queries not covered by the training data. We test our approach on a raw dataset consisting of 1.9 billion queries, 2.9 billion clicks, and 1.2 billion search sessions before filtering, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document reranking, query suggestion, and URL recommendation. The experimental results validate the effectiveness of vlHMM in the applications of document reranking, URL recommendation, and query suggestion.
Captions and biases in diagnostic search BIBAFull-Text 23
  Ryen W. White; Eric Horvitz
People frequently turn to the Web with the goal of diagnosing medical symptoms. Studies have shown that diagnostic search can often lead to anxiety about the possibility that symptoms are explained by the presence of rare, serious medical disorders, rather than far more common benign syndromes. We study the influence of the appearance of potentially-alarming content, such as severe illnesses or serious treatment options associated with the queried-for symptoms, in captions comprising titles, snippets, and URLs. We explore whether users are drawn to results with potentially-alarming caption content, and if so, the implications of such attraction for the design of search engines. We specifically study the influence of the content of search result captions shown in response to symptom searches on search-result click-through behavior. We show that users are significantly more likely to examine and click on captions containing potentially-alarming medical terminology such as "heart attack" or "medical emergency" independent of result rank position and well-known positional biases in users' search examination behaviors. The findings provide insights about the possible effects of displaying implicit correlates of searchers' goals in search-result captions, such as unexpressed concerns and fears. As an illustration of the potential utility of these results, we developed and evaluated an enhanced click prediction model that incorporates potentially-alarming caption features and show that it significantly outperforms models that ignore caption content. Beyond providing additional understanding of the effects of Web content on medical concerns, the methods and findings have implications for search engine design.
   As part of our discussion on the implications of this research, we propose procedures for generating more representative captions that may be less likely to cause alarm, as well as methods for learning to more appropriately rank search results from logged search behavior, for example, by also considering the presence of potentially-alarming content in the captions that motivate observed clicks and down-weighting clicks seemingly driven by searchers' health anxieties.
Semantic contextual advertising based on the open directory project BIBAFull-Text 24
  Jung-Hyun Lee; Jongwoo Ha; Jin-Yong Jung; Sangkeun Lee
Contextual advertising seeks to place relevant textual ads within the content of generic webpages. In this article, we explore a novel semantic approach to contextual advertising. This consists of three tasks: (1) building a well-organized hierarchical taxonomy of topics, (2) developing a robust classifier for effectively finding the topics of pages and ads, and (3) ranking ads based on the topical relevance to pages. First, we heuristically build our own taxonomy of topics from the Open Directory Project (ODP). Second, we investigate how to increase classification accuracy by taking the unique characteristics of the ODP into account. Last, we measure the topical relevance of ads by applying a link analysis technique to the similarity graph carefully derived from our taxonomy. Experiments show that our classification method improves the performance of Ma-F1 by as much as 25.7% over the baseline classifier. In addition, our ranking method enhances the relevance of ads substantially, up to 10% in terms of precision at k, compared to a representative strategy.